Why do people focus on standardized mean difference, K-S tests, etc., to evaluate propensity score models?

The purpose of weighting is not to achieve balance in the simple sense of the word. Weighting, which in my view is problematic in any context, overemphasizes persons who were unlikely to be treated and underemphasizes those who were likely to be treated. That causes an important loss of effective sample size and hence of statistical efficiency. But this could be said to be balancing in effect.
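To make the loss of effective sample size concrete, here is a minimal sketch. It assumes Kish's approximation, ESS = (Σw)² / Σw², which is one common way to quantify the loss and is my choice for illustration; the propensity scores are made-up values for six treated subjects.

```python
def effective_sample_size(weights):
    """Kish's approximation (an assumption here): (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq

# Hypothetical propensity scores for six treated subjects.
propensity = [0.9, 0.8, 0.5, 0.5, 0.2, 0.1]

# Inverse probability of treatment weights: subjects who were
# unlikely to be treated (small propensity) get the largest weights.
ipw = [1.0 / p for p in propensity]

equal_ess = effective_sample_size([1.0] * len(propensity))  # no weighting
ipw_ess = effective_sample_size(ipw)                        # after weighting

print(f"ESS with equal weights: {equal_ess:.2f}")
print(f"ESS with IPW weights:   {ipw_ess:.2f}")
```

With these numbers the two subjects with propensities of 0.2 and 0.1 dominate the weighted sample, so six observations carry the information of roughly three and a half.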

Standardized mean differences are not smart. If a propensity score model is properly fitted, every variable in it must be balanced by definition when you look at balance one variable at a time. So there is no need to look at mean differences. The only reason for any imbalance would be that you treated continuous variables as linear in the propensity score when they should have been modeled nonlinearly. Mean differences also lull researchers into a false belief that the tails of the distribution are not important. Kolmogorov-Smirnov-like assessments can fix that problem; just look at the two empirical CDFs.
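A small sketch of the point about tails: two groups can have a standardized mean difference of zero while their empirical CDFs differ substantially. The ages are made up, the pooled-SD form of the standardized difference is one common convention, and weights are left out to keep the example short.

```python
from math import sqrt
from statistics import mean, variance

def standardized_mean_difference(a, b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = sqrt((variance(a) + variance(b)) / 2)
    return (mean(a) - mean(b)) / pooled_sd

def ks_statistic(a, b):
    """Largest vertical gap between the two empirical CDFs."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    grid = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in grid)

# Hypothetical ages: same mean, very different spread.
treated = [30, 40, 50, 60, 70]
control = [48, 49, 50, 51, 52]

smd = standardized_mean_difference(treated, control)
ks = ks_statistic(treated, control)

print(f"standardized mean difference: {smd:.3f}")
print(f"max ECDF gap (K-S statistic): {ks:.3f}")
```

The mean difference is exactly zero here, yet the two empirical CDFs are 0.4 apart at their widest point, which is the kind of discrepancy a mean-only check never shows.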

If age is imbalanced among females but you don’t see an imbalance in either age or sex overall, that means you should have had an age × sex interaction in the propensity model.
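Here is a toy illustration of that situation, with made-up records: age and sex each look perfectly balanced on their own, but age is badly imbalanced within each sex. The helper names are hypothetical.

```python
from statistics import mean

# Hypothetical (age, sex) records for each group.
treated = [(70, "F"), (60, "F"), (30, "M"), (40, "M")]
control = [(30, "F"), (40, "F"), (70, "M"), (60, "M")]

def mean_age(rows, sex=None):
    """Mean age, optionally restricted to one sex."""
    return mean(age for age, s in rows if sex is None or s == sex)

def share_female(rows):
    """Proportion of records that are female."""
    return sum(s == "F" for _, s in rows) / len(rows)

# One-variable-at-a-time checks: both gaps are zero.
overall_age_gap = mean_age(treated) - mean_age(control)
sex_gap = share_female(treated) - share_female(control)

# Within-sex checks: large gaps in opposite directions.
female_age_gap = mean_age(treated, "F") - mean_age(control, "F")
male_age_gap = mean_age(treated, "M") - mean_age(control, "M")

print(overall_age_gap, sex_gap, female_age_gap, male_age_gap)
```

Treated females are 30 years older than control females and treated males are 30 years younger than control males, and the two gaps cancel in the marginal comparison.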

All of this raises the question of why propensity score analysis is done so often. It’s a mystery to me.
