Why do people focus on standardized mean difference, K-S tests, etc., to evaluate propensity score models?

The literature on evaluating propensity score model specification tends to focus on distributional difference metrics (e.g., Kolmogorov-Smirnov) and standardized mean differences. Yet when I build a propensity score model, I usually use the predicted propensity scores to construct stabilized inverse propensity score weights. For this reason, my instinct for evaluating the propensity score model would be to look at the optimism-adjusted calibration curve (if p << n), a fine-grained histogram of the predicted propensity scores, and maybe also the discrimination accuracy. For comparing different models, I’d use test statistics based on the log score, and for non-nested models (or even nested ones), I’d compare their log scores or Brier scores.
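To make the quantities in this post concrete, here is a minimal sketch in plain Python. It assumes you already have a 0/1 treatment indicator and predicted propensity scores; the function names and the toy data are made up for illustration, and the calibration table is the raw (apparent) one, not the optimism-adjusted version, which would need bootstrap resampling on top of this.

```python
import math

def stabilized_weights(treated, ps):
    """Stabilized IPW: P(A = a) / P(A = a | X) for each subject."""
    p_treat = sum(treated) / len(treated)  # marginal treatment probability
    return [
        p_treat / p if a == 1 else (1 - p_treat) / (1 - p)
        for a, p in zip(treated, ps)
    ]

def brier_score(treated, ps):
    """Mean squared error of the predicted propensity (lower is better)."""
    return sum((p - a) ** 2 for a, p in zip(treated, ps)) / len(treated)

def log_score(treated, ps):
    """Mean log-likelihood of the observed treatment (higher is better)."""
    total = sum(math.log(p if a == 1 else 1 - p) for a, p in zip(treated, ps))
    return total / len(treated)

def calibration_table(treated, ps, n_bins=10):
    """(mean predicted, observed fraction treated) per non-empty equal-width bin."""
    bins = [[] for _ in range(n_bins)]
    for a, p in zip(treated, ps):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, a))
    return [
        (sum(p for p, _ in b) / len(b), sum(a for _, a in b) / len(b))
        for b in bins
        if b
    ]

# Hypothetical toy data: 8 subjects, two distinct predicted propensities.
treated = [1, 1, 1, 0, 1, 0, 0, 0]
ps = [0.8, 0.8, 0.8, 0.8, 0.2, 0.2, 0.2, 0.2]

weights = stabilized_weights(treated, ps)
print(weights)
print(brier_score(treated, ps), log_score(treated, ps))
print(calibration_table(treated, ps, n_bins=2))
```

In the toy data the observed treated fractions (0.25 and 0.75) sit close to the predicted values (0.2 and 0.8), which is what a roughly calibrated model looks like in the table.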

Why indeed. But I think the literature on how much overfitting you can afford when doing propensity weighting is probably not sufficient at present. We do know, however, that the uncertainty in the weights needs to be formally taken into account, and most practitioners don’t do that.


How would you account for uncertainty in the weights? What is the procedure you would use?

I’ve heard people talking about it but haven’t looked into it myself. It may require Bayesian modeling but I’m not sure. To me this complexity is a reason to not use propensity scores.

Is it not because the ultimate aim is covariate balance, so that a standardised mean difference tells us whether the derived weights have adequately achieved it? Weights with good statistical properties may not be useful if no real balance was achieved.

The purpose of weighting is not to achieve balance in the simple sense of the word. Weighting, which is problematic in any context in my view, overemphasizes persons who were unlikely to be treated, which causes underemphasis of those who were likely to be treated, thereby causing an important loss of effective sample size and thus a loss of statistical efficiency. But this could be said to be balancing in effect.
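The loss of effective sample size can be put into numbers. The sketch below uses Kish's approximation, (sum of weights)^2 / (sum of squared weights), which is one common choice and is my assumption here rather than something specified in the post above; the weights are hypothetical.

```python
def effective_sample_size(weights):
    """Kish's approximate effective sample size for a set of weights."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq

# Equal weights lose nothing: the effective size equals the actual size.
equal = [1.0, 1.0, 1.0, 1.0]

# Hypothetical weights where one subject who was unlikely to receive
# the treatment they got is up-weighted four-fold relative to the rest.
uneven = [0.625, 0.625, 0.625, 2.5]

print(effective_sample_size(equal))   # 4.0
print(effective_sample_size(uneven))  # about 2.58 (exactly 49/19)
```

In practice this is usually computed separately within each treatment group, so you can see which arm is paying the efficiency price.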

Standardized mean differences are not smart. If a propensity score model is properly fitted, all the variables must be balanced by definition if you only look at one-variable-at-a-time balance. So there is no need to look at mean differences. The only reason for any imbalance would be that you treated continuous variables as linear in the propensity score when they should have been modeled nonlinearly. Mean differences also lull researchers into a false belief that the tails of the distribution are not important. Kolmogorov-Smirnov-like assessments can fix that problem; just look at the two empirical CDFs.
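A small sketch of the point about tails, in plain Python. The function names are hypothetical, the SMD uses weighted variances in a pooled denominator (conventions differ between papers and packages), and the K-S-like quantity is simply the largest gap between the two weighted empirical CDFs. The toy ages are invented so that the means match while the spread does not.

```python
import math

def weighted_mean(x, w):
    return sum(xi * wi for xi, wi in zip(x, w)) / sum(w)

def weighted_var(x, w):
    m = weighted_mean(x, w)
    return sum(wi * (xi - m) ** 2 for xi, wi in zip(x, w)) / sum(w)

def standardized_mean_difference(x1, w1, x0, w0):
    """Weighted mean difference divided by a pooled weighted SD."""
    pooled_sd = math.sqrt((weighted_var(x1, w1) + weighted_var(x0, w0)) / 2)
    return (weighted_mean(x1, w1) - weighted_mean(x0, w0)) / pooled_sd

def weighted_ecdf(x, w, t):
    """Weighted fraction of observations with value <= t."""
    return sum(wi for xi, wi in zip(x, w) if xi <= t) / sum(w)

def ks_distance(x1, w1, x0, w0):
    """Largest absolute gap between the two weighted empirical CDFs."""
    grid = sorted(set(x1) | set(x0))
    return max(
        abs(weighted_ecdf(x1, w1, t) - weighted_ecdf(x0, w0, t)) for t in grid
    )

# Hypothetical ages: same mean (50) in both groups, wider tails in controls.
age_treated, w_treated = [40, 50, 60], [1.0, 1.0, 1.0]
age_control, w_control = [30, 50, 70], [1.0, 1.0, 1.0]

smd = standardized_mean_difference(age_treated, w_treated, age_control, w_control)
ks = ks_distance(age_treated, w_treated, age_control, w_control)
print(smd, ks)
```

Here the SMD is zero while the CDF gap is one third, so a mean-only check would report perfect balance on a variable whose distributions clearly differ.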

If there is an imbalance in age for females but you don’t see an imbalance on either age or sex, that means you should have had an age × sex interaction in the propensity model.
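A toy illustration of how this can happen, with invented records (not real data): age and sex each look perfectly balanced on their own, yet age is badly imbalanced within females.

```python
def mean(values):
    return sum(values) / len(values)

# Hypothetical (age, sex) records for each treatment group.
treated = [(60, "F"), (60, "F"), (40, "M"), (40, "M")]
control = [(40, "F"), (40, "F"), (60, "M"), (60, "M")]

def mean_age(group, sex=None):
    """Mean age overall, or within one sex if `sex` is given."""
    return mean([age for age, s in group if sex is None or s == sex])

def frac_female(group):
    return mean([1.0 if s == "F" else 0.0 for _, s in group])

overall_age_gap = mean_age(treated) - mean_age(control)
sex_gap = frac_female(treated) - frac_female(control)
female_age_gap = mean_age(treated, "F") - mean_age(control, "F")

print(overall_age_gap)  # 0.0: age looks balanced marginally
print(sex_gap)          # 0.0: sex looks balanced marginally
print(female_age_gap)   # 20.0: age is imbalanced among females
```

One-variable-at-a-time checks pass here, so only a within-subgroup look (or an interaction term in the propensity model) reveals the problem.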

All of this raises the question of why propensity score analysis is done so often. It’s a mystery to me.

Hernan and Robins have a good chapter that motivates IPSW in their new book What If. In the same book, they compare model-based IPSW to model-based standardization, describe their key drawbacks, and explain when you might prefer one over the other.

But why use any of them?

You are proposing to just use outcome regression then?

Yes, unless you need to add a confounding score (logit PS) to the outcome regression. If the sample size is ample, you can do regular direct covariate adjustment and ignore PS. If the sample size does not allow for adjustment for all the covariates you think may be related to treatment selection and to outcome, you can use penalized estimation, still within the outcome model framework: https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.7021 . But I see a lot of folks developing PS when the sample size is perfectly adequate for ordinary unpenalized covariate adjustment.


Using regularization in this situation is really interesting; that paper was a good read! Would the same strategy work for time-to-event data, e.g. with the regularized Cox regression functionality in R’s glmnet package?

Yes definitely. Use the L_2 (ridge) penalty.
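For reference, a sketch of what a ridge-penalized Cox fit maximizes, written in generic notation that is my own assumption rather than anything from the linked paper: $x_i$ are covariates, $\delta_i$ is the event indicator, $R(t_i)$ is the risk set at event time $t_i$, $\lambda \ge 0$ is the penalty strength, and ties are ignored.

$$
\ell_\lambda(\beta) = \sum_{i:\,\delta_i = 1} \left[ x_i^\top \beta - \log \sum_{j \in R(t_i)} \exp\!\left(x_j^\top \beta\right) \right] - \frac{\lambda}{2} \lVert \beta \rVert_2^2
$$

Software scales the likelihood and penalty terms differently, so the same numeric $\lambda$ is not comparable across packages. In glmnet, the ridge case corresponds to `alpha = 0` with `family = "cox"`.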