Propensity score matching with missing data


Hi everyone

I’m posting to ask for references regarding conducting a causal inference analysis with missing outcome data. The context is an analysis to estimate the effect of lipid lowering treatment on LDL-cholesterol (as % reduction from baseline) using routine health care data from diagnosis of familial hypercholesterolaemia to 2 years follow-up. We have longitudinal data; however, as it is routine data, not all patients have LDL-cholesterol measurements at baseline (i.e. the diagnosis date) and at follow-up time (2 years after diagnosis date).

We’re thinking about doing multiple imputation to predict the missing LDL-cholesterol given the observed LDL-cholesterol measurements, other lipid tests, characteristics, past and future CV events. To estimate the effect of treatment on LDL-cholesterol, we’re thinking about propensity score matching; the propensity score estimation would include the characteristics at diagnosis which we think determine the LDL-cholesterol reduction and the decision to treat. This means that the PS estimation would include the multiply-imputed LDL-cholesterol at baseline.

I’d be grateful for some advice and/or references on:

  1. Whether it is appropriate to conduct PSM given the missing data and, if so, how. We’re not completely sure (a) whether this is a good idea and (b) whether there are specific issues to be aware of. Do you know of any good papers that discuss this?

  2. Whether PSM is the appropriate technique given that we expect that many of the characteristics that affect the % LDL-cholesterol reduction are only relevant for the treated group as they are treatment modifiers (i.e. interaction effects) rather than prognostic (i.e. main effects). Again, do you know of any good papers about this?

Thank you very much in advance!

Best wishes



I hope that others have direct experience with this problem and will respond. This kind of question makes me even more a fan of a unified Bayesian approach that does not need any imputation. But I have heard that in a propensity model it may work to include an indicator for missingness of the baseline variable, unlike in an outcome model, where the indicator variable approach is a complete disaster. I wish I had a reference for this in the propensity framework.
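
A minimal sketch of what the missing-indicator idea could look like in a propensity model, on simulated data. Everything in it is an assumption made for illustration: the variable names, the 30% missingness, the treatment mechanism, the mean fill, and the hand-rolled logistic fit (in practice a statistical package would do the fitting). It shows the mechanics only and says nothing about whether the approach is valid for a given analysis.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, n_iter=500):
    """Logistic regression by batch gradient ascent; beta[0] is the intercept."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            z = beta[0] + sum(b * x for b, x in zip(beta[1:], xi))
            resid = yi - sigmoid(z)
            grad[0] += resid
            for j, x in enumerate(xi):
                grad[j + 1] += resid * x
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def add_missing_indicator(values):
    """Fill missing values with the observed mean and flag them with 1.0."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [[fill if v is None else v, 1.0 if v is None else 0.0] for v in values]

# Simulated (hypothetical) data: standardized baseline LDL-cholesterol,
# unmeasured for roughly 30% of patients.
random.seed(1)
n = 300
ldl_obs = [None if random.random() < 0.3 else random.gauss(0.0, 1.0) for _ in range(n)]

def treat_logit(v):
    # Assumed mechanism: the treatment decision uses the measured value when
    # there is one, and otherwise depends only on the fact that it is missing.
    return 0.5 if v is None else -0.2 + 0.8 * v

treated = [1 if random.random() < sigmoid(treat_logit(v)) else 0 for v in ldl_obs]

# Propensity model with two covariates: filled LDL and the missingness flag.
X = add_missing_indicator(ldl_obs)
beta = fit_logistic(X, treated)
ps = [sigmoid(beta[0] + beta[1] * x[0] + beta[2] * x[1]) for x in X]
```

The resulting `ps` values would then feed into matching or weighting as usual. Because the fill value is constant, the coefficient on the flag absorbs the difference in treatment probability between patients with and without a baseline measurement.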



I’m not generally comfortable with the idea of imputing missing outcome data, so I will leave that point to others. Regarding imputing the baseline levels of the outcome variable, I think it will be very important to think about why some patients might not have measurements at baseline and whether any imputation technique is capable of addressing the underlying issue.

I can, however, provide some references about using propensity scores with missing data in general:

  1. Rosenbaum, P. R. (2010). Design of Observational Studies (Springer Series in Statistics). Springer. See pp. 193–194, 240–242

  2. Seaman, S. R., White, I. R., Copas, A. J., & Li, L. (2012). Combining multiple imputation and inverse-probability weighting. Biometrics, 68(1), 129–137.

  3. Cham, H., & West, S. G. (2016). Propensity score analysis with missing data. Psychological Methods, 21(3), 427–445.
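
One option for combining multiple imputation with a propensity score analysis is to run the whole analysis (propensity model, matching, effect estimate) separately in each imputed data set and then pool the per-data-set estimates with Rubin's rules. The pooling step can be sketched as below. The function name is my own and the numbers are made up purely to show the arithmetic; they are not results from any study.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances using Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)         # pooled point estimate
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Hypothetical treatment-effect estimates (% LDL-cholesterol reduction) and
# their variances from m = 5 imputed data sets.
estimates = [-30.0, -32.0, -28.0, -31.0, -29.0]
variances = [4.0, 4.0, 4.0, 4.0, 4.0]

pooled, total_var = pool_rubin(estimates, variances)
print(pooled, total_var)
```

The between-imputation term is what carries the extra uncertainty due to the missing values into the pooled standard error.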

Would you consider turning your second question into a separate post, with more details?



Thank you very much for your answers so far. I’ll post the 2nd question in a new thread, as suggested.

Thanks for the references, which we will look at. Very good point about the mechanism of missingness. We will think about this.




Two further papers I found helpful are:

Wun L-M, Ezzati-Rice TM, Diaz-Tena N, Greenblatt J. On modelling response propensity for dwelling unit (DU) level non-response adjustment in the Medical Expenditure Panel Survey (MEPS). Stat Med. 2007; 26: 1875–1884. PMID: 17206601

Chen Q, Gelman A, Tracy M, Norris FH, Galea S. Incorporating the sampling design in weighting adjustments for panel attrition. Statistics in Medicine. 2015; 34: 3637–3647.

@f2harrell Could you elaborate a bit more on why Bayesian approaches don’t need to do any imputation?



I describe this briefly at the end of the chapter on missing data in RMS. You can use multiple imputation with a flexible method that stacks posterior sample draws, or, as I alluded to earlier, you can jointly model the missing values while doing the outcome modeling. The latter is more appealing. For an example see this.