When using propensity scores to adjust for confounding by indication in observational treatment comparisons, propensity matching has many disadvantages. Besides failing to account for outcome heterogeneity and the non-collapsibility of odds and hazard ratios [I am never interested in average treatment effects but rather in conditional, i.e., patient-specific, effects], propensity matching sacrifices the simple ability to assess whether the treatment effect is modified by a specific covariate or by the propensity itself.

Propensity scores are used when the list of possible confounders has a dimensionality that exceeds the number of covariates that can safely be adjusted for using regression. For example, if there are 100 outcome events and the 50 confounders require 60 degrees of freedom, regression adjustment fails. By distilling the 50 confounders into a single number that reflects how they affect treatment selection, the propensity score is a data reduction technique. But to handle outcome heterogeneity one must adjust not only for the propensity but also for the major outcome predictors. A good default strategy is to adjust for a regression spline in the logit of the propensity plus directly adjust for, say, 5 pre-specified major outcome predictors, even though they are also included in the propensity score. Outcome predictors thought to modify the treatment effect can be pre-specified as simple interacting factors.
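A minimal sketch of this default strategy, on simulated data with hypothetical variable names (the real application would have many more confounders in the propensity model than are adjusted for directly): fit the propensity model, take its linear predictor as the logit propensity, then fit the outcome model with a spline in logit propensity plus the pre-specified major outcome predictors.

```python
# Sketch: covariate adjustment via a spline in logit propensity plus
# direct adjustment for pre-specified major outcome predictors.
# All variables (age, severity, egfr) and coefficients are simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5_000
df = pd.DataFrame({
    "age": rng.normal(60, 10, n),
    "severity": rng.normal(0, 1, n),
    "egfr": rng.normal(70, 15, n),
})
# Treatment selection depends on all three covariates (confounding by indication)
lp_treat = -0.5 + 0.04 * (df["age"] - 60) + 0.8 * df["severity"] - 0.03 * (df["egfr"] - 70)
df["treat"] = (rng.random(n) < 1 / (1 + np.exp(-lp_treat))).astype(int)
# Outcome: true conditional log OR for treatment is 0.7
lp_out = -1.0 + 0.7 * df["treat"] + 0.05 * (df["age"] - 60) + 0.9 * df["severity"]
df["dead"] = (rng.random(n) < 1 / (1 + np.exp(-lp_out))).astype(int)

# Step 1: propensity model over all candidate confounders
ps = smf.logit("treat ~ age + severity + egfr", data=df).fit(disp=0)
# For statsmodels discrete models, fittedvalues is the linear predictor,
# i.e. the logit of the propensity score
df["logit_ps"] = ps.fittedvalues

# Step 2: outcome model -- B-spline in logit propensity plus the
# pre-specified major outcome predictors (here age and severity),
# even though both are also in the propensity model
out = smf.logit("dead ~ treat + bs(logit_ps, df=4) + age + severity",
                data=df).fit(disp=0)
print(f"conditional log OR for treatment: {out.params['treat']:.2f}")
```

Note that the propensity model must contain at least one variable not adjusted for directly (here `egfr`); otherwise the spline in logit propensity would be collinear with the directly adjusted covariates.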

Some studies have found that when physicians are more certain that a specific treatment should be used in their practice, the treatment is more effective. When the treatment interacts with the logit of the propensity score, and this is not explained by a readily pre-specified individual covariate interaction, is the conclusion as simple as the following: physicians have some generalized, non-specific knowledge of which treatment works best when their decisions are more certain (e.g., propensity to treat < 0.2 or > 0.8)? Or is this more likely to signify an important treatment selection variable that is also an outcome predictor but is missing from the dataset? Or one that is in the dataset but was reflected only in the propensity model when it should also have appeared among the separate outcome heterogeneity adjustors? In other words, is it likely we simply missed an important interaction with a single covariate when pre-specifying the model?
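Whatever the interpretation, the interaction itself is easy to test formally once one adjusts for the propensity by modeling rather than matching. A hedged sketch on simulated data (hypothetical variable names; no true interaction is built in, so the test should usually be non-significant): compare the outcome model with and without a treatment-by-spline-of-logit-propensity interaction using a likelihood ratio test.

```python
# Sketch: likelihood ratio test for treatment x logit-propensity interaction.
# Simulated data with a constant treatment effect (no true interaction).
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 4_000
df = pd.DataFrame({
    "severity": rng.normal(0, 1, n),
    "egfr": rng.normal(70, 15, n),
})
lp_t = -0.3 + 0.9 * df["severity"] - 0.02 * (df["egfr"] - 70)
df["treat"] = (rng.random(n) < 1 / (1 + np.exp(-lp_t))).astype(int)
lp_y = -1.0 + 0.6 * df["treat"] + 0.8 * df["severity"]
df["dead"] = (rng.random(n) < 1 / (1 + np.exp(-lp_y))).astype(int)

ps = smf.logit("treat ~ severity + egfr", data=df).fit(disp=0)
df["logit_ps"] = ps.fittedvalues  # linear predictor = logit propensity

# Nested models: main effect only vs. treatment x spline interaction
base = smf.logit("dead ~ treat + bs(logit_ps, df=3)", data=df).fit(disp=0)
full = smf.logit("dead ~ treat * bs(logit_ps, df=3)", data=df).fit(disp=0)

lr = 2 * (full.llf - base.llf)
k = int(full.df_model - base.df_model)
p = stats.chi2.sf(lr, k)
print(f"LR chi-square = {lr:.2f} on {k} df, p = {p:.3f}")
```

A significant test here, absent any pre-specified covariate interaction that explains it, is exactly the puzzle described above.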