I’ve laid out some of the issues in Chapter 17 of BBR.
I didn’t understand why the likelihood doesn’t weight things correctly.
Big picture: too many researchers jump to propensity scores when direct adjustment using ANCOVA works fine. Too many researchers use matching as an analytical method for controlling for confounding, not recognizing that most matching algorithms are not reproducible (they are sensitive to how the dataset happened to be row-ordered) and are arguably not the best science, because relevant observations get deleted. The latter happens very often when “good controls” are dropped because the cases needing to be matched with them were already matched to controls appearing earlier in the dataset.
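To make the row-order sensitivity concrete, here is a toy greedy 1:1 nearest-neighbor matcher with a caliper (my own illustrative sketch, not code from BBR or from any particular matching package; the scores and caliper value are made up). Reordering the treated cases changes which controls get used and which observations are discarded:

```python
def greedy_match(treated_scores, control_scores, caliper=0.03):
    """Match each treated unit, in the order given, to the nearest
    not-yet-used control within the caliper; return the indices of
    the controls that were retained."""
    used = set()
    matches = []
    for t in treated_scores:
        best, best_dist = None, caliper
        for j, c in enumerate(control_scores):
            d = abs(t - c)
            if j not in used and d <= best_dist:
                best, best_dist = j, d
        if best is not None:
            used.add(best)
            matches.append(best)
    return matches

controls = [0.30, 0.32]                          # hypothetical PS values
a = greedy_match([0.28, 0.30], controls)         # case 0.28 processed first
b = greedy_match([0.30, 0.28], controls)         # same cases, rows swapped
print(sorted(a), sorted(b))                      # -> [0, 1] vs. [0]
```

In the second ordering the 0.30 case grabs the 0.30 control, leaving the 0.28 case with no control inside the caliper, so both a treated case and a perfectly good control are thrown away. The matched analysis set depends on nothing more than row order.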
You could say that the bottom line is that matching methods are inefficient, lowering statistical power significantly while still not accounting for outcome heterogeneity. Fully conditional modeling should be the standard, and penalized estimation (Bayesian or frequentist) IMHO provides the “linear” one-step solution to the problem.
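A minimal sketch of what that one-step alternative looks like (my own simulated example, not from BBR; the penalty strength and data-generating model are assumed): fit the outcome on treatment plus all covariates in one ridge-penalized regression, shrinking only the covariate coefficients and keeping every observation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 20
X  = rng.normal(size=(n, p))                        # covariates
tx = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))    # treatment depends on X
y  = 1.0 * tx + X @ rng.normal(0, 0.3, p) + rng.normal(size=n)

# Design matrix: intercept, treatment indicator, then covariates
D   = np.column_stack([np.ones(n), tx, X])
lam = 2.0                                           # assumed penalty strength
P   = np.diag([0.0, 0.0] + [lam] * p)               # leave intercept and
                                                    # treatment unpenalized
beta = np.linalg.solve(D.T @ D + P, D.T @ y)        # one-step ridge solution
print(f"adjusted treatment effect: {beta[1]:.2f}")  # near the true 1.0
```

No subject is deleted, the covariate adjustment is fully conditional, and the answer does not depend on how the rows happened to be ordered.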
Papers have been published in the medical literature (e.g., a highly criticized study of short-term effects of bariatric surgery in morbidly obese individuals) in which more than 4/5 of the patients were dropped from the propensity-matched analysis, not because they were inapplicable (i.e., in the PS non-overlap region) but because the matching analysis strategy was deficient.
I’ve seen PS used in examples where there are 80,000 subjects and 20 covariates. What are researchers thinking?
PS hides treatment interactions, and I’m not convinced it allows for clear analyses of time-dependent covariates.
I hope that helps to frame the discussion for others to respond.