@f2harrell’s propensity score question re-surfaced this broader question that has come up a few times for me in the last month or so, as I am starting work with a group that does a lot of propensity score matching:
Do you generally prefer PS methods or penalized regression (whether Bayesian or frequentist) for causal modeling from non-randomized studies, and why?
The best arguments I have heard in favour of PS methods have been focused on simplicity of interpretation/data reduction.
You can create a complex model for your propensity score, but clinicians interpreting your actual treatment effect are presented with methods that are much more familiar to them from RCTs.
The popularity of these methods also seems to make more complex issues (e.g., time-varying covariates) more accessible.
The best arguments I’ve heard for regression methods are:
Fully conditional, which may be beneficial depending on how you use the results (I usually go this way so I can model interactions explicitly, but maybe I’m off base?)
If you’re working within a Bayesian framework, weighting observations through the likelihood is not technically correct since it’s not really a generative model.
There also seems to be a real lack of tutorials for non-statisticians laying out, for example, a complex PS model and its regression equivalent in terms of code. I may just not be great at finding these, though.
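For concreteness, the sort of side-by-side I have in mind might look something like this minimal sketch (Python with statsmodels; the data frame and column names are made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated stand-in data (all names hypothetical): one row per patient with a
# binary treatment, binary outcome, and baseline confounders.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.normal(60, 10, n),
    "sex": rng.binomial(1, 0.5, n),
    "comorbidity": rng.poisson(2, n),
})
logit_tx = -3 + 0.04 * df["age"] + 0.5 * df["sex"]
df["treatment"] = rng.binomial(1, 1 / (1 + np.exp(-logit_tx)))
logit_y = -2 + 0.5 * df["treatment"] + 0.02 * df["age"] + 0.2 * df["comorbidity"]
df["outcome"] = rng.binomial(1, 1 / (1 + np.exp(-logit_y)))

# Propensity-score route: model the exposure, then use the score downstream
ps_fit = smf.glm("treatment ~ age + sex + comorbidity",
                 data=df, family=sm.families.Binomial()).fit()
df["ps"] = ps_fit.predict(df)   # feed into matching, weighting, or stratification

# Regression route: adjust for the same covariates directly in the outcome model
out_fit = smf.glm("outcome ~ treatment + age + sex + comorbidity",
                  data=df, family=sm.families.Binomial()).fit()
print(out_fit.summary())        # the treatment coefficient is the adjusted effect
```

The two routes use the same covariate information; they differ only in whether the covariates are used to predict the exposure or the outcome.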
I’ve laid out some of the issues in Chapter 17 of BBR.
I didn’t understand why the likelihood doesn’t weight things correctly.
Big picture: too many researchers jump to propensity scores when direct adjustment using ANCOVA works fine. Too many researchers use matching as an analytical method for controlling for confounding, not recognizing that most matching algorithms are not reproducible (they are sensitive to how the dataset happened to be row-ordered) and are arguably not the best science when relevant observations are deleted. The latter happens very often when “good controls” are dropped because the cases needing to be matched with them were already matched with controls appearing earlier in the dataset.
You could say the bottom line is that matching methods are inefficient, lowering statistical power significantly while still not accounting for outcome heterogeneity. Fully conditional modeling should be the standard, and penalized estimation (Bayesian or frequentist) IMHO provides the “linear” one-step solution to the problem.
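To make the row-ordering point concrete, here is a toy sketch (plain NumPy/Python, simulated propensity scores, not the algorithm of any specific matching package) of greedy 1:1 nearest-neighbour matching without replacement; presenting the treated rows in a different order will, in general, select a different set of controls:

```python
import numpy as np

rng = np.random.default_rng(0)
ps_treated = rng.uniform(0.2, 0.8, size=50)   # hypothetical propensity scores
ps_control = rng.uniform(0.1, 0.9, size=60)

def greedy_match(treated_ps, control_ps):
    """Match each treated unit, in the order given, to its nearest unused control."""
    available = list(range(len(control_ps)))
    used = []
    for p in treated_ps:
        j = min(available, key=lambda k: abs(control_ps[k] - p))
        used.append(j)
        available.remove(j)                   # that control cannot be reused
    return set(used)

controls_a = greedy_match(ps_treated, ps_control)        # original row order
controls_b = greedy_match(ps_treated[::-1], ps_control)  # same data, reversed order
print(len(controls_a ^ controls_b), "matched controls differ between the two orderings")
```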
Papers have been published in the medical literature (e.g., a highly criticized study of short-term effects of bariatric surgery in morbidly obese individuals) in which more than 4/5 of the patients were dropped from the propensity-matched analysis, not because they were not applicable (i.e., in the PS non-overlap region) but because the matching analysis strategy was deficient.
I’ve seen PS used in examples where there are 80,000 subjects and 20 covariates. What are researchers thinking?
PS hides treatment interactions and I’m not convinced it allows for clear analyses of time-dependent covariates.
I hope that helps to frame the discussion for others to respond.
Sorry, it’s not that it doesn’t weight things correctly in the sense that it gives you a wrong answer. The argument I have seen is more theoretical: when you weight the likelihood, you aren’t specifying a generative model.
Thanks for the reference. But I’m not seeing any role for weights in this particular context, unless individuals had unequal probabilities of being selected for the sample, the probabilities were known (and become inverse weights), and what comprised the probabilities was unknown. Were the sampling probabilities just functions of age, for example, one would condition on age and ignore the weights.
I fully agree with this; as a recovering propensity-match-a-holic, I somewhat fell victim to the clinical researchers’ fascination with this. Many clinical researchers (specifically surgeons) who wanted to use propensity-score matched analysis did not seem to care very much about things like the accuracy or efficiency of estimates from a PSM approach versus a multivariable regression approach with direct adjustment for covariates; they simply thought it was cool to be able to create a “matched cohort” of Treatment A and Treatment B patients (the so-called “find a randomized controlled trial in observational data” phenomenon). With the benefit of more experience, I think I would/will push back harder against that in future collaborative efforts.
I’ve found that some colleagues prefer matching because it makes more intuitive sense to them than covariate adjustment. I’ve even had a reviewer argue that a randomized clinical trial where I used covariate adjustment should be reanalyzed using matching.
An inconvenient truth for us all: in many cases, even the most eloquent explanation of statistical advantages of one procedure versus another is often lost on people who simply think “but X makes more sense.” Frank and others have provided some good solutions for how one might try to convince skeptics (like showing them the results of simulations and/or methods papers) but it really depends on the quality of investigator you are working with. My report from “the trenches” in one location was that no methods paper or simulation would convince any of those investigators to deviate even the slightest from what they thought was the right plan: much as you described, if they wanted a matched analysis because they thought it made more sense, that’s what they were going to do.
Sigh. I would love to read your reply to this comment, if you have it around. May be useful in case I ever encounter a similar sentiment.
It was a doozy of a review, that’s for sure. At first we even thought that we just didn’t understand what they were saying, but quickly realized it could only be what it seemed. And if I recall, they were quite adamant about it, adding that it was the only way to really know whether a treatment worked. It was well over ten years ago, so the response is long lost, but I vaguely recall that we just politely pointed to basic texts on experimental design. Fortunately, the editor agreed with us.
And indeed, my experience in getting colleagues to read any sort of methods paper has been just as frustrating. The sociology of science looms large in all of this.
Someone had a catchy tweet about a statistician threatening to take on neurosurgery as a hobby. We need to come up with the best possible comeback along the lines of: “Just as I don’t try to dominate any discussion about clinical issues, you need to trust me to use my biostatistics expertise to the max. I’ve been studying this stuff my whole career.”
One problem in the surgical literature (which is rife with retrospective, observational trials) is that many journals will now refuse/resist publication of retrospective cohorts that are NOT matched in some way. The implication being that covariate adjustment alone will fail to provide adequate “control” of the bias arising from differential selection for treatments.
I don’t see much discussion of inverse probability of treatment weighting (based on the propensity score), which I feel avoids some of the efficiency issues Frank raised (it avoids dropping “good controls”). Does anyone have a better feel for how useful a technique it is? (I should probably make that a separate post!)
This is not the full story but I believe that weighted estimates are in general inefficient (high variance). That’s because to upweight underrepresented observations you must downweight well-represented ones, lowering the effective sample size. Weights have to sum to 1.0.
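One quick way to see this with simulated (hypothetical) weights is Kish’s effective sample size, n_eff = (sum of weights)^2 / (sum of squared weights); a rough sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
ps = rng.uniform(0.05, 0.95, size=1000)              # hypothetical propensity scores
treated = rng.binomial(1, ps)
w = np.where(treated == 1, 1 / ps, 1 / (1 - ps))     # unstabilized IPT weights

n_eff = w.sum() ** 2 / (w ** 2).sum()                # Kish's effective sample size
print(f"nominal n = {w.size}, effective n ~ {n_eff:.0f}")
# The more variable the weights, the smaller n_eff and the larger the variance.
```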
This is very interesting. I’m seeing lots of reasons to go with a regression model to adjust for covariates. Are there situations where case-control matching or propensity scores are preferable?
I don’t think so. Ellie Murray and Miguel Hernán have a nice example of using time-dependent covariates to decipher the effect of adhering to treatment in a randomized cardiovascular study. Having the right covariates measured at baseline and updated often enough is key. The covariates influencing treatment changes have to be measured (or at least variables well correlated with them).
I have been discussing this with some propensity score supporters up my way, and got directed to CNODES, which is an initiative focused on using large distributed data networks (in the modest Canadian definition of large) to estimate causal effects of drugs. They have a series of lectures on their website, one of which, related to IPTW, describes a general benefit of propensity score models as follows:
"For observational studies of drugs, researchers seeking to control for confounding often have a better understanding of the predictors of drug exposure than they do of the complex etiology of the disease under study. Since conventional methods aim to eliminate confounding by controlling for predictors of disease, they may sometimes be less useful than methods of control that are based on predicting exposure. This methods lecture will cover an exposure-based method known as IPTW, including: counterfactuals and average causal effects; inverse probability weighting to create a pseudo-population; stabilization; and implementation/an example."
Based on their interest in propensity score methods, I did a little more searching within the high-dimensional setting and found the following simulation.
Regularized Regression Versus the High-Dimensional Propensity Score for Confounding Adjustment in Secondary Database Analyses.
Selection and measurement of confounders is critical for successful adjustment in nonrandomized studies. Although the principles behind confounder selection are now well established, variable selection for confounder adjustment remains a difficult problem in practice, particularly in secondary analyses of databases. We present a simulation study that compares the high-dimensional propensity score algorithm for variable selection with approaches that utilize direct adjustment for all potential confounders via regularized regression, including ridge regression and lasso regression. Simulations were based on 2 previously published pharmacoepidemiologic cohorts and used the plasmode simulation framework to create realistic simulated data sets with thousands of potential confounders. Performance of methods was evaluated with respect to bias and mean squared error of the estimated effects of a binary treatment. Simulation scenarios varied the true underlying outcome model, treatment effect, prevalence of exposure and outcome, and presence of unmeasured confounding. Across scenarios, high-dimensional propensity score approaches generally performed better than regularized regression approaches. However, including the variables selected by lasso regression in a regular propensity score model also performed well and may provide a promising alternative variable selection method.
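For what it’s worth, one plausible reading of that hybrid (lasso to screen candidate confounders, then a regular propensity score model on the selected variables) might look like this sketch, using scikit-learn and statsmodels with a simulated high-dimensional confounder matrix standing in for a real database:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegressionCV

# Simulated stand-in for a secondary-database cohort: n patients, p candidate
# confounders, binary treatment a and binary outcome y (all hypothetical).
rng = np.random.default_rng(3)
n, p = 2000, 200
X = rng.normal(size=(n, p))
a = rng.binomial(1, 1 / (1 + np.exp(-X[:, :5].sum(axis=1))))
y = rng.binomial(1, 1 / (1 + np.exp(-(0.3 * a + X[:, :5].sum(axis=1)))))

# Step 1: lasso outcome model (treatment plus all candidates) to screen covariates
lasso = LogisticRegressionCV(penalty="l1", solver="saga", Cs=10, max_iter=5000)
lasso.fit(np.column_stack([a, X]), y)
selected = np.flatnonzero(lasso.coef_[0][1:])   # indices into X, treatment excluded

# Step 2: ordinary propensity score model using only the selected covariates
Z = sm.add_constant(X[:, selected])
ps = sm.GLM(a, Z, family=sm.families.Binomial()).fit().predict(Z)
```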
That has not been our experience (we find great performance of ridge regression), but there is a definite need for continued simulation. The number of candidate predictors may be very important.
I wonder, then, if this ties into my other post re: whether we should be making use of more pre-analysis simulation for the particular question at hand. At least then people would have to clarify their view of how the data came to be in order to look at how their preferred method performs.
Great point. I think that simulation that is more tailored to the problem at hand (with regard to number of potential confounders, sample size, outcome variable distribution, exposure distribution) would be extremely valuable.
It’s nice to hear the (modest Canadian ;)) CNODES shout-out! I’m glad to hear that our lectures are of use.
As to the propensity score, and as a semi-regular user and teacher of PS methods:
I agree that they’re somewhat oversold and overused. They’ve become the norm in some fields, and are sometimes assigned almost magical properties w.r.t. confounding control.
They are sometimes useful, especially in the case of rare outcomes and common exposures, but also in the case Tim outlines where we may be better off from a content perspective modelling the exposure.
To me, an important (the most important?) feature of the PS is as a diagnostic tool about the space of support in your data and about heterogeneity of effect. The paper by Kurth et al. that I referred to on Twitter yesterday is a good example of this.
Finally, I echo the call for more, and more realistic, simulation studies. The plasmode simulations cited here are a great idea and a good way to generate realistic, plausible data.