Propensity scores and weighting as causal inference approaches

Ahmed_Sayed · December 25, 2024, 6:00pm

Not directly related to the appropriateness (or lack thereof) of propensity scores, but publications in medical journals really should not be taken as a guide for statistical practice.

There’s no shortage of poor quality statistics at the top medical journals, including arbitrary categorization, misinterpretation of P-values, poor model comparison methods, inappropriate variable selection methods, failure to incorporate recurrent events, and the list goes on.

JiaqiLi · December 26, 2024, 1:18am

Thank you very much. I agree with your opinion. During my doctoral studies, I was taught to learn statistical methods from research papers, but I later realized that this approach could indeed be misleading. However, let’s return to the topic at hand. While some methodological articles do not explicitly state the methods they use, the practical papers on target simulation experiments (that I have read so far) all seem to adopt this weighting approach.

Additionally, my boss is very interested in causal inference, and his view is that matching and weighting are causal inference methods. I have also been required to use these methods.

Finally, the purpose of my question is to determine whether implementing target simulation experiments using the weighted method or being required to use matching and weighting as causal inference methods is appropriate. This is because I remember many people opposing matching and weighting. This is the point I want to confirm.

robinblythe · December 26, 2024, 7:00am

I recently asked for major revisions on a medical statistics paper where the authors discarded hundreds of thousands of observations using propensity score matching in the name of answering a “causal question” (what is the cost of admissions where the patient deteriorates vs doesn’t deteriorate). My objections were dismissed because other papers by famous authors used matching. I think the editor considered my disagreements with the methods to be nitpicky.

All this is simply to say that there are often very good reasons against a method and they will be completely ignored because of longstanding conventions. I think the limitations of matching are most apparent when there is a material difference between patient groups that cannot be adequately incorporated within the algorithm, and when matching means throwing away lots of data.

I think this topic is also discussed in the ‘common statistical errors’ thread on this forum.

f2harrell · December 26, 2024, 10:04am

Matching and weighting are no more causal than most other methods, and often use silly estimands when there is outcome heterogeneity within exposure group. Weighting is just a method that allows you to relax certain non-causal assumptions in hard-to-model situations such as nonrandom dropouts.

Discarding valid observations is unscientific and irreproducible.

ehudk · January 1, 2025, 8:16pm

I have three comments about the initial post:

First, propensity-based methods or matching may indeed be less statistically efficient, but they have their advantages. For example, matching is very epistemically convincing: it’s a method you can explain to your elderly aunt over a family dinner. This is not nothing, especially for high-profile policy-informing work that may have a big impact on the public and for which communication is crucial.
Therefore, I disagree with the categorical cancellation of such methods.

Second, as for target trial emulation, that’s a general framework for approaching observational studies with fewer design-induced biases. The statistical analysis is just one step out of several specified in the approach. Most importantly, there is nothing in the framework specifying how to adjust for confounding in the statistical analysis. The fact that people use propensity-based methods has nothing to do with target trial emulation and just has to do with these methods being very popular in observational studies.

Third, the attached paper uses IP-weighted Cox regression (rather than just adjusting for the covariates directly in the regression), and adjusts for a paragraph-long list of covariates (I counted 26?). From my experience, Cox regression tends to be finicky for high-dimensional data (although my experience is with 100s of covariates [and Python]), suffering from numerical convergence problems if not properly regularized. I have occasionally found adjustment through weighting to be a simple way to overcome this.

f2harrell · January 4, 2025, 11:56am

A lot of things are easy to explain when they are oversimplified and completely cover up all their assumptions (or as in matching or weighting, inflate the variance of the estimator of interest). We don’t need to be able to easily explain a method any more than biochemists should be called on to simply explain a complex biochemical pathway. But we do need to be able to explain the interpretation of the key results arising from using the methods, and to check the method’s assumptions.

ehudk · January 7, 2025, 1:26pm

Thanks, Frank. I don’t disagree; and I don’t have an evidence-backed argument, only a hunch - but in some extremely high-profile scenarios statisticians will need to explain their methods just as biochemists will need to explain biochemical pathways.
Take the covid vaccine for example.
Communication was extremely crucial to engage the public and encourage it to cooperate.
Biochemists explaining the vaccine mechanism was therefore important to seed trust.
Displaying cumulative incidence curves (rather than survival) in the Pfizer vaccine was simpler to communicate.
And proving nationwide effectiveness had to be as easily communicated, and I think they made a great choice choosing a simple (coarsened) exact matching rather than a fancier analytical method (and I’m positive the tradeoff was easier given they had a dataset large enough to make inflated variance a somewhat negligible issue). But I heard the authors explain the methodology on the main evening news program; you can’t do that with an ordinal longitudinal Markov model.

f2harrell · January 8, 2025, 12:33pm

Great points. On the small part of this related to showing survival curves vs. cumulative incidence curves I personally think that cumulative incidence curves should almost always be chosen. For one reason, journals let you choose the y-axis limits whereas for survival curves they often force you to scale [0, 1].

samw235711 · January 28, 2025, 10:32pm

The question is why IPTW instead of multivariate adjustment.

Let Y be the outcome, X the confounders, and A the binary treatment.

Bear in mind, all of this assumes no unmeasured confounders - that we know that X contains everything that impacts Y and A.

Let Y,X,A \sim P(Y,A,X)=P(Y|A,X)P(A|X)P(X).

Now focus on P(A|X). This is a treatment distribution - it is how treatments are assigned based on confounders (baseline covariates). It is the “true” treatment distribution - in other words, it generated the data. (For intuition, in a 1:1 randomized trial P(A=1|X)=0.5.) I am going to subscript expectations with respect to these treatment sampling distributions.

Now consider a different treatment distribution, which we would like to evaluate. In particular, consider the distribution I(A=a), which is 1 if the treatment is a and 0 otherwise. This is what we are after in causal inference. We ask: what would the average outcome be, were we to use treatment a, even if, in reality, we actually sampled the treatment random variable A from P(A|X), sometimes setting A to a and sometimes to not a.

EY(a) is the counterfactual expectation we would like to know - what would happen on average if we could set the treatment A to a.

Note that
EY(a)=E_{I(A=a)}Y=\sum_{a'}\int y\,P(y|a',x)\,I(a'=a)\,p(x)\,dy\,dx=\int y\,P(y|a,x)\,p(x)\,dy\,dx.

One way to think about IPTW is as a density transform.

Note (I only multiply by 1):
E_{I(A=a)}Y=E_{I(A=a)}\frac{P(Y,A,X)}{P(Y,A,X)} Y = E_{P(A|X)}\frac{P(Y|A,X)I(A=a)P(X)}{P(Y,A,X)} Y which can be estimated with \frac{1}{n} \sum \frac{I(A_i=a)}{P(A_i|X_i)} Y_i, which is an IPTW estimator with density P(A_i|X_i) as the propensity.
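
To make the estimator above concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: a single confounder, a logistic propensity that is treated as known rather than estimated, a linear outcome with a true treatment effect of 1, and the helper name `iptw_mean`.

```python
import numpy as np

def iptw_mean(y, a, propensity, treatment=1):
    """Estimate E[Y(treatment)] as (1/n) * sum I(A_i = treatment) / P(A_i | X_i) * Y_i."""
    y, a, propensity = map(np.asarray, (y, a, propensity))
    # P(A_i | X_i): probability of the treatment that unit i actually received
    p_received = np.where(a == 1, propensity, 1.0 - propensity)
    return np.mean((a == treatment) / p_received * y)

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)                      # single measured confounder (assumed)
e = 1.0 / (1.0 + np.exp(-x))                # true P(A=1 | X), assumed known here
a = rng.binomial(1, e)
y = 1.0 * a + 2.0 * x + rng.normal(size=n)  # true E[Y(1)] - E[Y(0)] = 1

# Unadjusted contrast is confounded by x; the weighted contrast is not
naive = y[a == 1].mean() - y[a == 0].mean()
iptw = iptw_mean(y, a, e, 1) - iptw_mean(y, a, e, 0)
print(f"naive difference: {naive:.2f}, IPTW difference: {iptw:.2f} (truth: 1.00)")
```

In practice the propensity has to be estimated, and the result is only as good as that model and the no-unmeasured-confounders assumption.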

In contrast, multivariate adjustment would be something like Y=\theta A + \eta X+\epsilon, where we try to assume a normal distribution for Y|A,X that is linear in its parameters. If that multivariate model is wrong (it almost certainly is), then the causal inference is wrong.

Whereas with IPW only P(A_i|X_i) needs to be specified correctly, and generally this is done with a nonparametric estimator (will need more sample to do this, of course). It is easier in theory to specify and estimate P(A|X) in IPW than P(Y|A,X) as in the multivariate adjustment case.

Hence IPW is a very popular method, and one would not generally use multivariate adjustment - a multivariable model may require less sample, but the bias from misspecifying the model doesn’t make up for it.

Pavlos_Msaouel · January 28, 2025, 11:59pm

Good point. The one counterargument is that bias should be compared between IPTW and non-parametric outcome regression for a more “apples-to-apples” comparison. When that happens Bayesian non-parametric outcome regression can, perhaps counterintuitively, beat even the doubly robust augmented version of IPTW (see also the related discussions and rejoinder at the end of the pdf version).

robinblythe · January 29, 2025, 2:33am

This is a useful explanation, thanks. However, I do think the assumption that X contains all potential confounders is a fanciful one, especially in medicine/epidemiology. In this context, as a reader or analyst I would probably prefer the wrong multivariate model to the wrong propensity weighted approach because it is more transparent.

daszlosek · February 3, 2025, 6:44pm

I read this manuscript over the weekend that seemed relevant to the conversation, so I thought I would summarize it and spur some more possible discussion.

While this is only partially related to the original question, I wanted to share a relevant simulation study by Schuler and Rose (2016) comparing Inverse Propensity Score Weighting (IPW), G-Computation, and Targeted Maximum Likelihood Estimation (TMLE).

I initially planned to explore this with my own simulated data, but I found a well-done paper this weekend with a great visualization of these comparisons.

The study used a simulated dataset where the true data-generating mechanism was:

Y ~ A + X1 + X2 + X3 + A*X3

where X3 has known interaction with A

The authors examined how misspecification impacted estimates by testing:

1 - Misspecified outcome regression – omitting the interaction term (A*X3)
2 - Misspecified exposure regression – omitting X3 entirely from the propensity score model.

They compared results to the true data-generating process using three methods, both with and without Super Learner:

TMLE
G-Computation
Inverse Propensity Score Weighting (IPW)

Here are the scenarios they looked at:

Scenario	Outcome Model Specification	Exposure Model Specification
Correct Model	Y ~ A + X1 + X2 + X3 + A * X3	A ~ X1 + X2 + X3
OV Outcome	Y ~ A + X1 + X2 (omits X3)	A ~ X1 + X2 + X3 (correct)
OV Exposure	Y ~ A + X1 + X2 + X3 + A * X3 (correct)	A ~ X1 + X2 (omits X3)
MT Outcome	Y ~ A + X1 + X2 + X3 (omits interaction A * X3)	A ~ X1 + X2 + X3 (correct)

They looked at % bias as the average estimated treatment effect across 1,000 simulated sets vs the true average treatment effect known from the data-generating process.
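
For readers who want to poke at this kind of comparison themselves, below is a rough sketch of a percent-bias simulation in the same spirit. It is not the paper's code or design: the coefficients, covariate distributions, sample size, number of replications, the logistic propensity model, and the normalized (Hajek) form of the IPW estimator are all assumptions, and it covers only G-computation and IPW with plain parametric models (no TMLE, no Super Learner).

```python
import numpy as np

TRUE_ATE = 1.0  # beta_A + beta_interaction * E[X3] = 1 + 2 * 0 under the assumed mechanism

def simulate(rng, n):
    x = rng.normal(size=(n, 3))                            # X1, X2, X3
    logit = -1.0 + 0.5 * x[:, 0] + 0.5 * x[:, 1] + x[:, 2]
    a = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
    y = a + x.sum(axis=1) + 2.0 * a * x[:, 2] + rng.normal(size=n)
    return x, a, y

def fit_logistic(X, a, n_iter=25):
    """Logistic regression by Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        beta += np.linalg.solve(hess, X.T @ (a - p))
    return beta

def g_computation(x, a, y, interaction):
    """Fit a linear outcome model, then average predictions under A=1 minus A=0."""
    def design(a_vec):
        cols = [np.ones(len(y)), a_vec, x[:, 0], x[:, 1], x[:, 2]]
        if interaction:
            cols.append(a_vec * x[:, 2])
        return np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(design(a), y, rcond=None)
    return np.mean(design(np.ones(len(y))) @ beta - design(np.zeros(len(y))) @ beta)

def ipw(x, a, y, covariates):
    """Normalized (Hajek) IPW with a logistic propensity model on the chosen covariates."""
    X = np.column_stack([np.ones(len(y)), x[:, covariates]])
    e = 1.0 / (1.0 + np.exp(-X @ fit_logistic(X, a)))
    w1, w0 = a / e, (1 - a) / (1.0 - e)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

def percent_bias(n_sims=200, n=5000, seed=0):
    rng = np.random.default_rng(seed)
    est = {"g-comp, correct": [], "g-comp, no A*X3": [],
           "IPW, correct": [], "IPW, no X3": []}
    for _ in range(n_sims):
        x, a, y = simulate(rng, n)
        est["g-comp, correct"].append(g_computation(x, a, y, interaction=True))
        est["g-comp, no A*X3"].append(g_computation(x, a, y, interaction=False))
        est["IPW, correct"].append(ipw(x, a, y, [0, 1, 2]))
        est["IPW, no X3"].append(ipw(x, a, y, [0, 1]))
    return {k: 100.0 * (np.mean(v) - TRUE_ATE) / TRUE_ATE for k, v in est.items()}

results = percent_bias()
for name, pb in results.items():
    print(f"{name:18s} {pb:+.1f}% bias")
```

How large each bias turns out depends heavily on the assumed mechanism, so the numbers from a toy like this should not be read as a ranking of the methods.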

Of course, there are some limitations:

1 – This is a simulation, so while we know the true effect it does not fully reflect reality. No time-varying covariates, data sparsity, etc. (there does exist LTMLE for longitudinal data)

2 – To the point of @Pavlos_Msaouel are they really measuring apples to apples here?

3 - As @robinblythe stated, it is extremely difficult to state no unmeasured confounders - so where does that leave causal inference?

I imagine TMLE would fall under what @f2harrell might call ‘silly estimands,’ and I completely agree that propensity score matching (where you actually discard data) has significant issues. However, would techniques like propensity score weighting be considered just as problematic?

samw235711 · February 5, 2025, 1:24am

@Pavlos_Msaouel thank you, yes I defined the parametric multivariate approach too narrowly. For some reason I guess the term “multivariate adjustment” is linked in my mind to parametric models, but really it is E(E(Y|a,X)), in its general form. If one mis-specifies the propensity in IPW they will be in a lot of trouble, too - I think the chart from @daszlosek gets at this.

@robinblythe agreed, it’s really about whether one is more comfortable estimating the outcome model E(Y|a,X), as in multivariate adjustment, or propensity P(A|X), as in weighting.
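
A small sketch of the two routes, E(E(Y|a,X)) and weighting by P(A|X), may help here. It assumes a toy setting that is not from the thread: one discrete confounder with four levels, a treatment effect of 1, and fully nonparametric (stratum-by-stratum) estimates of both the outcome means and the propensity. In that saturated case the two estimators coincide algebraically, so the practical difference between them only appears once modeling choices enter.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.integers(0, 4, size=n)                  # one discrete confounder with 4 levels (assumed)
a = rng.binomial(1, 0.2 + 0.15 * x)             # treatment probability rises with x
y = 1.0 * a + 0.5 * x**2 + rng.normal(size=n)   # nonlinear in x; true E[Y(1)] - E[Y(0)] = 1

def standardized_mean(y, a, x, treatment):
    """E(E(Y | a, X)): average stratum-specific outcome means over the distribution of X."""
    total = 0.0
    for level in np.unique(x):
        in_stratum = x == level
        total += in_stratum.mean() * y[in_stratum & (a == treatment)].mean()
    return total

def iptw_mean(y, a, x, treatment):
    """(1/n) * sum I(A_i = a) / P_hat(A_i | X_i) * Y_i, with stratum proportions as P_hat."""
    p_hat = np.empty(len(y))
    for level in np.unique(x):
        in_stratum = x == level
        p_hat[in_stratum] = (a[in_stratum] == treatment).mean()
    return np.mean((a == treatment) / p_hat * y)

ate_std = standardized_mean(y, a, x, 1) - standardized_mean(y, a, x, 0)
ate_iptw = iptw_mean(y, a, x, 1) - iptw_mean(y, a, x, 0)
print(f"standardization: {ate_std:.4f}, IPTW: {ate_iptw:.4f}")
```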

I think there are times when one or the other is easier.

In terms of the “no unmeasured confounder assumption,” I think a question is whether it’s better to make fewer claims about these scenarios in general, or to make some claims under assumptions that are not testable, but maybe from a domain expert’s perspective can be more or less justifiable.

f2harrell · February 5, 2025, 12:36pm

We need precise language. Not “the effect of x is …” but “accounting for a, b, c, … and assuming that other baseline states that were not measured are not both related to outcome and to x, the effect of x is …”.

timdisher · February 5, 2025, 3:31pm

I don’t think there’s any need to allow the treatment model to be flexible/non-parametric but require the outcome model to be linear in parameters; see e.g. https://arxiv.org/abs/1707.02641, among many other options.

I’ve always found the most compelling reasons for PS methods to be:

1. Sometimes the event is very rare and the treatment assignment model is comparatively rich which allows for more complex/realistic models on treatment than outcome.
2. Clinicians may feel more comfortable thinking through important variables/interactions in the treatment model than the outcome model. I think there’s also an assumption that the treatment model is generally simpler than outcome model, but that’s just a sense I’ve picked up from conversation.
3. Prescribers seem more comfortable looking at something that looks like a table 1 from an RCT than thinking about a regression equation.
4. Implementations of some of the more complex methods (time varying confounding, marginal estimates from non-linear models) are easier or more available/intuitive with weighting methods.

Three and four are a little more about salesmanship/accessibility than real arguments in favour of one over the other in my opinion.

Pavlos_Msaouel · February 5, 2025, 4:07pm

Then these are actually arguments for not prioritizing PS methods. Even the strongest PS advocates in the methodology world would agree that the covariates included in PS should always impact outcome and never exclusively treatment assignment. Thus, if clinicians focus on the treatment model they run the risk of including purely exposure predictors that can increase bias.

One indirect reason we call the tendency of prescribers to scrutinize Table 1 of RCTs the “Table 1 fallacy” is exactly to prevent such errors.

timdisher · February 5, 2025, 4:54pm

I think you raise some good points but I’m not sure I entirely agree. I can’t think of many cases from my experience where focus on treatment model has identified variables that have no feasible theoretical or empirical effect on outcome so I think the appeal is more that it just somehow feels more tangible for a lot of clinicians to think like “I’m more likely to prescribe this to men but not in certain subgroups of xyz variable” than it is for them to think of actual prediction of outcome. I can relate to you arguing (I think?) it shouldn’t really matter since you need to think of outcome either way, but a lot of people do genuinely seem to struggle and I’m not convinced true null effects are so rampant in the subset of variables considered in most PS studies that it would really make a difference so it just seems like a preference thing to me.

Re: Table 1 fallacy I agree in the realm of using significance tests to select adjustment variables in RCTs and with the underlying theory of “just adjust for everything prognostic” but the appeal of Table 1 in non randomized studies is that a) it includes everything you thought was important to include in your model; and b) it’s immediate feedback that the model “worked” in terms of the balance it was targeting using SMDs or similar. Combining that with ESS you have a really powerful summary that a lot of non technical clinicians can react to intelligently/junior analysts can understand so while it’s maybe not perfect I do think there’s a lot of situations where the PS version of an analysis was done to a higher standard than the outcome model version.
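
For reference, the two summaries mentioned here (SMDs after weighting, and ESS) can be computed in a few lines. This is a sketch under assumptions: a single hypothetical covariate `age`, a known logistic propensity, ATE-type inverse probability weights, the unweighted pooled SD as the SMD denominator (conventions differ), and Kish's approximation for the effective sample size.

```python
import numpy as np

def weighted_smd(x, a, w):
    """Weighted standardized mean difference of covariate x between treatment arms."""
    m1 = np.average(x[a == 1], weights=w[a == 1])
    m0 = np.average(x[a == 0], weights=w[a == 0])
    # One common convention (an assumption here): standardize by the unweighted pooled SD
    pooled_sd = np.sqrt((x[a == 1].var(ddof=1) + x[a == 0].var(ddof=1)) / 2.0)
    return (m1 - m0) / pooled_sd

def effective_sample_size(w):
    """Kish's approximation: (sum of weights)^2 / sum of squared weights."""
    return w.sum() ** 2 / np.sum(w ** 2)

rng = np.random.default_rng(2)
n = 20_000
age = rng.normal(60, 10, size=n)                 # hypothetical covariate
e = 1.0 / (1.0 + np.exp(-(age - 60) / 10))       # true propensity, assumed known
a = rng.binomial(1, e)
w = np.where(a == 1, 1.0 / e, 1.0 / (1.0 - e))   # ATE weights

smd_before = weighted_smd(age, a, np.ones(n))
smd_after = weighted_smd(age, a, w)
ess_treated = effective_sample_size(w[a == 1])
ess_control = effective_sample_size(w[a == 0])
print(f"SMD before weighting: {smd_before:.2f}, after: {smd_after:.2f}")
print(f"treated: n={np.sum(a == 1)}, ESS={ess_treated:.0f}; "
      f"control: n={np.sum(a == 0)}, ESS={ess_control:.0f}")
```

The gap between n and ESS in each arm is one way to see the variance inflation from weighting that was raised earlier in the thread. Good balance on measured covariates says nothing about unmeasured ones.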

f2harrell · February 6, 2025, 2:05pm

What @Pavlos_Msaouel wrote is exceptionally important. Table 1 is a misleading presentation of the data that covers up what is going on. Table 1 should be replaced with information such as what is proposed here. PS doesn’t instill good statistical practice because it leads its practitioners to pretend such things as elderly patients don’t die sooner than young patients.

samw235711 · March 20, 2025, 7:08pm

In agreement about the parametric part, thanks @Pavlos_Msaouel for pointing it out earlier. In summary, the question is whether one is more comfortable specifying an outcome model or propensity model, and both can be either parametric or not. Let’s assume they are nonparametric - then the question is what confounders to include.

I like the thinking in @timdisher 2.
I think it is more likely that one can satisfy no unmeasured confounders in medicine with a propensity model when there is a human recommending treatments. One can ask the provider what information they take into account. I therefore might tend toward a method like inverse-probability weighting in this case. In other areas of study, it’s much harder to specify a propensity, since nature assigns the treatment /exposure.

Either way, the language always needs to alert readers to the strong untestable assumptions present in this kind of work, as @f2harrell says. It’s bad from a clinical standpoint if an observational study is accidentally interpreted as an RCT.

This said, I am also humbled by the potential yield of being able to use routinely collected data (now so plentiful, with the EHR) to make trial-like statements. Also, there are so many important medical problems that do not permit randomization.

In this work, we try to promote transparency in observational data analysis. In other words, the method shines a light on how an estimator derived using observational data (and the corresponding assumptions) might nudge an existing guideline to better maximize utility, if possible. Because it’s transparent in this way, if the suggested change to guideline doesn’t make sense (possibly due to causal assumptions that are violated), it will be detected by the end-user before implementation into clinic. Also, this type of transparency generates interesting discussions about how to change guidelines in the direction of optimality, whether our sense of optimality is even coherent, and how much we can really use observational data to do this.

f2harrell · March 21, 2025, 11:32am

This is very interesting and very different from other approaches I’ve seen. The proof is in the pudding. Can you either create realistic simulations with known truths and apply the method, or apply it to observational datasets where we know the truth from randomized studies?