Propensity scores and weighting as causal inference approaches

First of all, I want to say that I am still learning, and my perspective may have misunderstandings. I hope to receive guidance and corrections.

Regarding propensity scores and weighting (e.g., Inverse Probability Weighting, IPW) in observational studies, what I've learned is that they are low-power methods and should not be used! The only scenario where propensity scores are applicable is when the sample size is very small, for example, when the number of outcome events is less than ten times the number of adjustment variables. In such cases, propensity scores may outperform multivariable adjustment. Overall, matching and weighting methods are not good practice.

Propensity scores and weighting seem theoretically appealing, which is why they were once very popular. However, their real-world performance falls far short of their theoretical promise, and as a result they are now rarely used. At least in the papers I've read, I seldom see propensity scores anymore.

The above reflects my personal understanding based on my learning.

However, I am aware that many people regard propensity scores and weighting as causal inference methods, with some even equating causal inference entirely with matching and weighting. Recently, I came across a method called Target Trial Emulation, which aims to emulate randomized trials using observational data (a concept I find somewhat odd, though I acknowledge its necessity in real-world applications). This method employs Inverse Probability Weighting (IPW) as a causal inference approach.
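
For concreteness, this is my understanding of the basic IPW recipe, as a toy sketch I wrote myself in Python (the data-generating process, effect size, and variable names are all invented, purely for illustration):

```python
# Toy IPW sketch: simulate confounded data, fit a propensity model,
# and estimate the treatment effect with stabilized inverse probability weights.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=(n, 2))                   # measured confounders
p_treat = 1 / (1 + np.exp(-(x[:, 0] - 0.5 * x[:, 1])))
a = rng.binomial(1, p_treat)                  # treatment depends on the confounders
y = 1.0 * a + x[:, 0] + rng.normal(size=n)    # true treatment effect = 1.0

# 1. Propensity model: P(A = 1 | X)
ps = LogisticRegression().fit(x, a).predict_proba(x)[:, 1]

# 2. Stabilized inverse probability weights
w = np.where(a == 1, a.mean() / ps, (1 - a.mean()) / (1 - ps))

# 3. Weighted difference in means estimates the average treatment effect
ate = (np.average(y[a == 1], weights=w[a == 1])
       - np.average(y[a == 0], weights=w[a == 0]))
print(f"IPW estimate: {ate:.2f}")             # should land near 1.0
```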

This is where my confusion lies. I had always thought that propensity scores and IPW should not be used. But now I see studies published in JAMA or BMJ using these methods. Have I misunderstood something, or am I missing something important?
Association Between Early Treatment With Tocilizumab and Mortality Among Critically Ill Patients With COVID-19

If anyone is willing to patiently share their thoughts, I would be very grateful and benefit greatly. Thank you!

2 Likes

Not directly related to the appropriateness (or lack thereof) of propensity scores, but publications in medical journals really should not be taken as a guide for statistical practice.

There’s no shortage of poor quality statistics at the top medical journals, including arbitrary categorization, misinterpretation of P-values, poor model comparison methods, inappropriate variable selection methods, failure to incorporate recurrent events, and the list goes on.

4 Likes

Thank you very much. I agree with your opinion. During my doctoral studies, I was taught to learn statistical methods from research papers, but I later realized that this approach can indeed be misleading. However, let's return to the topic at hand. While some methodological articles do not explicitly state the methods they use, the applied papers on target trial emulation (that I have read so far) all seem to adopt this weighting approach.

Additionally, my boss is very interested in causal inference, and his view is that matching and weighting are causal inference methods. I have also been required to use these methods.

Finally, the purpose of my question is to determine whether implementing target trial emulation with weighting, or being required to use matching and weighting as causal inference methods, is appropriate. I remember many people opposing matching and weighting, and that is the point I want to confirm.

2 Likes

I recently asked for major revisions on a medical statistics paper where the authors discarded hundreds of thousands of observations using propensity score matching in the name of answering a “causal question” (what is the cost of admissions where the patient deteriorates vs doesn’t deteriorate). My objections were dismissed because other papers by famous authors used matching. I think the editor considered my disagreements with the methods to be nitpicky.

All this is simply to say that there are often very good reasons against a method and they will be completely ignored because of longstanding conventions. I think the limitations of matching are most apparent when there is a material difference between patient groups that cannot be adequately incorporated within the algorithm, and when matching means throwing away lots of data.
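
To make the "throwing away lots of data" point concrete, here is a toy sketch of 1:1 nearest-neighbour propensity score matching (my own illustration in Python with simulated data, not the reviewed paper's code); every control that is never selected as a match simply vanishes from the analysis:

```python
# Toy sketch: 1:1 nearest-neighbour matching on the propensity score,
# counting how many observations are discarded. Simulated data for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=(n, 3))
a = rng.binomial(1, 1 / (1 + np.exp(-(x[:, 0] - 2))))  # a minority exposed

ps = LogisticRegression().fit(x, a).predict_proba(x)[:, 1].reshape(-1, 1)

# Match each exposed unit to its nearest control on the propensity score
nn = NearestNeighbors(n_neighbors=1).fit(ps[a == 0])
_, idx = nn.kneighbors(ps[a == 1])
n_matched_controls = np.unique(idx).size       # matching with replacement

kept = (a == 1).sum() + n_matched_controls
print(f"Analysis set keeps {kept} of {n} observations; {n - kept} discarded")
```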

I think this topic is also discussed in the ‘common statistical errors’ thread on this forum.

5 Likes

Matching and weighting are no more causal than most other methods, and they often use silly estimands when there is outcome heterogeneity within an exposure group. Weighting is just a method that allows you to relax certain non-causal assumptions in hard-to-model situations such as nonrandom dropout.

Discarding valid observations is unscientific and irreproducible.

3 Likes

I have three comments about the initial post:

First, propensity-based methods or matching may indeed be less statistically efficient, but they have their advantages. For example, matching is epistemically very convincing: it's a method you can explain to your elderly aunt over a family dinner. This is not nothing, especially for high-profile, policy-informing work with a big potential impact on the public, where communication is crucial.
Therefore, I disagree with the categorical cancellation of such methods.

Second, as for target trial emulation: that's a general framework for approaching observational studies with fewer design-induced biases. The statistical analysis is just one of several steps specified in the approach, and most importantly, nothing in the framework specifies how to adjust for confounding in the statistical analysis. The fact that people use propensity-based methods has nothing to do with target trial emulation itself; it just reflects how popular these methods are in observational studies.

Third, the attached paper uses IP-weighted Cox regression (rather than adjusting for the covariates directly in the regression) and adjusts for a paragraph-long list of covariates (I counted 26?). In my experience, Cox regression tends to be finicky with high-dimensional data (although my experience is with hundreds of covariates [and Python]), suffering from numerical convergence problems if not properly regularized, and I have occasionally found adjustment through weighting to be a simple way to overcome this.
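
For illustration, here is roughly what such an IP-weighted Cox fit looks like with lifelines in Python (a toy sketch with simulated data; this is not the paper's actual analysis code):

```python
# Toy sketch of an IP-weighted Cox model: weights from a propensity model,
# then a weighted Cox fit with a robust variance. Simulated data only.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 2000
x0 = rng.normal(size=n)                        # a single confounder
a = rng.binomial(1, 1 / (1 + np.exp(-x0)))     # treatment depends on x0

# Stabilized inverse probability of treatment weights
ps = LogisticRegression().fit(x0[:, None], a).predict_proba(x0[:, None])[:, 1]
w = np.where(a == 1, a.mean() / ps, (1 - a.mean()) / (1 - ps))

# Survival times: protective treatment (true log HR = -0.5), confounded by x0
t = rng.exponential(scale=np.exp(0.5 * a - 0.3 * x0))
e = rng.binomial(1, 0.8, size=n)               # some toy censoring

df = pd.DataFrame({"T": t, "E": e, "treated": a, "w": w})

# Weighted Cox fit; robust (sandwich) variance to account for the weights
cph = CoxPHFitter()
cph.fit(df, duration_col="T", event_col="E", weights_col="w", robust=True)
cph.print_summary()                            # coef for 'treated' near -0.5
```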

3 Likes

A lot of things are easy to explain when they are oversimplified and completely cover up their assumptions (or, as with matching and weighting, inflate the variance of the estimator of interest). We don't need to be able to explain a method simply, any more than biochemists should be called on to simply explain a complex biochemical pathway. But we do need to be able to explain the interpretation of the key results arising from using the methods, and to check the methods' assumptions.

3 Likes

Thanks, Frank. I don't disagree, and I don't have an evidence-backed argument, only a hunch: in some extremely high-profile scenarios, statisticians will need to explain their methods just as biochemists will need to explain biochemical pathways.
Take the COVID vaccine, for example.
Communication was crucial to engage the public and encourage cooperation.
Biochemists explaining the vaccine mechanism was therefore important to build trust.
Displaying cumulative incidence curves (rather than survival curves) in the Pfizer vaccine trial was simpler to communicate.
And proving nationwide effectiveness had to be just as easily communicated; I think they made a great choice in using simple (coarsened) exact matching rather than a fancier analytical method (and I'm positive the tradeoff was easier given they had a dataset large enough to make the inflated variance a somewhat negligible issue). But I heard the authors explain the methodology on the main evening news program; you can't do that with an ordinal longitudinal Markov model :upside_down_face:
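
For anyone unfamiliar, the gist of coarsened exact matching is easy to sketch (my own toy illustration in Python, with invented variables; not the vaccine study's actual pipeline):

```python
# Toy sketch of coarsened exact matching (CEM): coarsen covariates into bins,
# then keep only the strata that contain both exposure groups.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 10_000
df = pd.DataFrame({
    "age": rng.integers(16, 90, size=n),
    "sex": rng.integers(0, 2, size=n),
    "vaccinated": rng.binomial(1, 0.6, size=n),
})

# 1. Coarsen the continuous covariate into a few bins
df["age_bin"] = pd.cut(df["age"], bins=[15, 30, 45, 60, 75, 90])

# 2. Exact-match on the coarsened strata: a stratum is usable only if it
#    contains both vaccinated and unvaccinated people
n_groups = df.groupby(["age_bin", "sex"], observed=True)["vaccinated"].transform("nunique")
matched = df[n_groups == 2]
print(f"Kept {len(matched)} of {n} after CEM")
```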

2 Likes

Great points. On the small part of this related to showing survival curves vs. cumulative incidence curves, I personally think that cumulative incidence curves should almost always be chosen. For one reason, journals let you choose the y-axis limits for cumulative incidence, whereas for survival curves they often force you to use the full [0, 1] scale.
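
For example, with lifelines you can plot the cumulative incidence, 1 - S(t), directly and keep whatever y-axis limits make the curves readable (a toy sketch with simulated data):

```python
# Toy sketch: Kaplan-Meier fit plotted as cumulative incidence, 1 - S(t).
import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(4)
t = rng.exponential(scale=20, size=500)   # simulated event times
e = rng.binomial(1, 0.7, size=500)        # 1 = event observed, 0 = censored

kmf = KaplanMeierFitter()
kmf.fit(t, event_observed=e)

ax = kmf.plot_cumulative_density()        # 1 - S(t); y-axis need not span [0, 1]
ax.set_xlabel("Days")
ax.set_ylabel("Cumulative incidence")
plt.show()
```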

2 Likes