Propensity score matching vs multivariable regression

@albertoca, instead of “screwed,” I would say, “In trouble no matter what if there’s no overlap between the treatment and control groups.” The huge confidence bands that @f2harrell mentioned are the “fault” of the underlying lack of overlap, not of PS methodology in and of itself. As he mentioned, you can still technically carry out the analysis if you are willing to make and rely on some strong modeling assumptions. I’m not usually working in areas where I’m comfortable doing that, though. So if your treatment and control groups really are so different from each other, it would probably be good to step back and ask whether the dataset you have is really able to help you answer the question you’re interested in.


For example, imagine that I have 8 key pre-specified confounding factors plus the main effect I want to measure (the therapeutic effect), 9 parameters in total. At roughly 5 events per parameter that calls for 5 × 9 = 45 events, which is what the database has. So the idea would be to spend the effective sample size on the covariates, and then add one more covariable, the spline function of the logit of the PS? Is that it? Doesn’t the regression become unstable when you put that last variable in?
What is still difficult for me to grasp is whether or not to exclude the deciles that lack adequate overlap on the PS, as opposed to, for example, placing an interaction between the therapeutic effect and the spline of the logit of the PS. From the last comment, I understand that the latter is worse. The assumption of no interaction between the PS and the therapeutic effect is difficult to accept on many occasions; in fact, the starting hypothesis is often that such variations of effect should exist.

Sorry about the term “screwed.” Since I am Spanish, I was not sure whether this English term is considered rude or impolite. I agree that it is very important that we ask ourselves, honestly and critically, whether the kind of question we are asking is possible to answer or not. However, it is true that this is sometimes a matter of degree, and weighing it is difficult. Thank you for your comment.


Don’t think quantiles (e.g. deciles) when thinking about overlap. Think absolute sample size. Overlap for example might be defined as regions with ≥ 20 observations in both groups. But your sample size is perhaps too small to know about overlap.
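
For illustration, here is a minimal sketch of judging overlap by absolute counts rather than quantiles, using simulated data in Python/pandas; every column name here is hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Toy data standing in for estimated propensity scores and a treatment flag.
df = pd.DataFrame({"treated": rng.integers(0, 2, n)})
# Hypothetical PS values; treated units shifted upward to mimic imbalance.
df["ps"] = np.clip(rng.beta(2, 4, n) + 0.25 * df["treated"], 0.01, 0.99)

# Judge overlap by absolute sample size: a PS region is usable only if
# BOTH groups contribute enough observations (here >= 20).
df["ps_bin"] = pd.cut(df["ps"], np.linspace(0, 1, 11), include_lowest=True)
counts = (df.groupby(["ps_bin", "treated"], observed=False)
            .size().unstack(fill_value=0))
counts["adequate_overlap"] = counts.min(axis=1) >= 20
print(counts)
```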

“to place an interaction” is not clear. This is an additive model for now.

For 45 events you are sort of in trouble no matter what. I would use clinical expertise to pre-specify two covariates and put the rest in PS.
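
As a rough sketch of that strategy (not a definitive recipe), assuming Python with statsmodels and simulated data: two pre-specified covariates stay in the outcome model, and the remaining confounders are absorbed into a spline of the logit of the PS. All variable names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400

# Hypothetical data: two pre-specified clinical covariates (age, sex)
# and six further confounders x1..x6 to be absorbed into the PS.
df = pd.DataFrame(rng.normal(size=(n, 6)),
                  columns=[f"x{i}" for i in range(1, 7)])
df["age"] = rng.normal(60, 10, n)
df["sex"] = rng.integers(0, 2, n)
lin = 0.3 * df["x1"] - 0.2 * df["x2"] + 0.02 * (df["age"] - 60)
df["treat"] = rng.binomial(1, 1 / (1 + np.exp(-lin)))
df["y"] = rng.binomial(
    1, 1 / (1 + np.exp(-(-1 + 0.5 * df["treat"] + 0.4 * df["x1"]))))

# Step 1: PS model on the remaining confounders only.
ps = smf.logit("treat ~ x1 + x2 + x3 + x4 + x5 + x6",
               data=df).fit(disp=0).predict(df).clip(0.01, 0.99)
df["logit_ps"] = np.log(ps / (1 - ps))

# Step 2: outcome model = treatment + two pre-specified covariates
# + natural cubic spline of logit(PS) (patsy's cr() basis).
out = smf.logit("y ~ treat + age + sex + cr(logit_ps, df=4)",
                data=df).fit(disp=0)
print(out.summary())
```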


@albertoca, no need to apologize! I didn’t use “screwed” in my response because it’s not a word I usually use in professional settings, but there are other native English speakers who do, and I did not think you were being rude or impolite.


This has been coming up quite a bit. I wish there were some direct references to address it.

This paper in JACC helps; do you have others?


https://hbiostat.org/bib/propensity.html


So the question is: why is PSM so popular even in studies with a sufficient sample size? For my taste it generates very ugly articles. PSM is awkward to explain in the methods section, and most of the time the authors do not explain exactly how they did it. Worse still, observational studies often evaluate multiple endpoints, with 4 or 5 different PSMs built on arbitrary criteria that are never fully explained. And when authors do try to explain in detail, the result tends to be convoluted and difficult to read. I do not like this technique very much, unless it has a justified purpose, as explained before.


PS matching is easy to explain and has the appeal of reporting events for each of the two groups. It can be presented like an RCT. I think this is where it derives its fanfare.

Totally agree about the myriad arbitrary choices to be made in the process.


The fact that many readers feel they are getting a synthetic clinical trial is one reason not to use matching. But the biggest reason not to use matching is that, with the matching algorithms commonly used, excluding observations makes the analysis neither reproducible nor scientific. Less controversially, matching is simply inefficient because it lowers the sample size: most matching algorithms exclude huge numbers of good matches (comparable patients).


I totally agree with you, but I also insist on the aesthetic issue. Papers based on PSM techniques, including my own, are invariably ugly and not very credible.


I’m working on a small cohort study (total sample size ~1500) where the treatment and control groups are mostly similar on measurable characteristics (by design, via inclusion/exclusion criteria). There are a few key differences in baseline characteristics to control for (e.g., family history of the outcome of interest), and I need to decide between PS-based IPW and direct covariate adjustment. There is a lot of overlap between groups, so PS weighting is fine; the number of covariates to adjust for is relatively small, so adjustment is fine too.

I understand that we need to decide whether a marginal or a conditional hazard ratio is of interest, but my concern is power: I would have to account for estimation of the PS in the standard errors of my regression coefficient estimates. Covariate adjustment seems like the best approach, but I need a good reference to make my case (or I have to run a simulation study). Other than Elze et al.’s paper (Comparison of Propensity Score Methods and Covariate Adjustment: Evaluation in 4 Cardiovascular Studies - ScienceDirect), which discusses precision in the context of extreme weights (not a problem in this study), is there another reference with a direct comparison of power/standard errors between PS-based weighting and covariate adjustment? I read through the reference list that Dr. Harrell posted and nothing stood out to me. Thanks so much for this discussion.

There are some useful references in the link I posted above. Direct covariate adjustment, possibly including a spline of the logit of the PS if you have more variables to adjust for, is expected to work better. This also gives you proper conditioning, i.e., a conditional rather than marginal hazard ratio. I can’t imagine a situation where I’d want a marginal HR, and marginal models tend to have more non-proportional hazards with regard to the exposure variable.
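
For concreteness, a minimal sketch of direct adjustment with a spline of the logit of the PS in a Cox model, assuming the statsmodels, patsy, and lifelines packages and simulated data; every variable name here is hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from patsy import dmatrix
from lifelines import CoxPHFitter

rng = np.random.default_rng(2)
n = 1500

# Hypothetical cohort: a key baseline covariate (family history), age,
# confounded treatment, exponential survival times with censoring.
df = pd.DataFrame({"famhx": rng.integers(0, 2, n),
                   "age": rng.normal(55, 8, n)})
lin = 0.8 * df["famhx"] + 0.03 * (df["age"] - 55)
df["treat"] = rng.binomial(1, 1 / (1 + np.exp(-lin)))
haz = 0.02 * np.exp(-0.4 * df["treat"] + 0.5 * df["famhx"])
t = rng.exponential(1 / haz)
c = rng.exponential(60, n)
df["time"] = np.minimum(t, c)
df["event"] = (t <= c).astype(int)

# PS model, then a natural cubic spline basis for logit(PS) via patsy.
ps = (smf.logit("treat ~ famhx + age", data=df).fit(disp=0)
         .predict(df).clip(0.01, 0.99))
df["logit_ps"] = np.log(ps / (1 - ps))
spline = dmatrix("cr(logit_ps, df=4) - 1", df, return_type="dataframe")
spline.columns = [f"ps_s{i}" for i in range(spline.shape[1])]

# Conditional HR from direct adjustment:
# treatment + pre-specified covariates + spline of logit(PS).
fit_df = pd.concat([df[["time", "event", "treat", "famhx", "age"]],
                    spline], axis=1)
cph = CoxPHFitter().fit(fit_df, duration_col="time", event_col="event")
print(cph.summary[["coef", "se(coef)", "p"]])
```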

Thank you, this post and discussion are helpful. Could you perhaps comment on the appropriateness (or not) of a log-binomial regression model with propensity-score-based standardized mortality ratio weighting in a retrospective, observational study? Thank you!

I’m not sure of why a log-binomial model would fit the data, nor of the usefulness of standardized mortality ratios. A log link restricts regression effects in order for probabilities to stay in [0,1], and this creates false interactions among risk factors just to achieve this restriction. It’s more common to use a link that places no restrictions, i.e., to use things like odds ratios, hazard ratios, and survival time ratios (the latter in an accelerated failure time model). You can translate the final results into any metric you want, e.g., a covariate-specific risk ratio.
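
To illustrate the last point, a small sketch with simulated data and hypothetical names: fit on the unrestricted logit scale, then translate the result into a covariate-specific risk ratio by predicting risks at a chosen covariate pattern.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 1000
df = pd.DataFrame({"x": rng.normal(size=n), "treat": rng.integers(0, 2, n)})
df["y"] = rng.binomial(
    1, 1 / (1 + np.exp(-(-1.5 + 0.7 * df["treat"] + 0.8 * df["x"]))))

# Fit on the unrestricted logit scale ...
fit = smf.logit("y ~ treat + x", data=df).fit(disp=0)

# ... then translate into a covariate-specific risk ratio by predicting
# risk under each treatment level at a chosen covariate value (x = 1).
pattern = pd.DataFrame({"treat": [0, 1], "x": [1.0, 1.0]})
risk = fit.predict(pattern)
print("risk ratio at x = 1:", risk[1] / risk[0])
```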


OK, thank you, that is helpful. Yes, risk ratios were used: unadjusted, adjusted for mortality, and adjusted for propensity scores. Perhaps it was done correctly then. Also, I checked the 95% CIs for the RRs and they were calculated correctly, as I understand the CIs were pre-determined.

The question is whether the model assumptions are satisfied.


I just read through this post and it was really very helpful, Frank, as I believe it solved a dilemma I have had for some time now.

The issue I had was this: if I run a logistic regression to estimate the effect of a treatment on an outcome using a PS weight and compare this with regression adjustment, I rarely get the same OR for the treatment effect, even when all variables are perfectly balanced through IPTW weighting. The regression adjustment was evidently the correct result, as I tested this with a single covariate (all variables binary): each time, the PS weight perfectly balanced the third variable, yet the ORs from regression adjustment and from PS weighting remained different. I attributed this to the weights removing confounding but not the systematic error induced by the competing variables, as they are mostly prognostic for the outcome. I then tested this with a dataset pre-balanced on the third variable, ran regression adjustment, and as expected the marginal and adjusted ORs differed. Finally, I used your advice today and adjusted only for the logit of the PS, and all the ORs are now the same with either the logit-PS adjustment or individual covariate adjustment.

So would you agree that PS weights balance variables nicely but fail to remove the effect of competing variables on the outcome?


Yes. Adjusting for PS, whether by regression on logit PS, stratification, matching, or weighting, only addresses confounding and has nothing to do with proper modeling of outcome heterogeneity. That’s why you add pre-specified free-standing prognostic variables as covariates when adjusting for PS (regression or stratification adjustment).


I would bet the discrepancy is not due to some inherent failure of PS weighting but rather to the fact that the two approaches are targeting different estimands. PS weighting targets the marginal ATE (which can be measured on the OR scale), and the treatment coefficient in a covariate-adjusted logistic regression targets the conditional treatment effect (and assumes it is constant on the OR scale). If you use g-computation to estimate the marginal ATE from your logistic regression model, you will likely see that PS weighting and g-computation yield approximately equal results, with greater precision from the g-computation estimates. You will need to find a way to compute the true marginal OR, though, as this is not immediately apparent from a simulation design like yours.

The benefit of the marginal ATE is that it is valid for a population regardless of whether effect modification is present, and PS weighting is agnostic to heterogeneity in the treatment effect. PS weighting and g-computation are both valid estimators of the marginal ATE (and they differ primarily in their modelling assumptions). The conditional effect can only be estimated with covariate-adjusted logistic regression or subgroup analysis. Some argue that reporting the coefficient on the treatment in a covariate-adjusted logistic regression makes the unwarranted assumption that the OR of the treatment is constant across strata of the covariates and that only the marginal ATE can be reported without assumptions about heterogeneity. Of course, if there is heterogeneity, the marginal ATE will differ based on which population is being studied.

All this is to say that discussions comparing the empirical performance of regression and propensity scores need to ensure the same estimand is being targeted and the same assumptions about effect heterogeneity are being invoked before the methods can be compared. In this discussion I have not seen that clarification, and it likely explains any large differences observed between the results of the two methods.
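
To make the estimand point concrete, here is a minimal simulation sketch (Python/statsmodels, all names hypothetical) comparing the conditional OR from covariate adjustment with the marginal OR obtained by g-computation and by IPW. The two marginal estimates should roughly agree with each other while differing from the conditional OR, even though nothing has gone wrong with confounding control:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 5000

# Prognostic covariate, confounded binary treatment, logistic outcome.
df = pd.DataFrame({"x": rng.normal(size=n)})
df["treat"] = rng.binomial(1, 1 / (1 + np.exp(-0.5 * df["x"])))
df["y"] = rng.binomial(
    1, 1 / (1 + np.exp(-(-1 + 0.8 * df["treat"] + 1.2 * df["x"]))))

# Conditional OR: treatment coefficient from the adjusted model.
fit = smf.logit("y ~ treat + x", data=df).fit(disp=0)
print("conditional OR:", np.exp(fit.params["treat"]))

# Marginal OR via g-computation: predict everyone under treat=1 and
# treat=0, average the risks, then form the OR of the average risks.
p1 = fit.predict(df.assign(treat=1)).mean()
p0 = fit.predict(df.assign(treat=0)).mean()
print("marginal OR (g-computation):", (p1 / (1 - p1)) / (p0 / (1 - p0)))

# Marginal OR via IPW: weighted logistic regression of y on treatment
# alone, with ATE weights 1/PS (treated) and 1/(1-PS) (controls).
ps = smf.logit("treat ~ x", data=df).fit(disp=0).predict(df)
w = np.where(df["treat"] == 1, 1 / ps, 1 / (1 - ps))
ipw = smf.glm("y ~ treat", data=df, freq_weights=w,
              family=sm.families.Binomial()).fit()
print("marginal OR (IPW):", np.exp(ipw.params["treat"]))
```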
