Overlap weights for small sample RCTs

A brief background: causal inference can be done with either a correctly specified treatment model (propensity score) or a correctly specified outcome model (typically regression). Doubly robust methods combine the two, so the estimate stays consistent if either model is correctly specified and is most efficient when both are. In randomized controlled trials, the treatment model is always correctly specified due to randomization, although chance baseline imbalances can reduce efficiency. Also in randomized controlled trials, the correctly specified outcome model is never truly known, although @f2harrell has excellent material on model flexibility to account for this.

With that background, I wanted to share the following article I recently read and get others’ thoughts on it:
Zeng S, Li F, Wang R, Li F. Propensity score weighting for covariate adjustment in randomized clinical trials. Statistics in medicine. 2021 Feb 20;40(4):842-58.

In their simulations, there seems to be a benefit of using overlap weights compared to a correctly specified regression for small sample RCTs in several settings. Although regression and other methods (and a larger sample size!) would still be needed for complex situations like conditional treatment effects and accounting for time-dependencies, it seems to me like overlap weights can be a useful tool for marginal treatment effects. Are there any downsides to using them in small sample RCTs?

I can’t imagine using propensity score in the same sentence as randomized trial but I hope someone has time to read the article and tell us more.

Here is what I gathered from reading it: the overlap weights are proportional to the probability of being assigned to the opposite group. In an RCT, the true propensity score is determined by randomization, but applying overlap weights creates exact balance so there are no chance imbalances in (observed) baseline covariates between groups. The treatment effect is then determined by group difference in weighted means. Overall it seems to be relatively comparable to a correctly specified linear regression model of y ~ x + zi + x*zi where i=10 covariates and y is a continuous outcome in simulation, and the two converge asymptotically. The benefits are under imbalanced assignment or in small samples (<150 in this case) with model misspecification (missing interaction term). The case where linear regression is more efficient is when there is effect heterogeneity and a correctly specified model. There is also supplementary material on binary outcomes that I did not get into. More simulations are definitely still needed to explore other, more complex scenarios, and I’m sure flexible regression would be more informative for something like repeated measures with a nonlinear effect. However, many studies still use a t-test and overlap weights would always be more efficient than that.
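To make the mechanics concrete, here is a minimal sketch of overlap weighting on simulated trial data. It is not the authors' code: the helper names (`fit_logistic`, `overlap_weighted_estimate`), the sample size, and the data-generating model are all assumptions for illustration. It assumes the propensity score is estimated by a main-effects logistic regression with an intercept, which is the setting where the exact-balance property holds.

```python
import numpy as np

def fit_logistic(X, z, n_iter=25):
    """Logistic regression by Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (z - p)                       # score vector
        hess = (X * (p * (1 - p))[:, None]).T @ X  # observed information
        beta += np.linalg.solve(hess, grad)
    return beta

def overlap_weighted_estimate(y, z, covariates):
    """Difference in overlap-weighted means (hypothetical helper)."""
    X = np.column_stack([np.ones(len(z)), covariates])
    e = 1.0 / (1.0 + np.exp(-X @ fit_logistic(X, z)))  # estimated propensity score
    # Each unit is weighted by the probability of assignment to the opposite arm.
    w = np.where(z == 1, 1.0 - e, e)
    mean1 = np.sum(w * z * y) / np.sum(w * z)
    mean0 = np.sum(w * (1 - z) * y) / np.sum(w * (1 - z))
    return mean1 - mean0, w

# Simulated small trial (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n = 60
covariates = rng.normal(size=(n, 2))
z = rng.binomial(1, 0.5, size=n)
y = 1.0 * z + covariates @ np.array([2.0, -1.0]) + rng.normal(size=n)

tau_hat, w = overlap_weighted_estimate(y, z, covariates)

treated, control = z == 1, z == 0
raw_gap = covariates[treated].mean(axis=0) - covariates[control].mean(axis=0)
weighted_gap = (np.average(covariates[treated], axis=0, weights=w[treated])
                - np.average(covariates[control], axis=0, weights=w[control]))
print("raw covariate gap:     ", raw_gap)
print("weighted covariate gap:", weighted_gap)  # zero up to numerical error
print("overlap-weighted effect estimate:", tau_hat)
```

The weighted gap vanishes because the logistic score equations force the weighted covariate sums to match across arms. The point estimate alone is not enough for inference; the variance still has to account for the estimated weights.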


Thanks for that. It doesn’t make sense to me. The analogy is covariate adjusting for something exhibiting a chance imbalance in an RCT. This makes the inverse of the sums of squares and cross-products matrix X’X have off-diagonal terms (slight collinearities) which cause inflation of the standard error of the treatment effect. Adjustment should be pre-specified and based on outcome predictors.
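The matrix-algebra point can be checked on a toy design. This is only a sketch with made-up data: `treatment_variance_factor` is a hypothetical helper, and the two covariate vectors are constructed by hand so that one is perfectly balanced across arms and the other is shifted in the treated arm. It covers only the design term in the variance of the treatment coefficient, not the residual variance that multiplies it.

```python
import numpy as np

def treatment_variance_factor(z, x):
    """Diagonal element of (X'X)^{-1} for the treatment column (hypothetical helper)."""
    X = np.column_stack([np.ones(len(z)), z, x])
    return np.linalg.inv(X.T @ X)[1, 1]

z = np.repeat([0.0, 1.0], 10)             # 10 control, 10 treated
x_balanced = np.tile(np.arange(10.0), 2)  # identical covariate values in both arms
x_imbalanced = x_balanced + 2.0 * z       # treated arm shifted up by 2

f_bal = treatment_variance_factor(z, x_balanced)    # 1 / sum((z - mean(z))^2) = 0.2
f_imb = treatment_variance_factor(z, x_imbalanced)
r = np.corrcoef(z, x_imbalanced)[0, 1]

print("balanced factor:  ", f_bal)
print("imbalanced factor:", f_imb)
print("inflation 1/(1 - r^2):", 1.0 / (1.0 - r**2))
```

With a single covariate, the factor is inflated by exactly 1/(1 - r^2), where r is the correlation between the treatment indicator and the covariate. Whether that inflation is outweighed by a smaller residual variance depends on how prognostic the covariate is.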

Yes, but they are using M-estimation theory and a sandwich estimator for the variance in the weighted effect. Someone smarter than me would have to comment on that though.

Thinking more about it, I think a better analogy would be sampling weights in survey data to model the target population, which in this case is a large RCT. The overlap weights are modeling the design phase rather than the analysis phase. The analysis phase is a difference in means so it does not actually condition on the covariates. Difference in means and regression converge asymptotically so it makes sense that the overlap weighting and regression are relatively similar.

I don’t get that either. Randomized trials (other than infamous ones such as the one done in Tuskegee, Alabama) use volunteers so there is no connection with a target population.

The target population would be an idealized version of the randomized trial that is either large enough or lucky enough to have balance in every measured covariate.

That’s true, but since we don’t sample from that population, sampling theory breaks down in RCTs, so I’m not sure of the relevance.

Yes, it’s just an analogy so it’s not completely accurate. I’ve quoted another article that hopefully explains it better. It uses inverse probability weighting but the same principles apply.

"we argue that propensity scores are a useful tool for the analysis of individually randomised controlled trials. This does, however, require a change in perspective. Rather than viewing the propensity score as a method of bias reduction as for non-randomised studies, we will view the propensity score analysis as a method of covariate adjustment aimed at increasing precision of the treatment effect estimate. In order to achieve this, we will move away from the philosophy of modelling the treatment allocation process and towards the idea of modelling chance imbalance or designed balance (such as stratified randomisation) of prognostic variables between treatment groups…

we apply our variance results to show that for continuous outcomes, the propensity score estimator has similar statistical properties to the covariate-adjusted (linear regression) estimator, but that in order to achieve comparable precision to the covariate-adjusted estimator, the standard error must be correctly estimated; naive estimators of the standard error can greatly understate the precision of the propensity score estimator…

we note that the estimation of the propensity score is not a source of variation in the usual sense. In a simple randomised trial, the true propensity score is 0.5 for all participants; thus, using this in inverse weights will result in unadjusted estimators of treatment effect. Inverse weighting using the true propensity score does not, therefore, take account of chance imbalance of prognostic baseline characteristics. The variability in the estimated propensity score simply reflects chance imbalances in the prognostic baseline characteristics that are included in the propensity score model across treatment groups. The increased balance of these covariates caused by the weighting using these estimated propensity scores (in comparison with using the true propensity score, i.e. calculating unadjusted estimates) results in increased precision of the treatment effect estimate. Failure to account for the estimation of the propensity score will not take into account the decrease in precision obtained by the re-balancing of prognostic characteristics created by weighting using the estimated (rather than true) propensity score; thus, the precision will be falsely decreased…

If we instead estimate the IPTW treatment effect by fitting a probability-weighted outcome regression model as described previously, the same considerations apply. Standard robust or empirical sandwich variance estimators for the estimated regression parameters typically treat the probability weights—which in this case are functions of the estimated propensity scores—as known values. To obtain the correct variance for the estimated regression parameters from these probability-weighted models, the full sandwich variance estimator based on all the estimating equations, including the components estimating the propensity score, involved in the estimation process should be used. This is our sample variance estimator given in Equation (4)."

Equation 4: [equation image from the original post; not reproduced here]

Williamson EJ, Forbes A, White IR. Variance reduction in randomised trials by inverse probability weighting using the propensity score. Stat Med. 2014 Feb 28;33(5):721-37. doi: 10.1002/sim.5991. Epub 2013 Sep 30. PMID: 24114884; PMCID: PMC4285308.
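The quoted remark that weighting by the true propensity score of 0.5 gives back the unadjusted estimate is easy to verify numerically. The sketch below uses simulated data and a hypothetical helper, `ipw_difference`, and it assumes the normalized (Hajek) form of the inverse probability weighted estimator, which may differ from the exact form used in the paper.

```python
import numpy as np

def ipw_difference(y, z, e):
    """Normalized (Hajek) inverse probability weighted difference in means."""
    w1 = z / e
    w0 = (1 - z) / (1 - e)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

# Simulated simple randomized trial (illustrative values only).
rng = np.random.default_rng(1)
n = 40
x = rng.normal(size=n)
z = rng.binomial(1, 0.5, size=n)
y = 0.5 * z + x + rng.normal(size=n)

true_ps = np.full(n, 0.5)                     # known from the randomization scheme
ipw_true = ipw_difference(y, z, true_ps)
unadjusted = y[z == 1].mean() - y[z == 0].mean()

print("IPW with true propensity score:", ipw_true)
print("unadjusted difference in means:", unadjusted)
```

Constant weights cancel in the normalized estimator, so the two numbers agree. Any adjustment for chance imbalance therefore has to come from the estimated propensity score, which is the quoted authors' argument.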

This is incorrect. Computing the propensity score (PS) from randomized data fits noise by definition, and finds baseline variables that are apparently maximally collinear with treatment. This creates a covariance matrix where the off diagonal terms are maximized in absolute value. This detracts from the precision of the treatment effect as can be seen in the matrix algebra of (X'X)^{-1} for the data design matrix X where columns of X are dubiously chosen by being unbalanced with regard to the first column (the treatment indicator).

Improving precision of the treatment effect in an RCT comes from adjusting for prognostic variables that explain outcome heterogeneity, so that you look at the data with a clearer lens and make the distribution of Y conditional instead of representing a mixture of distributions.
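As a rough illustration of the "clearer lens" idea, the sketch below compares the standard error of the treatment coefficient with and without adjustment for one strongly prognostic covariate. The data are simulated, `ols_treatment_se` is a hypothetical helper, and the effect sizes are arbitrary assumptions chosen to make the contrast visible.

```python
import numpy as np

def ols_treatment_se(y, X, col=1):
    """OLS estimate and model-based standard error for column `col` of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])  # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[col], np.sqrt(cov[col, col])

# Simulated trial: x is strongly prognostic for y (illustrative values only).
rng = np.random.default_rng(2)
n = 100
z = rng.binomial(1, 0.5, size=n).astype(float)
x = rng.normal(size=n)
y = 1.0 * z + 3.0 * x + rng.normal(size=n)

ones = np.ones(n)
est_unadj, se_unadj = ols_treatment_se(y, np.column_stack([ones, z]))
est_adj, se_adj = ols_treatment_se(y, np.column_stack([ones, z, x]))

print("unadjusted: estimate, SE =", est_unadj, se_unadj)
print("adjusted:   estimate, SE =", est_adj, se_adj)
```

The gain comes from the drop in residual variance once x explains part of the outcome heterogeneity; a covariate unrelated to the outcome would not produce it.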

So if I understand your point correctly, weighting would make sense only if using a priori prognostic variables similar to regression rather than fitting all baseline covariates?

I don’t think either paper argues against that and the simulations used in both papers only include prognostic covariates. I’m not sure if this makes a difference, but they also note: “Failure to account for the estimation of the propensity score will not take into account the decrease in precision obtained by the re-balancing of prognostic characteristics created by weighting using the estimated (rather than true) propensity score; thus, the precision will be falsely decreased… To obtain the correct variance for the estimated regression parameters from these probability-weighted models, the full sandwich variance estimator based on all the estimating equations, including the components estimating the propensity score, involved in the estimation process should be used.”

Regardless, I agree variable selection is important, but I don’t see it as a reason against using overlap weighting. To me, it seems relatively comparable to regression, so the choice of one or the other depends on the specific case where one is better suited.

No, weighting should be used only in observational studies or in randomized studies with time-dependent confounding such as adherer analysis. Even in observational studies, weighting, like matching, is inefficient.

I’m sorry, but can you provide data to substantiate that overlap weighting is inefficient compared to regression in RCTs or show me where I’m missing that in the Zeng article? As far as I’m aware, the Zeng article is the only published simulation on this particular method and it suggests that although inverse probability weighting is less efficient than regression in small samples, overlap weighting is comparably efficient to regression in small samples, and in large samples all three are comparably efficient.

You continue to confuse a randomized study with an observational one.

I am unsure how I am confusing a randomized study and an observational one when the simulation is clearly based on a randomized study. I respect that you are an expert on regression, but I suspect that is causing some cognitive dissonance/bias about this alternative method. I am nowhere near as experienced as you and am very willing to accept being wrong if you can point out specifically in the Zeng paper where the error is. This is a novel design so I do not trust prior experiences with other designs and your first post implied not reading the paper.

I just can’t get over a novel method being chosen that ignores that all imbalances in an RCT are chance imbalances so are not real. Proper modeling concentrates on the only important thing: clarifying outcome heterogeneity.

I think there is some confusion. This method agrees that imbalances in an RCT are chance imbalances. It’s just an alternative way of clarifying outcome heterogeneity.

I messaged the corresponding author, Fan Li, and he graciously directed me to “Shen, C., Li, X. and Li, L. (2014), Inverse probability weighting for covariate adjustment in randomized studies. Statist. Med., 33: 555-568.” In this, the authors provide proofs that RCT covariate adjustment is similar between weighting and regression. This is quite long so I have not tried to summarize it here, but I would recommend reading it if you do not think this is the case. If we can agree on this mathematical fact, then I think we agree on the rest.

I don’t have time to read it but it still doesn’t make sense to me. Outcome heterogeneity is never characterized by comparing baseline variables between treatments. It is characterized by comparing outcomes within treatment between patients with different baseline covariate values.