Guidelines for Covariate Adjustment in RCTs

I’m susceptible to concise recommendations, so I like this paper and am tempted to share it: “Covariate adjustment for randomized controlled trials revisited”

Although it is very difficult to make general recommendations for CA in RCTs, the following suggestions are in order:

  • Take potential heterogeneity into account and fit models with interaction terms, or fit models to the data of the different treatment groups separately; although in the special case of 1:1 randomization there is no asymptotic efficiency gain. The latter approach is especially useful when a large number of covariates should be considered.
  • Report both adjusted and unadjusted ATE estimates with valid CIs, even when some might be conservative.
  • Use simulations to evaluate and compare the performance of different approaches under the most plausible scenarios, especially when the sample size is not large.
  • Do not exclude subjects because of missing covariate values. Instead, use simple approaches to impute the missing values.
  • Model/covariate selection approaches designed for big data with many covariates should be used, if appropriate for the specific scenario.
  • Be transparent about model and covariate selection processes and prespecify them as specifically as possible, but do use them when necessary for better treatment effect estimation.

Finally, it should be emphasized that practical experience plays a key role in the appropriate use of CA approaches in RCTs.

Obviously the rules don’t carry over neatly to observational studies.


I have just started to read the paper. If those are the take-home messages, this is simply an awful approach IMHO, on bullet points 1, 2, and to a lesser extent 5. Here are some of the reasons I react this way:

  • ANCOVA gains power and/or precision by explaining outcome heterogeneity by conditioning on pre-existing measurements that relate to Y. The resulting treatment effect is conditional, allowing comparisons of like with like. One should not do anything to uncondition (marginalize; average) the analysis but should make full use of the conditioning at every step.
  • Fitting models separately by treatment group results in suboptimal estimates of residual variance, is highly inefficient statistically, and results in bias. Allowing for interactions with treatment in the primary analysis, if the primary analysis is frequentist, results in a huge power loss and in effect uses a weird prior distribution for heterogeneity of treatment effect.
  • Average treatment effect (ATE) should play no role whatsoever in a randomized clinical trial. The fundamental question answered by an RCT, which is even more clear in a crossover study but is also true in a parallel group design, is “what is the expected outcome were this patient given treatment A vs. were she given treatment B”. As I quote Mitch Gail in Chapter 13 of BBR:


  • The recommendations fail to note a very simple fact. As John Tukey said you can make a great improvement in efficiency by just making up the covariate adjustment, getting the model completely wrong. His example was having clinical investigators make up out of thin air the regression coefficient for a baseline covariate. This will beat unadjusted analyses. Our enemy is inefficient unadjusted analysis and we don’t need to beat ourselves over the head worrying about model misspecification. The unadjusted analysis uses the most misspecified of all models.
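Tukey’s point can be sketched numerically. The simulation below is my own illustration, not from the paper: the true covariate coefficient, the deliberately wrong guessed coefficient (0.5 vs. a true 1.0), and all sample sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def trial_estimates(n=400, beta_x=1.0, b_guess=0.5, sigma=1.0):
    """One simulated 1:1 trial; return the unadjusted and the
    'made-up-coefficient' treatment-effect estimates."""
    t = np.repeat([0, 1], n // 2)                 # treatment assignment
    x = rng.normal(0, 2, n)                       # strong baseline covariate
    y = 2.0 * t + beta_x * x + rng.normal(0, sigma, n)
    unadj = y[t == 1].mean() - y[t == 0].mean()
    # Tukey-style adjustment: subtract a guessed (wrong) coefficient times x
    z = y - b_guess * x
    adj = z[t == 1].mean() - z[t == 0].mean()
    return unadj, adj

ests = np.array([trial_estimates() for _ in range(2000)])
print("SD of unadjusted estimate:           ", ests[:, 0].std())
print("SD of made-up-coefficient estimate:  ", ests[:, 1].std())
# The wrong guess still shrinks the residual variance, from
# beta_x^2*var(x) + sigma^2 down to (beta_x - b_guess)^2*var(x) + sigma^2,
# so the adjusted estimator is less variable than the unadjusted one.
```

Both estimators are unbiased by randomization; the made-up adjustment simply removes part of the outcome heterogeneity, which is exactly why even a badly misspecified adjustment beats no adjustment.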

Note: as shown in the Gail example in Chapter 13 of BBR, when there is a strong baseline covariate, say sex, the average treatment effect odds ratio is essentially comparing some of the males on treatment A with some of the females on treatment B. Not helpful.

To see the problem with ATE consider how this is usually improperly labeled PATE for population average treatment effect, even though it is calculated from a sample with no reference to a population. For a sample average treatment effect to equal the PATE, the sample would have to be a random draw from the population, or the analysis would have to incorporate sampling weights. RCTs do not use random sampling, and no one uses sampling weights in RCTs; instead we (1) let the RCT sample stand on its own and (2) use proper conditioning as in ANCOVA.

An exceptionally important point for clinical trialists to understand is that, as Gail stated, an ATE that conditions on no covariates and does not use sampling weights will not estimate the correct population parameter in a nonlinear model. Conditional estimates, i.e., treatment effect estimates adjusted for covariates, are the ones that are transportable to patient populations having different covariate distributions than the distributions in the RCT. And if you should have conditioned on 5 a priori important covariates and only conditioned on 4, an odds or hazard ratio for treatment will be averaged over the 5th (omitted) covariate. But unadjusted treatment effects average over all 5 covariates, a much worse problem.

Here is an example to help in understanding the value and robustness of covariate adjustment in randomized experiments. Suppose the following:

  • There are two treatments and the response variable Y is continuous
  • There is one known-to-be-important baseline covariate X, and X is related to Y in a way that is linear in log(x): a rapidly rising early phase, then greatly slowing down but never getting flat
  • Ignoring X and having a linear model containing just the treatment term (equivalent to a two-sample t test) results in R^2=0.1
  • Failing to realize that most relationships are nonlinear, the statistician adjusts for X by adding a simple linear term to the model, resulting in a model R^2=0.3
  • The relative efficiency of the unadjusted analysis to the (oversimplified) adjusted analysis is the ratio of residual variances, which is \frac{0.7\sigma^2}{0.9\sigma^2} = 0.78, implying that an adjusted analysis on 0.78n subjects will achieve the same power as the unadjusted analysis on n subjects (here \sigma^2 is the unconditional var(Y))
  • The false linearity assumption for X has resulted in underfitting, but ANCOVA is still substantially better than not adjusting for X; the unadjusted analysis has the most severe degree of underfitting
  • Avoiding underfitting by adjusting for, say, a restricted cubic spline function in X, would result in an additional improvement. For example the R^2 may increase to 0.4, further lowering the residual variance, which decreases the standard error of the treatment parameter in the model and thus narrows the confidence interval and increases power.
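The bullet points above can be checked with a small simulation. All numbers below are hypothetical (they will not reproduce the exact R² values of 0.1/0.3/0.4 in the example), and adjusting for log(x) stands in for the flexible spline fit:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
t = np.repeat([0.0, 1.0], n // 2)              # 1:1 randomization
x = rng.uniform(0.1, 10, n)                    # baseline covariate
y = 1.0 * t + 2.0 * np.log(x) + rng.normal(0, 3, n)  # true X effect is log-linear

def treatment_se(*covs):
    """OLS of y on intercept, treatment, and optional covariates;
    return the standard error of the treatment coefficient."""
    X = np.column_stack([np.ones(n), t, *covs])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - X.shape[1])
    cov = s2 * np.linalg.inv(X.T @ X)
    return np.sqrt(cov[1, 1])

se_unadj = treatment_se()            # no adjustment: worst underfitting
se_lin   = treatment_se(x)           # misspecified linear adjustment: still helps
se_log   = treatment_se(np.log(x))   # correct functional form: best
print(se_unadj, se_lin, se_log)
```

The standard errors shrink monotonically: the wrong linear term recovers most of the benefit, and the correctly specified (flexible) adjustment recovers the rest, mirroring the 0.9σ² → 0.7σ² → 0.6σ² progression of residual variances described above.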

Contrast this example with one where Y is binary and treatment effectiveness is summarized by an odds ratio when there is at least one important baseline covariate.

  • The Pearson \chi^2 or less efficient Fisher’s “exact” test for comparing two proportions have a key assumption violated: the probability that Y=1 in a given treatment group is not constant. There is unexplained heterogeneity in Y that could be easily (at least partially) explained by X.
  • The unadjusted analysis refuses to explain easily explainable outcome variation, and as a result the unadjusted treatment odds ratio tends to be closer to 1.0 than the adjusted subject-specific odds ratio.
  • The standard error of the treatment log odds ratio may increase by adjusting for X, but the treatment point estimate will increase in absolute value more than the standard error increases, resulting in an increase in power.
  • Failure to correctly model a nonlinear effect of X will result in a suboptimal analysis that is still superior to the unadjusted analysis, which has the worst goodness of fit (because of it having the most severe underfitting).
  • An adjusted analysis more closely approximates a correct answer to the question “Were this subject given treatment B what do we expect to happen as compared to were she given treatment A?” The unadjusted odds ratio is a sample-averaged odds ratio that only answers a question about a mixed batch of subjects. Its interpretation essentially amounts to asking how some subjects on treatment B with lower values of X compare to some subjects on treatment A with higher values of X. Anyone interested in precision medicine should refuse to compute unadjusted estimates.
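The attenuation of the unadjusted odds ratio toward 1.0 can be shown by direct marginalization over a covariate distribution, with no model fitting at all. The logistic model and its coefficients below are invented for illustration:

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical model: logit P(Y=1) = -0.5 + beta_t*T + 1.5*X, with X ~ N(0, 1)
beta_t = 1.0                                  # conditional (subject-specific) log OR
x = np.random.default_rng(2).normal(0, 1, 1_000_000)

p1 = expit(-0.5 + beta_t + 1.5 * x).mean()    # marginal risk under treatment
p0 = expit(-0.5 + 1.5 * x).mean()             # marginal risk under control

marginal_or = (p1 / (1 - p1)) / (p0 / (1 - p0))
print("conditional OR:", np.exp(beta_t))      # exp(1) ≈ 2.718
print("marginal OR:   ", marginal_or)         # noticeably closer to 1.0
```

Nothing is misspecified here: the marginal (unadjusted) odds ratio is genuinely smaller than the conditional one purely because unexplained outcome heterogeneity from X is averaged over, which is the non-collapsibility phenomenon the bullets describe.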

On a separate note, it is important to realize that covariate adjustment in randomized trials has nothing to do with baseline imbalance.


  1. The Well Adjusted Statistician by Stephen Senn
  2. Analysis of Covariance in Randomized Studies by Frank Harrell
  3. Statistical Issues in Drug Development by Stephen Senn
  4. Should We Ignore Covariate Imbalance

Dr. Harrell, could you expand a little on your comment about ATE? I agree that what we want to learn from an RCT is “what is the most likely outcome for a particular patient (J) if given treatment A instead of treatment B”. However, from a counterfactual perspective, that is impossible. The individual effect of A on J is not identifiable: it would require J to be treated with A and with B at the same time, which is obviously impossible. This is the fundamental problem in identifying generic causes in research, and we address it by estimating average instead of individual effects. Even in the case of crossover trials or N-of-1 trials, all we can identify is the average causal effect. In an N-of-1 trial, which seems as close as we can get to an individual effect, what we estimate is the average effect resulting from comparing the average endpoint from multiple periods on treatment A with the average endpoint from multiple periods on treatment B. In spite of your objection to the use of the ATE in RCTs, the concept is implicit in the statement that what we want is to learn the “expected” outcome if J is given A instead of B.

On the other hand, I concur with your concern about the PATE in the context of RCTs. In fact, I have never seen a PATE in this context, though I have the impression your concern about the PATE is lost in some textbooks and articles. Although PATEs make no sense in the context of clinical care, they may be useful for the formulation of public health policies, particularly in settings where the cost-benefit of a treatment in a population, as compared to a competing treatment, depends on the distribution of one or more prognostic factors.


Thanks for the thoughtful discussion Leonelo. I contend that ATE is not even the most useful concept for public health policies, unless the policies are required to be blind to characteristics of individuals, e.g., a mass one-time decision to vaccinate an entire population or to quarantine an entire state. Even for flu vaccination one might consider the elderly and children in day care as different from the rest of the population. As another example, breast cancer risks are conditioned on the sex of the person. Full conditioning on pre-existing information is the way to go in the vast majority of cases IMHO.

Your point about not being able to estimate the counterfactual is a very good one. I’d argue that in crossover studies, a gold standard for non-permanent therapies, we come as close as we need to estimating patient-specific expected effects. But the fact that the counterfactual cannot be observed outside of a crossover study should not prevent us from trying to estimate it, i.e., what would happen to a specific patient were she given treatment A vs. were the same patient given treatment B. We approximate the counterfactual using a model. You can also think of this as a pure prediction problem, blessed by randomization to be unbiased under certain conditions. Given baseline covariates X, what is our best estimate of the expected value of Y given X and treatment? We estimate two expected values for the patient, and by subtraction obtain the expected increment in Y for that patient. This increment will be independent of X if there are no interactions and the model is a linear model.
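A minimal sketch of the two-expected-values-and-subtract idea, assuming a linear model with no treatment interaction (all coefficients and covariate values below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
t = rng.integers(0, 2, n).astype(float)          # randomized treatment
x = rng.normal(50, 10, n)                        # hypothetical baseline covariate
y = 5.0 + 3.0 * t + 0.2 * x + rng.normal(0, 4, n)  # true treatment effect = 3

# Fit the linear model E[Y] = b0 + b1*T + b2*X
X = np.column_stack([np.ones(n), t, x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def expected_y(treatment, x_value):
    """Model-based expected outcome for a patient with covariate x_value."""
    return beta @ np.array([1.0, treatment, x_value])

# Estimated expected increment (treatment B vs. A) for two different patients:
inc = [expected_y(1.0, xv) - expected_y(0.0, xv) for xv in (40.0, 60.0)]
print(inc)   # identical for both patients: the increment is beta[1]
```

With no interaction and an identity link, the patient-specific increment collapses to the single treatment coefficient, as the last sentence above says; with interactions or a nonlinear link, the two predictions would differ by patient.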

Since the ATE is very problematic with binary outcomes, and in other situations where the model in the covariates must be nonlinear, such as proportional hazards models, I don’t cling to ATE as a useful concept. If sex is a very strong covariate, the ATE is akin to comparing some of the males on treatment A with some of the females on treatment B, so it is not what I need. This odd comparison is exemplified in the Mitch Gail example I quoted in the BBR chapter: if the treatment odds ratio is 9 for males and 9 for females but the overall odds ratio is 5.4, then, since 5.4 is not a weighted average of 9 and 9, the bleeding of the sex effect into the overall treatment effect estimate is apparent.
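One set of risks consistent with these odds ratios (my own reconstruction, not necessarily the exact figures in the Gail example) makes the arithmetic concrete:

```python
def odds(p):
    return p / (1 - p)

# Risks chosen so the within-sex treatment OR is 9 in both strata:
p = {("M", "ctl"): 0.5, ("M", "trt"): 0.9,   # male OR: odds(0.9)/odds(0.5) = 9
     ("F", "ctl"): 0.1, ("F", "trt"): 0.5}   # female OR: odds(0.5)/odds(0.1) = 9

# Equal numbers of men and women, 1:1 randomization -> pool by simple averaging
p_trt = (p[("M", "trt")] + p[("F", "trt")]) / 2   # 0.7
p_ctl = (p[("M", "ctl")] + p[("F", "ctl")]) / 2   # 0.3

print(odds(p_trt) / odds(p_ctl))   # 49/9 ≈ 5.44, not 9
```

The marginal odds ratio of about 5.4 sits well below the common stratum-specific value of 9, with no confounding anywhere: the sex effect has bled into the pooled treatment comparison.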


Thanks for clarifying, Dr. Harrell. I guess we are mostly on the same page, but I don’t think the concept of ATE implies being blind to individual characteristics. I believe the problem resides in when we can interpret an ATE as a causal effect. If we expect the potential outcome of a treatment to be importantly different in different individuals, then we should not calculate a marginal ATE. This is akin to the basic rule of not taking an average of things that are different. For instance, if a hormone receptor blocker has different effects in men and women, due to intrinsic sex-hormone differences, we should calculate separate ATEs for women and men. The basic problem there is that the potential outcomes for the same treatment differ by gender. As a consequence, there are two target populations, men and women, and we should account for this during the design of the RCT. On the other hand, I have yet to go through the Mitch Gail example, but it seems to me, at first glance, that this is not a problem of the ATE but a problem with the non-collapsibility of the odds ratio. If one could instead use the risk ratio, the overall ATE would be a weighted average of the ATE in men and the ATE in women, and would equal the stratum-specific risk ratios if they are the same.

Good thoughts, and this raises several points:

  • Risk ratios are hard to use because they can’t be constant over a broad risk range (e.g., a risk ratio of 2 cannot apply to someone with a baseline risk > 0.5)
  • Non-collapsibility of odds ratios is an issue here
  • ATE can be misleading even if the treatment effect is the same for men and women. Average treatment effect as considered in the causal inference world is on the absolute risk difference scale, but we consider treatment interactions on a scale for which such interactions may mathematically be absent, e.g., an odds ratio scale. The average odds ratio is almost meaningless, while the average absolute difference may not apply to anyone in the study.
  • ATE has been embraced largely because it facilitates causal inference, not because it is helpful in interpreting studies
  • Your correct belief that we should not compute ATE when there is outcome heterogeneity is ignored by all those using ATE
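The risk ratio points above can be checked with a few lines of arithmetic; the stratum risks are invented for illustration:

```python
def risk_ratio(p_trt, p_ctl):
    return p_trt / p_ctl

# Two equal-sized strata, 1:1 randomization, risk ratio = 2 in both strata
men_trt, men_ctl = 0.6, 0.3
wom_trt, wom_ctl = 0.2, 0.1

p_trt = (men_trt + wom_trt) / 2   # marginal treated risk: 0.4
p_ctl = (men_ctl + wom_ctl) / 2   # marginal control risk: 0.2
print(risk_ratio(p_trt, p_ctl))   # 2.0 -- the RR collapses to the stratum value

# But a constant RR cannot span a broad risk range: an RR of 2 applied to a
# baseline risk of 0.6 would imply a "probability" above 1
print(2 * 0.6)                    # 1.2 > 1
```

So the risk ratio is collapsible under randomization, as noted above, but it runs into the boundary problem in the first bullet as soon as baseline risks exceed 0.5.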

I fail to see why this is the case. Suppose we conduct an RCT to assess treatment A compared to treatment B (control) in patients hospitalized for disease D, with first rehospitalization for D as the endpoint. The trial lasts 2 years. Over this time period, 60% of patients in the control group were rehospitalized, as compared to 40% in the treatment A group. We conduct a Kaplan-Meier analysis, splitting time by month, and test the proportional hazards (PH) assumption. Should we expect the PH test to be significant, because the cumulative risk in the control group was >50%?
On the other hand, I recognize that most work on the ATE has focused on the absolute risk difference. However, the concept of ATE does not depend on the use of a specific effect measure (risk ratio, odds ratio, or hazard ratio). Also, when both interacting factors have an effect on the outcome, there will be interaction on at least one of the two scales (multiplicative or additive). Thus, this does not seem to be a property of the ATE.

You’ll have a hard time finding well-developed literature in which the ATE did not use the (faulty in my view) absolute risk scale. And it usually considers a binary endpoint, so time to event and the PH assumption are not relevant. Risk ratios violate basic properties of risk. Instantaneous risk ratios are different. And note that if you fail to adjust for covariates, the PH assumption is more likely to fail.

I have been reading around this topic for a while now.

A question that remains unanswered, at least to me:

I suppose there is the possibility of encountering situations where the observed covariates are distributed such that they bias the effect estimate in one direction, but the unobserved covariates are distributed such that they bias it in the opposite direction, to an extent that counterbalances the observed covariates.

When you adjust for some covariates (observed) but not others (unobserved/unknown), do you run the risk of producing more biased estimates?

Anything can happen, but we make decisions here in expectation: what is the best approach in the long run? And the whole discussion of sets of covariates has an identifiability problem, because potentially there are infinitely many covariates.

That guideline was specific to RCTs. I’d seek out anything written by Stephen Senn re randomisation, bias, balance, adjusting, etc. to reassure you.

Thank you,

In a 2005 article Senn wrote “…given that one has taken account of relevant measured prognostic covariates and has randomized, there is no point worrying about the distribution of unmeasured covariates. In the absence of actual distribution in such covariates, the distribution they would have in probability over all randomizations is the appropriate distribution to use.”

I think that’s the essence of what @f2harrell is saying
