Guidelines for Covariate Adjustment in RCTs

I’m susceptible to concise recommendations, so I like this paper and am tempted to share it: “Covariate adjustment for randomized controlled trials revisited”

Although it is very difficult to make general recommendation for CA in RCTs, the following suggestion is in order:

  • Take potential heterogeneity into account and fit models with interaction terms or to data of different treatment groups separately, although in the special case of 1:1 randomization there is no asymptotic efficiency gain [43]. The latter approach is especially useful when a large number of covariates should be considered.
  • Report both adjusted and unadjusted ATE estimates with valid CIs, even when some might be conservative.
  • Use simulations to evaluate/compare the performance of different approaches based on some most possible scenarios, especially when the sample size is not large.
  • Do not exclude subjects because of missing covariate values. Instead, use simple approaches to impute the missing values.
  • Some model/covariate selection approaches for big data with many covariates should be used, if appropriate for a specific scenario.
  • Be transparent with model and covariate selection processes and prespecify them as specific as possible, but do use them when necessary for better treatment effect estimation.

Finally, it should be emphasized that practical experiences play a key role in appropriate use of CA approaches in RCTs.

obviously the rules don’t carry over neatly to obs studies


I have just started to read the paper. If those are the take-home messages, this is simply an awful approach IMHO, on bullet points 1, 2, and to a lesser extent 5. Here are some of the reasons I react this way:

  • ANCOVA gains power and/or precision by explaining outcome heterogeneity by conditioning on pre-existing measurements that relate to Y. The resulting treatment effect is conditional, allowing comparisons of like with like. One should not do anything to uncondition (marginalize; average) the analysis but should make full use of the conditioning at every step.
  • Fitting models separately by treatment group results in suboptimal estimates of residual variance, is highly inefficient statistically, and results in bias. Allowing for interactions with treatment in the primary analysis, if the primary analysis is frequentist, results in a huge power loss and in effect uses a weird prior distribution for heterogeneity of treatment effect.
  • Average treatment effect (ATE) should play no role whatsoever in a randomized clinical trial. The fundamental question answered by an RCT, which is even more clear in a crossover study but is also true in a parallel group design, is “what is the expected outcome were this patient given treatment A vs. were she given treatment B”. As I quote Mitch Gail in Chapter 13 of BBR:


  • The recommendations fail to note a very simple fact. As John Tukey said, you can make a great improvement in efficiency by just making up the covariate adjustment, getting the model completely wrong. His example was having clinical investigators make up out of thin air the regression coefficient for a baseline covariate. This will beat unadjusted analyses. Our enemy is inefficient unadjusted analysis and we don’t need to beat ourselves over the head worrying about model misspecification. The unadjusted analysis uses the most misspecified of all models.
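Tukey's point about a made-up coefficient can be checked with a little algebra. The sketch below is an illustration only, assuming a linear data-generating model Y = delta*T + beta*X + error with X independent of treatment by randomization; the function name and all numbers are hypothetical and do not come from the thread. The analyst compares treatment groups on Y - guess*X, where guess is the invented coefficient.

```python
def outcome_variance(beta, guess, var_x, var_error):
    """Within-arm variance of Y - guess*X when Y = delta*T + beta*X + error."""
    return (beta - guess) ** 2 * var_x + var_error


# Hypothetical true slope, covariate variance, and error variance
beta, var_x, var_error = 1.0, 4.0, 1.0

# A guess of 0 is the same as not adjusting at all
unadjusted = outcome_variance(beta, 0.0, var_x, var_error)

for guess in (0.5, 1.0, 1.5, 2.5):
    adjusted = outcome_variance(beta, guess, var_x, var_error)
    # Ratios below 1 mean the made-up adjustment beats the unadjusted analysis
    print(guess, adjusted / unadjusted)
```

Under these assumptions any guess with the right sign and smaller than twice the true slope lowers the variance, so a rough clinical guess is usually enough. Only a guess that overshoots badly (2.5 here) does worse than no adjustment.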

Note: as shown in the Gail example in Chapter 13 of BBR, when there is a strong baseline covariate, say sex, the average treatment effect odds ratio is essentially comparing some of the males on treatment A with some of the females on treatment B. Not helpful.

To see the problem with ATE, consider how it is usually improperly labeled PATE, for population average treatment effect, even though it is calculated from a sample with no reference to a population. For a sample averaged treatment effect to equal PATE, the sample would have to be a random draw from the population, or the analysis would have to incorporate sampling weights. RCTs do not use random sampling, and no one uses sampling weights in RCTs; instead we (1) let the RCT sample stand on its own and (2) use proper conditioning as in ANCOVA.

An exceptionally important point for clinical trialists to understand is that, as Gail stated, an ATE that conditions on no covariates and does not use sampling weights will not estimate the correct population parameter in a nonlinear model. Conditional estimates, i.e., treatment effect estimates adjusted for covariates, are the ones that are transportable to patient populations having different covariate distributions than the distributions in the RCT. And if you should have conditioned on 5 a priori important covariates and only conditioned on 4, an odds or hazard ratio for treatment will be averaged over the 5th (omitted) covariate. But unadjusted treatment effects average over all 5 covariates, a much worse problem.

Here is an example to help in understanding the value and robustness of covariate adjustment in randomized experiments. Suppose the following:

  • There are two treatments and the response variable Y is continuous
  • There is one known-to-be-important baseline covariate X, and X is related to Y in a way that is linear in log(x): a rapidly rising early phase, then greatly slowing down but never becoming flat
  • Ignoring X and having a linear model containing just the treatment term (equivalent to a two-sample t test) results in R^2=0.1
  • Failing to realize that most relationships are nonlinear, the statistician adjusts for X by adding a simple linear term to the model, resulting in a model R^2=0.3
  • The relative efficiency of the unadjusted analysis to the (oversimplified) adjusted analysis is the ratio of residual variances, which is \frac{0.7\sigma^2}{0.9\sigma^2} = 0.78, implying that an adjusted analysis on 0.78n subjects will achieve the same power as the unadjusted analysis on n subjects (here \sigma^2 is the unconditional var(Y))
  • The false linearity assumption for X has resulted in underfitting, but ANCOVA is still substantially better than not adjusting for X; the unadjusted analysis has the most severe degree of underfitting
  • Avoiding underfitting by adjusting for, say, a restricted cubic spline function in X, would result in an additional improvement. For example the R^2 may increase to 0.4, further lowering the residual variance, which decreases the standard error of the treatment parameter in the model and thus narrows the confidence interval and increases power.
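The arithmetic in this example can be reproduced in a few lines. This is a minimal sketch using only the R^2 values quoted in the bullets; the function name is made up for illustration.

```python
def relative_efficiency(r2_unadjusted, r2_adjusted):
    """Ratio of residual variances: adjusted model over unadjusted model.

    Both residual variances are (1 - R^2) times the unconditional var(Y),
    so var(Y) cancels out of the ratio.
    """
    return (1.0 - r2_adjusted) / (1.0 - r2_unadjusted)


linear = relative_efficiency(0.1, 0.3)  # linear adjustment: 0.7 / 0.9
spline = relative_efficiency(0.1, 0.4)  # spline adjustment: 0.6 / 0.9

print(round(linear, 2), round(spline, 2))  # prints 0.78 0.67
```

If the spline fit does reach R^2 = 0.4, the adjusted analysis would need only about two thirds of the subjects to match the power of the unadjusted analysis.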

Contrast this example with one where Y is binary and treatment effectiveness is summarized by an odds ratio when there is at least one important baseline covariate.

  • The Pearson \chi^2 or less efficient Fisher’s “exact” test for comparing two proportions have a key assumption violated: the probability that Y=1 in a given treatment group is not constant. There is unexplained heterogeneity in Y that could be easily (at least partially) explained by X.
  • The unadjusted analysis refuses to explain easily explainable outcome variation, and as a result the unadjusted treatment odds ratio tends to be closer to 1.0 than the adjusted subject-specific odds ratio.
  • The standard error of the treatment log odds ratio may increase by adjusting for X, but the treatment point estimate will increase in absolute value by more than the standard error increases, resulting in an increase in power.
  • Failure to correctly model a nonlinear effect of X will result in a suboptimal analysis that is still superior to the unadjusted analysis, which has the worst goodness of fit (because of it having the most severe underfitting).
  • An adjusted analysis more closely approximates a correct answer to the question “Were this subject given treatment B what do we expect to happen as compared to were she given treatment A?” The unadjusted odds ratio is a sample averaged odds ratio that only answers a question about a mixed batch of subjects. Its interpretation essentially relates to asking how some subjects on treatment B with lower values of X compare to some subjects on treatment A with higher values of X. Anyone interested in precision medicine should refuse to compute unadjusted estimates.

On a separate note, it is important to realize that covariate adjustment in randomized trials has nothing to do with baseline imbalance.


  1. Commentary on Improving Precision and Power in Randomized Trials for COVID-19 Treatments Using Covariate Adjustment, for Binary, Ordinal, and Time-to-Event Outcomes by Frank Harrell and Stephen Senn
  2. The Well Adjusted Statistician by Stephen Senn
  3. Analysis of Covariance in Randomized Studies by Frank Harrell
  4. Statistical Issues in Drug Development by Stephen Senn
  5. Should We Ignore Covariate Imbalance
  6. Comments on FDA guidance draft by Jonathan Bartlett

Dr. Harrell, could you expand a little on your comment about ATE? I agree that what we want to learn from an RCT is “what is the most likely outcome for a particular patient (J) if given treatment A instead of treatment B”. However, from a counterfactual perspective, that is impossible. The individual effect of A on J is not identifiable. This would require J to be treated with A and also with B at the same time, which is obviously impossible. This is the fundamental problem in identifying generic causes in research, and we address it by estimating average instead of individual effects. Even in the case of cross-over trials or N-of-1 trials all we can identify is the average causal effect. In the case of an N-of-1 trial, which seems as close as we can get to an individual effect, what we estimate is the average effect obtained by comparing the average endpoint from multiple periods of treatment A with the average endpoint from multiple periods of treatment B. In spite of your objection to the use of ATE in RCTs, the concept is implicit in the statement that what we want to learn is the “expected” outcome if J is given A instead of B. On the other hand, I concur with your concern about PATE in the context of RCTs. In fact, I’ve never seen a PATE in this context, though I have the impression your concern about PATE is lost in some textbooks and articles. Although PATEs make no sense in the context of clinical care, they may be useful for the formulation of public health policies, particularly in settings where the cost-benefit of a treatment in a population, as compared to a competing treatment, depends on the distribution of one or more prognostic factors.


Thanks for the thoughtful discussion Leonelo. I contend that ATE is not even the most useful concept for public health policies, unless the policies are required to be blind to characteristics of individuals, e.g., a mass one-time decision to vaccinate an entire population or to quarantine an entire state. Even for flu vaccination one might consider the elderly and children in day care as different from the rest of the population. As another example, breast cancer risks are conditioned on the sex of the person. Full conditioning on pre-existing information is the way to go in the vast majority of cases IMHO.

Your point about not being able to estimate the counterfactual is a very good one. I’d argue that in the crossover studies, a gold standard for non-permanent therapies, we come as close as we need to estimating patient-specific expected effects. But the fact that the counterfactual cannot be observed outside of a crossover study should not prevent us from trying to estimate it, i.e., what would happen to a specific patient were she given treatment A vs. were the same patient given treatment B. We approximate the counterfactual using a model. You can also think of this as a pure prediction problem blessed by randomization to be unbiased under certain conditions. Given baseline covariates X, what is our best estimate of the expected value of Y given X and treatment? We estimate two expected values for the patient, and by subtraction obtain expected increment in Y for that patient. This increment will be independent of X if there are no interactions and the model is a linear model.

Since the ATE is very problematic with binary outcomes, and other situations where the model in covariates must be nonlinear, such as proportional hazards models, I don’t cling to ATE as a useful concept. Since, if sex is a very strong covariate, the ATE is akin to comparing some of the males on treatment A with some of the females on treatment B, it is not what I need. This odd comparison is exemplified in the Mitch Gail example I quoted in the BBR chapter. If the treatment odds ratio is 9 for males and is 9 for females, but the overall odds ratio is 5.4, and since 5.4 is not a weighted average of 9 and 9, the bleeding of the sex effect into the overall treatment effect estimate is apparent.
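The 9, 9, and 5.4 pattern is easy to reproduce. The sketch below uses hypothetical risks chosen so that each sex has an odds ratio of 9, with equal numbers of males and females in each arm. It is meant as an illustration of non-collapsibility, not as a transcription of Gail's table.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


def odds_ratio(p_b, p_a):
    """Odds ratio for treatment B relative to treatment A."""
    return odds(p_b) / odds(p_a)


# Hypothetical probabilities of Y=1 by sex and treatment
risk = {
    "male": {"A": 0.5, "B": 0.9},
    "female": {"A": 0.1, "B": 0.5},
}

# Conditional (sex-specific) odds ratios
or_by_sex = {sex: odds_ratio(r["B"], r["A"]) for sex, r in risk.items()}

# With equal numbers of each sex per arm, the marginal risk is a simple mean
marginal = {arm: (risk["male"][arm] + risk["female"][arm]) / 2 for arm in ("A", "B")}
or_marginal = odds_ratio(marginal["B"], marginal["A"])

print(or_by_sex, round(or_marginal, 2))
```

Both conditional odds ratios are 9 (up to floating-point rounding) while the marginal odds ratio is 49/9, about 5.44, even though sex is perfectly balanced across arms. The attenuation therefore comes from averaging over a strong covariate, not from confounding.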


Thanks for clarifying, Dr. Harrell. I guess we are mostly on the same page, but I don’t think the concept of ATE implies being blind to individual characteristics. I believe the problem lies in when we can interpret an ATE as a causal effect. If we expect the potential outcome of a treatment to be importantly different in different individuals, then we should not calculate a marginal ATE. This is akin to the basic rule of not taking an average of things that are different. For instance, if a hormone receptor blocker has different effects in men and women, due to intrinsic sex-hormone differences, we should calculate separate ATEs for women and men. The basic problem there is that the potential outcomes for the same treatment differ by gender. In consequence, there are two target populations (men and women) and we should account for this during the design of the RCT. On the other hand, I have yet to go through the Mitch Gail example, but it seems to me, at first glance, that this is not a problem with the ATE but a problem with the non-collapsibility of the odds ratio. If one could instead use the risk ratio, the overall ATE would be a weighted average of the ATE in men and the ATE in women, and would equal the stratum-specific risk ratios if they are the same.

Good thoughts, and raises several points:

  • Risk ratios are hard to use because they can’t be constant over a broad risk range (e.g., a risk ratio of 2 cannot apply to someone with a baseline risk > 0.5)
  • Non-collapsibility of odds ratios is an issue here
  • ATE can be misleading even if the treatment effect is the same for men and women. Average treatment effect as considered in the causal inference world is on the absolute risk difference scale, but we consider treatment interactions on a scale for which such interactions may mathematically be absent, e.g., an odds ratio scale. The average odds ratio is almost meaningless, while the average absolute difference may not apply to anyone in the study.
  • ATE has been embraced largely because it facilitates causal inference, not because it is helpful in interpreting studies
  • Your correct belief that we should not compute ATE when there is outcome heterogeneity is ignored by all those using ATE

I fail to see why this is the case. Suppose we conduct an RCT to assess treatment A compared to treatment B (control) in patients hospitalized for disease D, with first rehospitalization for D as the endpoint. The trial lasts 2 years. Over this time period, 60% of patients in the control group were hospitalized, as compared to 40% in the treatment A group. We conduct a Kaplan-Meier analysis, splitting time by month, and test for the proportional hazards (PH) assumption. Should we expect the PH test to be significant, because the cumulative risk in the control group was >50%?
On the other hand, I recognize that most work about the ATE has focused on the absolute risk difference. However, the concept of ATE does not depend on the use of a specific effect measure (risk ratio, odds ratio, or hazard ratio). Also, when both interacting factors have an effect on the outcome, there will be interaction in one of the two scales (multiplicative or additive). Thus, this does not seem to be a property of the ATE.

You’ll have a hard time finding well developed literature where ATE did not use the (faulty in my view) absolute risk scale. And it usually considers a binary endpoint, so time to event and PH assumption are not relevant. Risk ratios violate basic properties of risk. Instantaneous risk ratios are different. And note that if you fail to adjust for covariates, the PH assumption is more likely to fail.

I have been reading around this topic for a while now.

A question that remains unanswered, at least to me:

I suppose there is a possibility of encountering situations where the observed covariates are distributed such that they bias the effect estimate in one direction, but the unobserved covariates are distributed such that they bias the effect estimate in the opposite direction to an extent that they counterbalance the observed covariates.

When you adjust for some covariates (observed) but not others (unobserved/unknown), do you run the risk of producing more biased estimates?

Anything can happen, but we make decisions here in expectation - what’s the best approach in the long run. And the whole discussion of sets of covariates has an identifiability problem because potentially there are infinitely many covariates.

That guideline was specific to RCTs. I’d seek out anything written by Stephen Senn on randomisation, bias, balance, adjusting, etc. to reassure you.

Thank you,

In a 2005 article Senn wrote “…given that one has taken account of relevant measured prognostic covariates and has randomized, there is no point worrying about the distribution of unmeasured covariates. In the absence of actual distribution in such covariates, the distribution they would have in probability over all randomizations is the appropriate distribution to use.”

I think that’s the essence of what @f2harrell is saying

There has been a lot of discussion in the past few months about how covariates should be taken into account in RCTs and which statistical model should be used. Some people are advocating the use of additive linear models because they like “causal estimands” such as average treatment effects on a probability scale (when the outcome is binary). There needs to be a clear message about why and how covariate adjustment should be undertaken.

As shown in references discussed in the analysis of covariance chapter in BBR, the purpose of covariate adjustment when the outcome is categorical or represents time-to-event is to prevent a loss of statistical power in the face of outcome heterogeneity. If there are strong prognostic factors such as age and there was a range of ages of study subjects, there will be outcome heterogeneity due to the subjects’ age distribution. Conditioning on the ages will prevent power loss.

Risk differences and risk ratios are useful concepts for homogeneous samples, but not so much for heterogeneous ones. In the classic two-homogeneous-samples problem, the choice of treatment effect metric matters less, and covariate adjustment is not needed because in a homogeneous sample either the covariates all have a single value or the covariate effects on outcome are all zero.

On the question of model choice, the logistic model has been found on the average to provide a better fit to the patient outcomes than other models. That is primarily because it places no constraints on any of the regression coefficients. A useful metric for choosing a model is the likelihood ratio \chi^2 statistic for the combined effect of all two-way interactions involving treatment. One will typically see that this measure of outcome variability due to interactions is smaller for the no-constraint logistic model than it is for models stated in terms of risk differences or log risk ratios. I have more faith in the model needing the fewest departures from additivity (lowest amount of improvement in log-likelihood by adding all the treatment interactions).

A well-fitting model that allows for there to possibly be no heterogeneity of treatment effect provides the best basis for assessing whether there is such heterogeneity. It is possible for a treatment odds ratio to be constant, whereas it is not possible for an absolute risk reduction or risk ratio to be constant across a wide variety of patient types. Sicker patients have larger absolute risk reduction from effective treatments, and a risk ratio cannot exceed 2 when a patient has a baseline risk of 0.5.
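The contrast between the two scales can be made concrete. The following is a minimal sketch with made-up function names and a hypothetical protective odds ratio of 0.5. It shows the ceiling on the risk ratio at each baseline risk, and what a constant odds ratio implies for absolute risk.

```python
def max_risk_ratio(baseline_risk):
    """Largest risk ratio that keeps the resulting risk at or below 1."""
    return 1.0 / baseline_risk


def risk_under_odds_ratio(baseline_risk, odds_ratio):
    """Risk obtained by multiplying the baseline odds by a constant odds ratio."""
    new_odds = odds_ratio * baseline_risk / (1.0 - baseline_risk)
    return new_odds / (1.0 + new_odds)


for p0 in (0.05, 0.25, 0.5, 0.75):
    p1 = risk_under_odds_ratio(p0, 0.5)  # hypothetical constant OR of 0.5
    # columns: baseline risk, risk ratio ceiling, treated risk, absolute risk reduction
    print(p0, round(max_risk_ratio(p0), 2), round(p1, 3), round(p0 - p1, 3))
```

The odds ratio can be held fixed at any baseline risk and always yields a valid probability. The risk ratio ceiling shrinks toward 1 as baseline risk rises, and the absolute risk reduction implied by the same odds ratio changes from row to row.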

For a simple example showing why conditional treatment effects are more appropriate than unconditional ones, see this. This example is one where conditional (adjusted) treatment effects are much more interpretable and relevant than the marginal treatment effect ignoring the covariate.

Some are advocating the use of population-averaged treatment effects in RCTs. Not only are such measures non-transportable as demonstrated in the simple example above, but they are not even calculable. To understand why I say this you need to understand how advocates are mislabeling what are really sample-averaged treatment effects with the term population-averaged treatment effect. Sample-averaged effects are dependent on the study’s inclusion criteria and seem to be used only because they are easy to compute. But to compute the desired population-averaged treatment effect when the RCT did not randomly sample patients from the population, one must know the sampling probabilities in order to do the proper weighted estimation. For example, if patient volunteers tended to be younger and more white than the population at large, we would need to know the probability of being sampled as a function of age and race. These sampling probabilities would then be used to compute population-averaged effects.
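A toy calculation shows how far a sample-averaged effect can sit from a population-averaged one. Every number below is hypothetical, and the population shares are simply assumed to be known, which is exactly the information an RCT usually lacks.

```python
# Hypothetical stratum-specific risk differences (treated minus control)
risk_difference = {"young": -0.02, "old": -0.10}

# Hypothetical age mix among trial volunteers (skewed young)
sample_share = {"young": 0.8, "old": 0.2}

# Hypothetical age mix in the target population (assumed known here)
population_share = {"young": 0.4, "old": 0.6}


def weighted_average(effects, shares):
    """Average the stratum effects using the given stratum shares as weights."""
    return sum(effects[s] * shares[s] for s in effects)


sample_ate = weighted_average(risk_difference, sample_share)
population_ate = weighted_average(risk_difference, population_share)

print(round(sample_ate, 3), round(population_ate, 3))  # prints -0.036 -0.068
```

Reweighting by population share is equivalent to weighting each subject by the ratio of population share to sample share. Without the sampling probabilities that ratio is unknown, and only the sample-averaged figure can be computed.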

RCTs use convenience samples, which is fine because the role of RCTs is to estimate relative efficacy and safety, and these quantities can be estimated from highly non-random samples as discussed here. Advocates of average treatment effects are thus telling us that

  • conditional relative efficacy measures that are patient-specific, transportable, and form the proper basis for examining heterogeneity of treatment effect while being easier to compute and interpret are not good enough
  • population-averaged treatment effects should be used instead, and the advocates are not able to tell us how to compute them so they pretend that sample-averaged effects are the target of inference

Another way to understand the choice of effect scale (e.g., absolute risk reduction, ARR; risk ratio, RR; odds ratio, OR; hazard ratio, HR) and the role of covariate adjustment is to consider the statistical model that was envisioned when statistical tests and estimators were first derived for the two-sample problem. In all the statistical texts describing these statistical methods, there is an iid (independent and identically distributed outcomes) assumption. Identically distributed pertains, within a given treatment group, to each patient having the same outcome tendency, e.g. the same probability of a binary outcome. So within treatment group, patients are assumed to be homogeneous and covariates are either constant or have no effect. In either case these covariates may be ignored. The traditional estimators for the two-sample problem are differences in means and, for binary Y, differences in proportions (ARR) or risk ratios (RR). These estimators assume homogeneity of patients within treatment group. When there is an impactful dispersion of baseline covariates, iid no longer holds and quantities such as ARR and RR that ignore covariate adjustment no longer apply. ORs and HRs, on the other hand, exist for multivariable heterogeneous situations, readily allowing adjustment for covariates without placing restrictions on the effects of covariates and treatment. This lack of restriction is due to the unlimited range of logistic and Cox model coefficients, which also explains why such relative treatment effect models do not require interactions to be present just to keep probabilities inside [0,1].

To say this another way, differences in proportions (ARR) are nice summaries of treatment effects in the homogeneous case, but when there is outcome heterogeneity they fail to recognize that higher risk patients usually get higher absolute treatment benefit. Standard covariate adjustment on the unrestricted relative treatment effect basis satisfactorily addresses all these issues, and the relative treatment effect can easily be translated into a patient-specific absolute benefit by simply subtracting the predictions of two logistic models.
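The translation from a relative effect to a patient-specific absolute benefit might look like the sketch below. It assumes a logistic model with age and treatment only and no interaction; the coefficients are invented for illustration and are not estimates from any trial.

```python
import math


def expit(x):
    """Inverse logit: map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical coefficients: intercept, age (per year), treatment (log odds ratio)
INTERCEPT, B_AGE, B_TREAT = -5.0, 0.06, math.log(0.6)


def absolute_risk_reduction(age):
    """Predicted risk under control minus predicted risk under treatment."""
    lp_control = INTERCEPT + B_AGE * age
    return expit(lp_control) - expit(lp_control + B_TREAT)


for age in (40, 60, 80):
    print(age, round(absolute_risk_reduction(age), 4))
```

The odds ratio is the same 0.6 for every patient in this sketch, yet the absolute benefit grows with age because baseline risk grows with age.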


“On the question of model choice, the logistic model has been found on the average to provide a better fit to the patient outcomes than other models.”

Logistic models as opposed to what other models? Any references for this?

As opposed to additive risk models, risk ratio models, and probit regression. Sorry I don’t have references, just a career in which I’ve seen many examples to support my view. We can examine this question empirically once we agree how to quantify model fit. My suggestion is the proportion of explainable log likelihood that is explained without interactions. A strong example may be found here where the optimum penalty for all factors interacting with treatment was found to be \infty, i.e., there is no evidence for non-additivity, i.e., a simple logistic model fits very well in a very large RCT.

One question that arises for me is how to specify the multivariable model in a survival analysis. One has to decide things like how to analyze continuous covariates (linearly or non-linearly) and how many variables to include in the model. How do you do this in practice? For example, do you estimate the sample size according to the Schoenfeld equation, as usual, and then adjust for a vector of covariates chosen according to the number of degrees of freedom available to spend, as you recommend in your book for observational studies? Would everything you recommend in the RMS book be valid in this scenario? Do I have to report the unadjusted effect? If I communicate both hazard ratios, are there multiplicity issues? … etc

For covariate adjustment, overfitting is not as large a problem as when developing a clinical prediction model. You might use a rough rule of thumb such as at least 5 events per parameter to be estimated. For baseline variables thought to be especially important (e.g., age and extent of disease) we often pre-specify 3 or 4 knots in a restricted cubic spline function. You do have to allocate degrees of freedom carefully to capture the main prognostic variables and certain nonlinear terms.
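The rule of thumb can be turned into a quick budget check. This is a rough sketch: the covariate plan and the expected event count are hypothetical, and it relies on the fact that a restricted cubic spline with k knots uses k - 1 regression parameters.

```python
def rcs_parameters(knots):
    """Number of regression parameters for a restricted cubic spline with k knots."""
    return knots - 1


def parameter_budget(expected_events, events_per_parameter=5):
    """Maximum number of parameters supported by the rough events-per-parameter rule."""
    return expected_events // events_per_parameter


# Hypothetical plan: treatment, age and extent of disease as 4-knot splines, sex
planned = 1 + rcs_parameters(4) + rcs_parameters(4) + 1

budget = parameter_budget(expected_events=120)  # hypothetical expected event count

print(planned, budget, planned <= budget)  # prints 8 24 True
```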

And is the sample size usually calculated for a univariable model (e.g., using Schoenfeld’s formula)?

Yes, we just assume that the hazard ratio is the adjusted hazard ratio even though his formula is for the unadjusted one. Sample size is for the treatment effect; then we check that the expected number of events will support covariate adjustment.
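For reference, Schoenfeld's approximation for the required number of events can be sketched as follows. The hazard ratio, alpha, power, and allocation values are illustrative defaults, not figures from this thread.

```python
import math
from statistics import NormalDist


def schoenfeld_events(hazard_ratio, alpha=0.05, power=0.80, allocation=0.5):
    """Approximate number of events needed for a two-sided test of a hazard ratio.

    Schoenfeld's formula: (z_{1-alpha/2} + z_{power})^2 / (p * (1 - p) * log(HR)^2),
    where p is the proportion of subjects allocated to one arm.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    denominator = allocation * (1 - allocation) * math.log(hazard_ratio) ** 2
    return math.ceil((z_alpha + z_beta) ** 2 / denominator)


events = schoenfeld_events(0.75)  # hypothetical target hazard ratio
print(events)                     # prints 380
print(events // 5)                # parameters supported at 5 events per parameter
```

With this many events, the rough rule of 5 events per parameter leaves ample room for a pre-specified set of adjustment covariates.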

I have the feeling that when the drug is not going for registration, the researchers first try the naive approach, and if they don’t succeed, then they explore multivariable models. For example, the controversial BILCAP clinical trial for cholangiocarcinoma in The Lancet:

Here the main end point was analyzed with a univariable Cox model. The p-value was >0.05 but they “pre-specified” (whatever that means) a multivariable sensitivity analysis.
Therefore you have two results, adjusted and unadjusted… but with the adjusted one being a sensitivity analysis…
Does this make sense?