Should we ignore covariate imbalance and stop presenting a stratified 'table one' for randomized trials?

This would be a huge improvement on the current way papers are presented. But the only way to convince journals is for it to come from a weighty consortium of statisticians.


Trialists should endeavor to inspect the quality of their randomization.

Why? Two seemingly identical trials can lead to conflicting results if a very important covariate is stratified in one trial and imbalanced in the other AND that covariate is an effect modifier.

My complaint about current practice:

  1. I agree that presenting a “table one” devotes far too much time and too much attention to the relevance of balance. In the paper, the statistician should just say, “Randomization was inspected.”

  2. I think better effort must be made to describe and measure prognostic factors that could be confounders when deploying the treatment in a non-participant population.

  3. A better way to inspect randomization than doing billions of t-tests for frequency differences by treatment assignment is to actually redo the main analysis using some form of adjustment or weighting. Theoretically, the adjustment should have no impact, and this reduces the inference to a single comparison rather than many.
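
To make point 3 concrete, here is a minimal sketch in plain Python, assuming a simulated trial with a single prognostic covariate. The data-generating values and the function names (`simulate_trial`, `unadjusted_effect`, `adjusted_effect`) are made up for illustration; they are not anyone's actual analysis. It compares the unadjusted difference in means with the covariate-adjusted (ANCOVA) estimate of the same treatment effect.

```python
import random
from statistics import mean


def simulate_trial(n, effect, seed):
    """Simulate a two-arm trial with one prognostic covariate (hypothetical values)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)      # baseline prognostic covariate
        t = rng.randint(0, 1)    # simple 1:1 randomization
        y = 2.0 * x + effect * t + rng.gauss(0, 1)
        rows.append((t, x, y))
    return rows


def unadjusted_effect(rows):
    """Difference in mean outcome, treated minus control."""
    y1 = [y for t, _, y in rows if t == 1]
    y0 = [y for t, _, y in rows if t == 0]
    return mean(y1) - mean(y0)


def adjusted_effect(rows):
    """Treatment coefficient from least squares of y on treatment and x (ANCOVA)."""
    sxy = sxx = 0.0
    arm_means = {}
    for arm in (0, 1):
        xs = [x for t, x, _ in rows if t == arm]
        ys = [y for t, _, y in rows if t == arm]
        mx, my = mean(xs), mean(ys)
        arm_means[arm] = (mx, my)
        sxy += sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx += sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx  # pooled within-arm slope of y on x
    (mx0, my0), (mx1, my1) = arm_means[0], arm_means[1]
    # Remove the part of the raw difference explained by chance imbalance in x
    return (my1 - my0) - slope * (mx1 - mx0)


rows = simulate_trial(n=4000, effect=1.0, seed=1)
print("unadjusted:", round(unadjusted_effect(rows), 3))
print("adjusted:  ", round(adjusted_effect(rows), 3))
```

Under sound randomization the two numbers should be close, which is the single comparison being proposed in place of many per-covariate tests.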

Well, I am convinced! Thank you for the really clearly articulated explanation. Not sure there can be a counter argument for keeping stratified table 1’s in response to this. I really like the proposed summary presentation of baseline covariates with outcome to provide richer information, but I may not yet be ready to get rid of a table 1 in its recognisable form, with pooled data describing the participants. This suggestion could be used for a new table 2 or later?

I’m entirely on board with these suggestions.
What’s your favorite citation for justifying this approach when you (inevitably) get reviewer comments asking for a Table 1 stratified by treatment allocation and filled with P-values?

Fun question! It is an inevitable request, but in my experience it comes less often from journals and more often from investigators. I have in the past used an early Altman paper and a paper by Senn. I used to work in Doug’s unit and came from the approach of not testing, but that it’s OK to look; I am now convinced by your philosophical argument not to look. The next interesting challenge will be to start proposing this recommendation. A published article is needed. Has anything been written on this yet?

The point about effect modification is important. But it’s not that ignoring this is completely wrong; more so it’s that the treatment effect will be a kind of average over levels of the interacting factor, so will not apply well to individuals even though it’s right “on the average”.

To say “randomization was inspected” is not specific enough. The randomization process should be inspected and mentioned. Not covariate balance, which just reflects random occurrences.

Instead of “describe and measure prognostic factors” I’d say “pre-specify the most important prognostic factors” and adjust for them come hell or high water. This also does away with the need for your 3.


There are 4 references in Section 13.1 of BBR.

I agree with this. When Adam first suggested it I got the feeling that “randomization inspected” is too vague and is moving away from being more open, which I feel is counter to what a lot of people in research are trying to do.

I find the arguments here very convincing. I’ve been thinking about using graphical displays for “Table 1” too and this seems a much better way to communicate the important information than tedious and hard to interpret tables. I guess the question is really: what is this table (or collection of graphs) for? I would agree with Frank that it’s primarily about describing the population in the trial; I suspect most people would say that it’s about assessing balance and the success of randomisation - which is crazy really because we are almost always sure that our randomisation processes are sound, and most of the variables in “Table 1” don’t play any role in randomisation, so it’s probably unrealistic to expect them to show up any problems.

The NEJM justification for using p-values in Table 1 was to detect problems in randomisation or possible malpractice (which sounds like a bit of a stretch to me).


A very interesting discussion! Particularly if we are analyzing these data using frequentist methods, I come down on the side of pre-specifying known important prognostic factors and planning to adjust for them irrespective of any imbalances that may or may not arise in the data circumstantially.

My reasoning is as follows: In a frequentist world, we need to ask ourselves to think about a hypothetical scenario under which the study at hand were repeated infinitely many times under identical circumstances. If one’s decision–through inferential, graphical, or descriptive summary measures–to place a prognostic variable into a model is based on imbalances (perceived or real) in the observed data, he or she has obfuscated his or her standard errors. Confidence intervals from the model will not have valid coverage. What’s more, if you are modeling a parameter using a non-collapsible link function, the parameter you target is not invariant across study replicates–namely, you change it each time you make a different decision about which variables to adjust for.

If a particular prognostic factor is not balanced across treatments, we know that this is due to random chance by virtue of the randomized nature of treatment. Under a frequentist framework, that same variable would show a comparable amount of imbalance in the opposite direction in some other theoretical study replicate almost surely. Strictly speaking, then, any observed imbalance in any factor (prognostic or otherwise) across treatment levels does not induce a bias if left unadjusted for. Rather, imbalances in any covariates are one (of several) factors that explain the overall variability of the estimate.

Finally, adjustment for known prognostic factors is understood in many settings to increase power, and so doing this (i.e., planning to do this) is unlikely to hurt you.
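
The argument in the last two paragraphs can be checked with a small simulation. This is a sketch under assumptions of my own choosing (one normally distributed prognostic covariate, a true treatment effect of 1.0, 100 patients per trial, 2000 hypothetical replicates); `one_replicate` is an illustrative name, not an established routine. Across replicates the chance imbalance averages out to zero, the unadjusted estimate stays centered on the truth, and the pre-specified adjusted estimate is less variable.

```python
import random
from statistics import mean, stdev


def one_replicate(rng, n=100, effect=1.0):
    """Return (covariate imbalance, unadjusted estimate, adjusted estimate)."""
    arms = [0, 1] * (n // 2)
    rng.shuffle(arms)                      # randomized 1:1 allocation
    xs = {0: [], 1: []}
    ys = {0: [], 1: []}
    for t in arms:
        x = rng.gauss(0, 1)                # prognostic covariate
        y = 2.0 * x + effect * t + rng.gauss(0, 1)
        xs[t].append(x)
        ys[t].append(y)
    mx = {t: mean(xs[t]) for t in (0, 1)}
    my = {t: mean(ys[t]) for t in (0, 1)}
    sxy = sum((x - mx[t]) * (y - my[t])
              for t in (0, 1) for x, y in zip(xs[t], ys[t]))
    sxx = sum((x - mx[t]) ** 2 for t in (0, 1) for x in xs[t])
    imbalance = mx[1] - mx[0]              # purely random under randomization
    unadj = my[1] - my[0]
    adj = unadj - (sxy / sxx) * imbalance  # ANCOVA estimate, planned in advance
    return imbalance, unadj, adj


rng = random.Random(2024)
results = [one_replicate(rng) for _ in range(2000)]
imbalances, unadj, adj = zip(*results)

print("mean imbalance:      ", round(mean(imbalances), 3))
print("unadjusted mean / SD:", round(mean(unadj), 3), round(stdev(unadj), 3))
print("adjusted mean / SD:  ", round(mean(adj), 3), round(stdev(adj), 3))
```

The imbalance shows up as extra spread in the unadjusted estimates rather than as a shift in their center, which is the distinction between variability and bias drawn above.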


This is a very nice angle, getting at the fundamental meaning of sampling distributions when using frequentist inference. I read you as saying that if one looks for observed imbalances, and does anything about it, the sampling distribution is ill-defined because there will be by definition different covariates imbalanced at each study replication, and positive imbalances can even turn into negative ones.

Hi @SpiekerStats, thanks for sharing this additional lens on the issue. If I understand this correctly, it seems that a similar problem (described in the quoted text) might occur if the model were modified in some other way, e.g., by modifying the link function or considering a transformation of the response, or by modifying the error structure. If so, how do we reconcile this with our use of model diagnostics in this context?


I agree with this conclusion–particularly if these choices are tied to the data and if different hypothetical study replicates would potentially cause you to make different choices. It’s really such a tough thing, because this problem seemingly demands not just that you get the model right, but that you get it right the first time without ever using your data to guide you.

To the extent possible, I try to rely on pre-specified semi-parametric regression methods, together with splines. That one can consistently estimate the coefficient(s) corresponding to treatment effects even if the splines for the prognostic variables are not correctly specified is a huge help. An a priori choice to employ Huber-White errors would further prevent me from worrying unduly about a mis-specified error structure. If I really want to estimate a difference in means, it’s going to be very hard to convince me to change the link function away from the identity or to transform the outcome post-hoc, not just because I’m changing a model, but because I’m changing the question at hand. I hope these choices help justify my personal tendency (preference?) not to give model diagnostics a great deal of consideration for most of the association studies I work on.

But then, once we’re in a world where we need to start relying on random draws from a predictive distribution (e.g., in parametric g-computation), this gets so much more dicey, and the problem of what to do when confronted with evidence of an incorrect model really comes to life. Absent nonparametric alternatives, the simplest potential remedy I can think of is to use a training set for model building and learning–and then using the trained model structure on the remainder of the data to obtain the quantitative evidence about the parameter at hand and, in turn, to derive conclusions. There are obvious shortcomings–not the least of which is the presumption of enough data to justify the power cost :slight_smile:.


A lot of food for thought. For non-statistician readers, the link function is a very important part of the model. In the logistic model the link is the logit (log odds) function \log(\frac{p}{1-p}) and in the Cox proportional hazards model it is log hazard or equivalently log-log survival probability. In ordinary regression the link function is the identity function, e.g., we model mean blood pressure directly from a linear combination of (possibly transformed) covariates.
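
For readers who prefer code to notation, the three links just mentioned can be written as small Python functions. This is only a sketch; the function names are mine, and `log_minus_log` takes a survival probability between 0 and 1.

```python
import math


def logit(p):
    """Logistic-model link: log odds of a probability p."""
    return math.log(p / (1 - p))


def inv_logit(eta):
    """Map a linear predictor back to a probability."""
    return 1 / (1 + math.exp(-eta))


def log_minus_log(s):
    """Log-log link on a survival probability s: log(-log(s))."""
    return math.log(-math.log(s))


def identity(mu):
    """Ordinary-regression link: the mean is modeled directly."""
    return mu


print(logit(0.5))                        # even odds correspond to log odds of zero
print(round(inv_logit(logit(0.3)), 6))   # the round trip recovers the probability
print(identity(120.0))                   # e.g. mean blood pressure, unchanged
```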

When there is no uncertainty about the link function, the Bayesian approach has some simple ways to handle uncertainty about the model. For example, if one were unsure about whether treatment interacts with age, an interaction term that is “half in” the model could be specified with Bayes by putting a skeptical prior distribution on the interaction. This logic also applies to accounting for a list of covariates that you hope are not that relevant, by penalizing (shrinking) the effects of these secondary covariates using skeptical priors.

When the link function itself is in question, you can essentially have a Bayesian model that is a mixture of two models and let the data “vote” on these. For example you could have a logit and a log-log link. The resulting analysis would not give you the simple interpretation we often seek, e.g., a simple odds ratio for treatment. Odds ratios and all other effect measures would be a function of the covariates. But one could easily estimate covariate-specific odds ratios and covariate-specific absolute risk differences in this flexible setup.

The key is pre-specification of the model structure. In the frequentist world we must specify the complete list of covariates to adjust for before examining any relationships between X, treatment, and Y. With Bayes we can pre-specify a much more complex model that includes more departures from basic model assumptions. As n \rightarrow \infty, this will reduce to the simple model should that model actually fit, or it will automatically invoke all the model ‘customizations’ including not borrowing much information from males to estimate the treatment effect for females should a sex \times treatment interaction be important. For smaller n, such a Bayesian model would discount the interaction term which results in borrowing of information across sexes to estimate the treatment effect. To take advantage of the Bayesian approach one must pre-specify the general structure and the prior distributions of all the parameters that are entertained. That’s where clinical knowledge is essential, e.g., how likely is it a priori that the treatment effect for males is more than a multiple of 2 of the treatment effect for females?
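
A rough way to see how a skeptical prior discounts an interaction at small n and lets the data dominate at large n is the conjugate normal calculation below. It is a sketch, assuming a normal approximation to the likelihood for the interaction coefficient and a zero-centered normal prior; the estimate of 1.0, the prior SD of 0.25, and the function name `shrink` are illustrative assumptions, not values from any real trial.

```python
import math


def shrink(estimate, std_error, prior_sd):
    """Posterior mean and SD for a coefficient with a N(0, prior_sd^2) prior.

    Assumes the data contribute a normal likelihood centered at `estimate`
    with standard deviation `std_error`.
    """
    prior_precision = 1 / prior_sd ** 2
    data_precision = 1 / std_error ** 2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * data_precision * estimate
    return post_mean, math.sqrt(post_var)


# The same observed interaction, with the standard error shrinking as n grows
for se in (0.5, 0.1, 0.02):
    post_mean, post_sd = shrink(estimate=1.0, std_error=se, prior_sd=0.25)
    print(f"SE={se}: posterior mean={post_mean:.3f}, posterior SD={post_sd:.3f}")
```

With a large standard error the interaction is pulled most of the way toward zero (information is borrowed across sexes); with a tiny standard error the posterior mean is essentially the observed estimate.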


Doesn’t fitting the conditional model change the meaning of your parameter because inference is conditional on the other covariates in the model? I.e., E(Y|T=1, AGE=30) - E(Y|T=0, AGE=30) is not equal to E(Y|T=1) - E(Y|T=0), except with identity link. But it seems the goal of the clinical trial is to estimate E(Y|T=1) - E(Y|T=0) (or E(Y|T=1)/E(Y|T=0)). I think @SpiekerStats you are right that you need to fit the correct mean model to get standard error estimates that are probably smaller and less variable (even if using robust SE estimates), but I think the target parameter should still be the same E(Y|T=1) - E(Y|T=0), instead of interpreting the parameter for treatment from the logistic regression model.
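
A small numeric illustration of the difference between conditional and marginal parameters under a logit link, as a sketch with made-up coefficients: a binary covariate X with prevalence 0.5 in both arms (so perfectly balanced, no confounding) and a conditional log odds ratio of 1 for treatment. No data are simulated; the probabilities come straight from the assumed model.

```python
import math


def expit(eta):
    return 1 / (1 + math.exp(-eta))


def odds(p):
    return p / (1 - p)


# Hypothetical conditional model: logit P(Y=1 | T, X) = -1 + 1*T + 2*X
b0, b_trt, b_x = -1.0, 1.0, 2.0
p = {(t, x): expit(b0 + b_trt * t + b_x * x) for t in (0, 1) for x in (0, 1)}

# Covariate-specific (conditional) effects
conditional_or = {x: odds(p[1, x]) / odds(p[0, x]) for x in (0, 1)}
conditional_rd = {x: p[1, x] - p[0, x] for x in (0, 1)}

# Marginal effects: average the risks over X (prevalence 0.5 in each arm)
marg = {t: 0.5 * p[t, 0] + 0.5 * p[t, 1] for t in (0, 1)}
marginal_or = odds(marg[1]) / odds(marg[0])
marginal_rd = marg[1] - marg[0]

print("conditional ORs:", {x: round(v, 3) for x, v in conditional_or.items()})
print("marginal OR:    ", round(marginal_or, 3))
print("conditional RDs:", {x: round(v, 3) for x, v in conditional_rd.items()})
print("marginal RD:    ", round(marginal_rd, 3))
```

The odds ratio is the same in both strata yet the marginal odds ratio is closer to 1, even with perfect balance. The marginal risk difference, by contrast, is just the average of the stratum-specific risk differences.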

To me the goal of a parallel-group randomized clinical trial is to answer this question: do two patients starting out at the same point (same age, severity of disease, etc.), one on treatment A and one on treatment B, end up with the same expected outcomes? This is fundamentally a completely conditional model.


Then you are interested in E(Y | T=1, COV=covariates) - E(Y | T=0, COV=covariates), or maybe for categorical Y, E(Y | T=1, COV=covariates)/E(Y | T=0, COV=covariates)? It seems like you’ll eventually want to take the expectation over the covariates, because you will be applying the treatment in a population whose covariates you cannot control and the outcome may not be independent of covariates.

Here are my thoughts on the marginal vs. conditional issue :slight_smile:. These are two reasonable examples of how I think the goal of a randomized trial could be articulated:

  • To evaluate the extent to which treatment A, when given to the population, tends to result in more favorable outcomes on average as compared to treatment B.
  • To evaluate the extent to which a patient subgroup of covariate level(s) X = x would tend to have a more favorable outcome on average when given treatment A, as compared to a patient subgroup of the same covariate level(s), but given treatment B.

If one believes the first to be the goal, she would use a marginal model. If another believes the second to be the goal, he would use a conditional model. If one did not think of the question in advance and/or looked at a Table 1 to find imbalances that are necessarily purely random, they may be inclined to try both, and I think this really highlights my fundamental concern: lack of pre-specification may lead a researcher to make different modeling decisions at each hypothetical study replicate, thereby rendering associated standard errors invalid due to an ill-defined sampling distribution.

But back to the specific issue raised here: In the case that a linear model is used to analyze the trial, the advantage of a conditional model is the increased power due to improved precision. It turns out, as you have noted, that the value of the target parameter doesn’t change even if you condition on X. In the case of non-collapsible link functions, the marginal and conditional parameters are not equal (they usually don’t drastically differ in my experience). However, the power of the conditional model is typically higher when adjusting for prognostic variables (because of increased precision, and because the value of the conditional parameter often, but not always, tends to be further from the null).
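
For the linear (identity link) case, the reason the target does not change can be written out in two lines. This is a sketch assuming a main-effects model with no treatment-by-covariate interaction; the \beta symbols are introduced here only for illustration.

```latex
\begin{aligned}
E(Y \mid T, X) &= \beta_0 + \beta_1 T + \beta_2 X \\
E(Y \mid T=1) - E(Y \mid T=0) &= \beta_1 + \beta_2 \left[ E(X \mid T=1) - E(X \mid T=0) \right] = \beta_1
\end{aligned}
```

The bracketed term is zero because randomization makes the distribution of X the same in both arms, so the marginal and conditional differences in means are both \beta_1.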

To me, both the marginal and conditional parameters are defensible targets, and my reason for preferring a conditional model is usually more pragmatic than philosophical (increased power, better use of resources).

You raised an important issue, though: that you may be applying the treatment to a population whose covariates you cannot control. I think this is an argument in favor of having a representative sample, which necessarily means thinking very hard about whether inclusion/exclusion criteria are too restrictive. It also means, in many cases, post-marketing surveillance and Phase IV trials.


Nicely put Andrew. For reasons given by Mitch Gail in a quote I put in the ANCOVA chapter in BBR I think the conditional estimate is what is always needed:


The premise of this argument is very well taken (i.e., that there is only one patient in the room and, at his or her time of treatment, he or she is the only one of interest to the clinician). From this premise (and ignoring other potential challenges for the time being), I derive a slightly different conclusion, though, than Gail. Specifically, this premise strikes me as more of an argument in favor of allowing for effect modification between covariates and treatment than simply an argument for a model conditional on covariates.

Suppose we have evidence that a treatment is effective. Were I asked to explain my decision to administer this treatment to an individual patient from the population, I would use exactly the same response irrespective of whether the estimate of the treatment effect were derived from a marginal or conditional model. I may say, for instance: “I’m deciding to administer this drug or therapy to you because I have sufficient evidence that doing so will, on average, tend to result in more favorable outcomes for this overall patient population.” That is to say that, even with the model conditional on covariates, you’re still stuck with a single estimate of the treatment’s benefit, regardless of what you know about the patient in the room.

That there is one patient at a time on whom to decide, to me, suggests that a model allowing for effect modification gets further to the heart of the matter, such that I can update my justification for treating (or not treating) depending upon the covariate profile of the patient in the room, and truly make patient-specific decisions. With a model that allows for effect modification, a clinician can then modify the above statement as: “I’m deciding to administer this drug or therapy to you because I have sufficient evidence that doing so will, on average, tend to result in more favorable outcomes for the patient subpopulation having the same set of covariates as you.”
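
As a sketch of what "updating the justification by covariate profile" could look like, the snippet below simulates a trial in which the treatment effect grows with a covariate x and then estimates the effect at chosen values of x. Fitting a separate line in each arm gives the same point estimates as a single model with a treatment-by-x interaction. The data-generating values and the names `fit_line` and `effect_at` are illustrative assumptions.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares intercept and slope of y on x."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


rng = random.Random(7)
arm = {0: ([], []), 1: ([], [])}
for _ in range(4000):
    t = rng.randint(0, 1)          # randomized treatment
    x = rng.gauss(0, 1)            # baseline covariate
    # Hypothetical truth: treatment effect is 1.0 + 0.5 * x
    y = 0.5 * x + t * (1.0 + 0.5 * x) + rng.gauss(0, 1)
    arm[t][0].append(x)
    arm[t][1].append(y)

(a0, b0), (a1, b1) = fit_line(*arm[0]), fit_line(*arm[1])


def effect_at(x):
    """Estimated treatment effect for a patient with covariate value x."""
    return (a1 - a0) + (b1 - b0) * x


for x in (-1.0, 0.0, 2.0):
    print(f"x={x}: estimated effect {effect_at(x):.2f}")
```

A model without the interaction would report one number for every patient, which is the limitation described in the preceding paragraphs.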

Naturally, deciding on potential effect modifiers and weighing evidence in favor of or against effect modification is another challenge (perhaps for another post!). :slight_smile:
