"Exchangeability" in observational studies versus RCTs

Introductory epidemiology courses that physicians take as part of their EBM training don’t always explicitly discuss exchangeability. Yet it seems like this concept might be important for us to understand (in at least a crude way) when reading medical studies. Could misunderstanding and/or poor/inconsistent teaching around this topic underlie some of our bad critical appraisal habits (e.g., comparing columns in Table 1 of RCTs)?

Written explanations of exchangeability become complicated very quickly; it seems like advanced math/stats/epi training is needed for deep understanding. I could only find a few articles on this subject that explained things in terms I could understand. These are a couple of examples:


Is this high-level, non-expert interpretation correct?:

  • exchangeability is an assumption that’s needed for valid causal inference;

  • the assumption is that the treated/untreated or exposed/unexposed groups in an RCT or observational study (respectively) would have experienced a similar distribution of outcomes of interest if we had simply followed them over time without treating/exposing either group (?)
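
For readers who prefer symbols, here is one hedged way to write the second bullet in potential-outcomes notation. The notation is an assumption on my part (the post itself uses none): A is the treatment/exposure indicator, Y^{a=0} is the outcome a person would have had if left untreated, and C is a set of measured covariates. Strictly speaking this is the "untreated" half of exchangeability; the full version makes the same statement for every treatment level a.

```latex
% Marginal exchangeability (the target of randomization):
Y^{a=0} \perp\!\!\!\perp A
\quad\Longleftrightarrow\quad
\Pr\left(Y^{a=0} = y \mid A = 1\right) = \Pr\left(Y^{a=0} = y \mid A = 0\right)
\quad \text{for all } y

% Conditional exchangeability (the usual hope in observational studies):
Y^{a=0} \perp\!\!\!\perp A \mid C
```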

Observational Studies

  • In observational studies, exchangeability can’t be assumed without an attempt to “balance” the exposed/unexposed groups with respect to covariates that are associated with both the likelihood of exposure and the outcome of interest. In other words, in the observational context, attention to balancing of potential confounders is an important method to promote exchangeability;

  • If there is no effort to balance confounders, the observational study will not be measuring what it’s supposed to (i.e., its “internal validity” will be compromised). The study will be measuring more than just the effect of the exposure/intervention and the resulting point estimate will be biased.
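
A toy simulation may make the two bullets above concrete. It is only a sketch under made-up assumptions: a single binary confounder (`severe`, a hypothetical name) makes exposure more likely and also worsens the outcome, and the true exposure effect is fixed at 1.0. The crude comparison mixes the two effects together, while a simple stratified comparison recovers something close to the true effect.

```python
import random
import statistics


def simulate_observational(n=20000, true_effect=1.0, seed=0):
    """Generate (severe, exposed, outcome) rows for a hypothetical cohort."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        severe = rng.random() < 0.5           # hypothetical confounder
        p_exposed = 0.8 if severe else 0.2    # sicker patients are exposed more often
        exposed = rng.random() < p_exposed
        # The outcome depends on BOTH the confounder and the exposure.
        outcome = 2.0 * severe + true_effect * exposed + rng.gauss(0, 1)
        rows.append((severe, exposed, outcome))
    return rows


def mean_diff(rows):
    """Exposed-minus-unexposed difference in mean outcome."""
    exposed = [y for _, a, y in rows if a]
    unexposed = [y for _, a, y in rows if not a]
    return statistics.fmean(exposed) - statistics.fmean(unexposed)


def stratified_diff(rows):
    """Average of within-stratum differences, weighted by stratum size."""
    total = 0.0
    for level in (False, True):
        stratum = [r for r in rows if r[0] == level]
        total += mean_diff(stratum) * len(stratum)
    return total / len(rows)


rows = simulate_observational()
crude = mean_diff(rows)           # confounded: well above the true effect of 1.0
adjusted = stratified_diff(rows)  # close to the true effect of 1.0
print(f"crude estimate:    {crude:.2f}")
print(f"adjusted estimate: {adjusted:.2f}")
```

Stratification works here only because the single confounder is measured and the model is deliberately simple; real observational data rarely offer that guarantee.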


RCTs

  • A mechanically sound randomization process promotes exchangeability;

  • As the number of randomized subjects increases, the theoretical outcome distributions of the two study arms, in the absence of treatment, are expected to become more similar (i.e., exchangeability increases with increasing n ?);

  • When n is small, the theoretical/expected shape of the distribution of outcomes for the two arms, in the absence of treatment, would be expected to be less similar (i.e., the two groups would be less “exchangeable”);

  • When a small study is actually run, this dissimilarity in theoretical outcome distribution between arms (suboptimal exchangeability) manifests as a pattern of actual outcomes that demonstrates a degree of within-arm and between-arm variance/scatter that is greater than that which would have occurred with a larger study. This greater “scatter” of outcomes that we see with a smaller randomized study translates to wider confidence intervals (greater “random error”) and this wider interval protects us from making overly-confident inferences in this context of suboptimal exchangeability (??);

  • Some epidemiologists use the term “random confounding” but it seems controversial (?) Do statisticians and epidemiologists agree on what this means? I never heard it in my training (many moons ago). When I hear the word “confounding,” I think of something that can confer bias, i.e., move the point estimate. I can see how an RCT’s point estimate could be affected if suboptimal exchangeability is the result of a broken randomization machine (see below). But assuming the randomization machine is working properly, I can’t see how suboptimal exchangeability due to small n alone would be capable of moving/biasing a study’s point estimate (?)
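
The small-n question above can be explored with a minimal simulation sketch. All names and numbers are illustrative assumptions, not taken from any real trial: each subject gets a baseline prognosis, exactly half of the subjects are assigned to treatment at random, and the whole trial is repeated many times at two sample sizes.

```python
import random
import statistics


def run_trial(n_per_arm, true_effect, rng):
    """Run one hypothetical two-arm trial and return its effect estimate."""
    n_total = 2 * n_per_arm
    # Baseline prognosis: the outcome each subject would have without treatment.
    prognosis = [rng.gauss(0, 1) for _ in range(n_total)]
    # A working "randomization machine": pick exactly half at random.
    treated_idx = set(rng.sample(range(n_total), n_per_arm))
    treated = [p + true_effect for i, p in enumerate(prognosis) if i in treated_idx]
    control = [p for i, p in enumerate(prognosis) if i not in treated_idx]
    return statistics.fmean(treated) - statistics.fmean(control)


def summarize(n_per_arm, n_trials=1000, true_effect=1.0, seed=1):
    """Mean and spread of the estimates over many repeated trials."""
    rng = random.Random(seed)
    estimates = [run_trial(n_per_arm, true_effect, rng) for _ in range(n_trials)]
    return statistics.fmean(estimates), statistics.stdev(estimates)


small_mean, small_sd = summarize(n_per_arm=10)
large_mean, large_sd = summarize(n_per_arm=200)
print(f"n=10 per arm:  mean estimate {small_mean:.2f}, spread {small_sd:.2f}")
print(f"n=200 per arm: mean estimate {large_mean:.2f}, spread {large_sd:.2f}")
```

Under these assumptions, the average estimate sits near the true effect of 1.0 at both sample sizes, while the spread of estimates is much wider for the small trial. That is consistent with the idea that chance imbalance in prognosis shows up as random error (which the standard error and confidence interval are meant to capture) rather than as systematic bias, even though any single small trial can still land far from the truth.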

Proposed genesis of bad MD critical appraisal habits:

  • The term “internal validity” seems to be used, implicitly, as a synonym for “exchangeability” during medical training/introductory epi. courses;

  • When learning how to critically appraise observational studies, we are told that balancing potential confounders between study arms improves internal validity;

  • We extrapolate this teaching and assume that balancing confounders must also improve the internal validity of RCTs (the study type that most influences medical practice);

  • Our teaching and reading materials often state that the reason why randomization promotes internal validity is that it “balances known and unknown confounders;”

  • Published RCTs reinforce this bad habit of hunting for between-arm covariate balance by including a prominent “Table 1,” which shows how certain covariates are distributed between treatment arms;

  • Our habit is further reinforced when we see p values in Table 1, highlighting covariates that appear to be unequally distributed between treatment arms. Many (most?) of us don’t have a firm enough grasp on epi/stats to realize that these p values make no sense: assuming no mechanical failure of the randomization process, any imbalances will, by definition, have occurred by chance;

  • We begin to look at Table 1 not as a way to gain a rough sense of the demographics/comorbidities of the study’s subjects, but rather as a way to identify possible sources of “bias;”

  • Since we learned that between-arm imbalances portend bias in observational studies, we conclude that any marked between-arm imbalance in potentially prognostically important covariates that we see in an RCT must have similarly compromised its internal validity;

  • We write off the RCT result as biased and don’t apply its findings in our practice;

  • MDs can misinterpret data presented in Table 1 of RCTs in at least two ways:

  1. We are trained to search for imbalance in prognostically-important covariates. If we see imbalance (highlighted to us by Table 1 and further emphasized with p values), we conclude that the RCT result might be biased. These beliefs might reflect improper extrapolation of observational study critical appraisal practices to RCT appraisal (?); or

  2. We might assume that covariate imbalances signal randomization failure. But my understanding is that while a mechanical failure of the randomization process could indeed produce between-arm covariate imbalances and that such a failure would indeed compromise the assumption of exchangeability, it doesn’t follow that observed covariate imbalances necessarily indicate randomization failure. Covariate imbalances will always be expected, simply due to chance, even with an intact “randomization machine.” Therefore, examining Table 1 to look for imbalances doesn’t seem like a reliable way to check for randomization failure and cannot be used as a way to assess whether the exchangeability assumption is satisfied (?)
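
The point about chance imbalance can be checked with a short sketch. The setup is hypothetical: both arms draw a baseline covariate from the same distribution, which is what a working randomization machine implies, and a two-sided z-test (a normal approximation, chosen only to stay within the Python standard library) is applied to each simulated "Table 1" row.

```python
import random
from statistics import NormalDist, fmean, variance


def baseline_p_value(n_per_arm, rng):
    """Two-sided z-test p-value comparing one baseline covariate between arms."""
    # Both arms come from the SAME distribution: the null hypothesis is true.
    arm_a = [rng.gauss(0, 1) for _ in range(n_per_arm)]
    arm_b = [rng.gauss(0, 1) for _ in range(n_per_arm)]
    se = (variance(arm_a) / n_per_arm + variance(arm_b) / n_per_arm) ** 0.5
    z = (fmean(arm_a) - fmean(arm_b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


rng = random.Random(2)
p_values = [baseline_p_value(100, rng) for _ in range(2000)]
share_flagged = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"share of baseline comparisons with p < 0.05: {share_flagged:.3f}")
```

Under these assumptions, roughly 1 in 20 baseline comparisons comes out "significant" even though randomization was perfect, so a small p value in Table 1 cannot by itself distinguish chance imbalance from a broken randomization process.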

Any corrections to the above interpretation would be welcome. It seems important not just to highlight what MDs do wrong during critical appraisal, but also to get to the root of how our misguided practices arise. Maybe this approach will allow EBM teachers to nip them in the bud…


You might find this article by Sander Greenland helpful. The interesting portion for me is his discussion of the subtle distinction between the statistical concept of bias and the epidemiological concept of confounding.

Addendum: I added [2] after the reply below by @ESMD. I will add other ones as I come across them.

The initial post and the reply raise questions that are important to address correctly, in a way that is formally rigorous but also helpful to practitioners.

Your question deserves an answer. I’ve searched for a long time, but AFAIK, there isn’t a readily accessible one. There is a large gap in the literature on the meaning of the term “evidence” when used by statisticians and mathematicians vs. the health and medical authorities. They use the same word, but attach very different meanings to it.

I’ve been struggling with this issue you bring up in your reply. I confess to having made pretty much all of the common errors of interpretation until I started learning about Bayesian methods and their application to decision theory.

Correct reasoning in uncertain situations is modeled using relatively advanced mathematical concepts.
Those compelled to make quick decisions (as doctors are) will find these ways of thinking tedious and “impractical” initially. This leads to various mental shortcuts that eventually fall apart when attempts are made to scale them.

  1. Greenland, S. (1990). Randomization, Statistics, and Causal Inference. Epidemiology, 1 (6), 421-429. Retrieved May 24, 2020, from www.jstor.org/stable/25759845
    Here is a direct link to the (pdf).

  2. Greenland, S., & Morgenstern, H. (2001). Confounding in Health Research. Annual Review of Public Health, 22, 189-212. https://www.annualreviews.org/doi/10.1146/annurev.publhealth.22.1.189

  3. Greenland, S., & Robins, J. M. (2009). Identifiability, exchangeability and confounding revisited. Epidemiologic Perspectives & Innovations, 6, 4. https://doi.org/10.1186/1742-5573-6-4


Thanks Robert

I tried my best to understand the article, but, like many stats/epi publications, it’s written at a level that I doubt most physicians taking introductory EBM courses would understand. This is a key issue in medical education: how to convey complicated concepts to novices in terms that are both understandable and accurate (“everything should be made as simple as possible but no simpler…”). Physicians need a functional grasp of stats/epi so we can decide which research findings to apply in clinical practice. But I think that what happens sometimes is that, in an effort to present material in a form that non-experts can understand, EBM teachers might inadvertently present certain concepts incorrectly to students (who then apply them incorrectly when they get out into practice).

It’s been many years since I looked at the JAMA series “Users’ Guide to the Medical Literature.” At the time it was published, this series was used to teach many medical trainees about critical appraisal. I’m sure that some of the content would generate robust arguments among epi/stats experts today, but it was considered a truly valuable resource to students at the time because it was written at a level that was understandable to its intended audience. Maybe a group of MDs/statisticians/epidemiologists will take the ball and update/revise it at some point…

New: it turns out all the JAMA articles were eventually published in text format. You can see that the language is very accessible to non-experts (which is great). But if you look at page 89, under the description of “bias,” you’ll see an example of why physicians get confused when they’re told not to look for baseline covariate imbalances as possible sources of bias…



But if you look at page 89, under the description of “bias,” you’ll see an example of why physicians get confused when they’re told not to look for baseline covariate imbalances as possible sources of bias…

I wanted to bump this as I found a great blog post that provides an accessible, constructive proof of why attempting to look at an instance of covariate balance to judge the validity of a randomization procedure is mistaken.

Covariate-Based Diagnostics for Randomized Experiments are Often Misleading (link)

Assuming a homogeneous population, a large N, a true treatment effect, complete randomization, and K covariates, we can show that:

This leads to a strange disconnect: the outcome of our experiment is always a perfect estimate of the ATE, but we can always detect arbitrarily strong imbalances in covariates. This disconnect occurs for the simple reason that there is no meaningful connection between covariate imbalance and successful randomization: balancing covariates is neither necessary nor sufficient for an experiment’s inferences to be accurate. Moreover, checking for imbalance often leads to dangerous p-hacking because there are many ways in which imbalance could be defined and checked. In a high-dimensional setting like this one, there will be, by necessity, extreme imbalances along some dimensions. For me, the lesson from this example is simple: we must not go looking for imbalances, because we will always be able to find them.

The epistemic state of an individual involved in a randomized experiment is distinct from someone who is reading the results of one, not having been involved in the planning.

Would a rational, Bayesian scientist who managed to convince another agent to set up a randomized experiment to decide a factual question accept these post hoc criticisms, especially when the other agent was permitted to decide and validate whatever randomization procedure was actually used?

Of course, the reader is not in the position to judge the quality of the randomization. He or she might discount the report based on prior knowledge, the adequacy of the report of the randomization procedure, etc.

But acceptance of post hoc arguments on imbalance would mean that no RCT would ever be able to decide a question of fact, since we can always retrospectively find “imbalance” if we look hard enough.
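
A back-of-the-envelope calculation shows how quickly "looking hard enough" pays off. This is a sketch under a simplifying assumption that is mine, not the blog post's: the K baseline covariates are independent, each is tested at level alpha, and randomization worked perfectly.

```python
def prob_any_flag(k, alpha=0.05):
    """Chance that at least one of k independent baseline tests has p < alpha,
    when every apparent imbalance is due to chance alone."""
    return 1 - (1 - alpha) ** k


for k in (1, 5, 10, 20, 50):
    print(f"K = {k:2d} covariates: P(at least one 'imbalance') = {prob_any_flag(k):.2f}")
```

With these assumptions the chance of at least one flagged covariate is about 40% at K = 10 and about 92% at K = 50. Real covariates are correlated, so the exact figures will differ, but the direction of the argument is the same.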


Hi Robert

It’s a shame that medical trainees don’t get more training in statistics; without this, it’s very hard to understand explanations of important critical appraisal topics that are math-based. This is why the Users’ Guide series (referenced above) was so popular: concepts were explained simply and generally in a narrative way. The downside of trying to teach a non-expert audience, of course, is that important concepts get twisted in a way that obscures their true complexity, in order to make them easier to digest. Twenty-five years after I first read the JAMA series, I am learning from this site that at least some important concepts were presented in a less-than-accurate way (e.g., bias, confidence intervals).

I feel strongly that medical trainees should be taught about certain key critical appraisal topics by statisticians. Unless our teachers show us how difficult it can be to draw valid inferences from medical research, there’s a risk that physicians will embark on their careers confident in their skills yet often wrong. We will remain stuck on the initial peak of the Dunning-Kruger curve, whereas we should probably finish our training somewhere deep in its valley.

For physician audiences, simple simulations would probably work best to convey complicated ideas. Here’s an example that someone linked to recently on Twitter that helped me to get my head around the meaning of confidence intervals:
