"Exchangeability" in observational studies versus RCTs

Introductory epidemiology courses that physicians take as part of their EBM training don’t always explicitly discuss exchangeability. Yet it seems like this concept might be important for us to understand (in at least a crude way) when reading medical studies. Could misunderstanding and/or poor/inconsistent teaching around this topic underlie some of our bad critical appraisal habits (e.g., comparing columns in Table 1 of RCTs)?

Written explanations of exchangeability become complicated very quickly- it seems like advanced math/stats/epi training is needed for deep understanding. I could only find a few articles on this subject that explained things in terms I could understand. These are a couple of examples:


Is this high-level, non-expert interpretation correct?:

  • exchangeability is an assumption that’s needed for valid causal inference;

  • the assumption is that the treated/untreated or exposed/unexposed groups in an RCT or observational study (respectively) would have experienced a similar distribution of outcomes of interest if we had simply followed them over time without treating/exposing either group (?)

Observational Studies

  • In observational studies, exchangeability can’t be assumed without an attempt to “balance” the exposed/unexposed groups with respect to covariates that are associated with both the likelihood of exposure and the outcome of interest. In other words, in the observational context, attention to balancing of potential confounders is an important method to promote exchangeability;

  • If there is no effort to balance confounders, the observational study will not be measuring what it’s supposed to (i.e., its “internal validity” will be compromised). The study will be measuring more than just the effect of the exposure/intervention and the resulting point estimate will be biased.
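To make the observational case concrete, here is a minimal simulation sketch. Everything in it is hypothetical: a binary confounder called `frail`, the exposure and outcome probabilities, and the function names are all assumptions chosen for illustration. The exposure is given no effect at all on the outcome, so any non-zero crude risk difference is pure confounding. Stratifying on the confounder is used here as one simple way of "balancing"; it is not the only one.

```python
import random

def simulate_observational(n=200_000, seed=0):
    """Hypothetical cohort: being 'frail' raises both the chance of exposure
    and the risk of the outcome. The exposure itself has NO effect."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        frail = rng.random() < 0.5
        exposed = rng.random() < (0.8 if frail else 0.2)
        outcome = rng.random() < (0.30 if frail else 0.10)  # ignores exposure
        rows.append((frail, exposed, outcome))
    return rows

def risk(rows):
    """Proportion of rows in which the outcome occurred."""
    return sum(outcome for _, _, outcome in rows) / len(rows)

def crude_risk_difference(rows):
    """Risk in the exposed minus risk in the unexposed, ignoring the confounder."""
    exposed = [r for r in rows if r[1]]
    unexposed = [r for r in rows if not r[1]]
    return risk(exposed) - risk(unexposed)

def stratified_risk_difference(rows):
    """Risk difference within each confounder stratum, averaged with
    weights proportional to stratum size."""
    total = 0.0
    for level in (False, True):
        stratum = [r for r in rows if r[0] == level]
        total += crude_risk_difference(stratum) * len(stratum) / len(rows)
    return total

rows = simulate_observational()
crude = crude_risk_difference(rows)
adjusted = stratified_risk_difference(rows)
print(f"crude risk difference:    {crude:+.3f}")
print(f"adjusted risk difference: {adjusted:+.3f}")
```

With these made-up probabilities the crude comparison should land near a risk difference of 0.12 even though the true effect is zero, while the stratified comparison should land near zero. The adjustment only works because the single confounder is known and measured, which is exactly what cannot be taken for granted in a real observational study.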


RCTs

  • A mechanically sound randomization process promotes exchangeability;

  • As the number of randomized subjects increases, the theoretical outcome distributions of the two study arms, in the absence of treatment, are expected to become more similar (i.e., exchangeability increases with increasing n ?);

  • When n is small, the theoretical/expected shapes of the outcome distributions for the two arms, in the absence of treatment, would be expected to be less similar (i.e., the two groups would be less “exchangeable”);

  • When a small study is actually run, this dissimilarity in theoretical outcome distribution between arms (suboptimal exchangeability) manifests as a pattern of actual outcomes that demonstrates a degree of within-arm and between-arm variance/scatter that is greater than that which would have occurred with a larger study. This greater “scatter” of outcomes that we see with a smaller randomized study translates to wider confidence intervals (greater “random error”) and this wider interval protects us from making overly-confident inferences in this context of suboptimal exchangeability (??);

  • Some epidemiologists use the term “random confounding” but it seems controversial (?) Do statisticians and epidemiologists agree on what this means? I never heard it in my training (many moons ago). When I hear the word “confounding,” I think of something that can confer bias i.e., move the point estimate. I can see how an RCT’s point estimate could be affected if suboptimal exchangeability is the result of a broken randomization machine (see below). But assuming the randomization machine is working properly, I can’t see how suboptimal exchangeability due to small n alone would be capable of moving/biasing a study’s point estimate (?)
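The small-n versus large-n questions in the bullets above can be explored with a short simulation. This is only a sketch under stated assumptions: the covariate `severity`, the effect size `TRUE_EFFECT`, the trial sizes and all function names are hypothetical. It repeats a properly randomized trial many times at two sample sizes and records, for each run, the chance imbalance in a strongly prognostic baseline covariate and the unadjusted effect estimate.

```python
import random
import statistics

TRUE_EFFECT = 1.0  # hypothetical treatment effect, chosen for illustration

def run_trial(n, rng):
    """One simulated RCT with n subjects, half allocated to treatment at random.
    Returns (baseline imbalance in a prognostic covariate, effect estimate)."""
    order = list(range(n))
    rng.shuffle(order)
    treated = set(order[: n // 2])
    severity = [rng.gauss(0, 1) for _ in range(n)]  # strongly prognostic
    outcome = [
        2.0 * s + (TRUE_EFFECT if i in treated else 0.0) + rng.gauss(0, 1)
        for i, s in enumerate(severity)
    ]
    sev_t = [severity[i] for i in range(n) if i in treated]
    sev_c = [severity[i] for i in range(n) if i not in treated]
    out_t = [outcome[i] for i in range(n) if i in treated]
    out_c = [outcome[i] for i in range(n) if i not in treated]
    imbalance = statistics.fmean(sev_t) - statistics.fmean(sev_c)
    estimate = statistics.fmean(out_t) - statistics.fmean(out_c)
    return imbalance, estimate

def summarize(n, reps=1000, seed=1):
    """Repeat the trial many times and summarise imbalance and estimates."""
    rng = random.Random(seed)
    results = [run_trial(n, rng) for _ in range(reps)]
    return {
        "mean_abs_imbalance": statistics.fmean(abs(i) for i, _ in results),
        "mean_estimate": statistics.fmean(e for _, e in results),
        "sd_estimate": statistics.stdev(e for _, e in results),
    }

small = summarize(n=20)
large = summarize(n=200)
for label, s in (("n=20", small), ("n=200", large)):
    print(label, {k: round(v, 3) for k, v in s.items()})
```

In this setup the typical chance imbalance and the spread of the estimates are both larger in the small trials, but the average estimate over repeated randomizations stays close to `TRUE_EFFECT` at both sample sizes. That supports the reading that small n widens the uncertainty rather than shifting the estimate systematically. It does not settle whether the imbalance in any single realized trial deserves the name "random confounding"; that remains a question of terminology.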

Proposed genesis of bad MD critical appraisal habits:

  • The term “internal validity” seems to be used, implicitly, as a synonym for “exchangeability” during medical training/introductory epi. courses;

  • When learning how to critically appraise observational studies, we are told that balancing potential confounders between study arms improves internal validity;

  • We extrapolate this teaching and assume that balancing confounders must also improve the internal validity of RCTs (the study type that most influences medical practice);

  • Our teaching and reading materials often state that the reason why randomization promotes internal validity is that it “balances known and unknown confounders;”

  • Published RCTs reinforce this bad habit of hunting for between-arm covariate balance by including a prominent “Table 1,” which shows how certain covariates are distributed between treatment arms;

  • Our habit is further reinforced when we see p values in Table 1, highlighting covariates that appear to be unequally distributed between treatment arms. Many (?most) of us don’t have a firm enough grasp on epi/stats to realize that these p values make no sense- assuming no mechanical failure of the randomization process, any imbalances will, by definition, have occurred by chance;

  • We begin to look at Table 1 not as a way to gain a rough sense of the demographics/comorbidities of the study’s subjects, but rather as a way to identify possible sources of “bias;”

  • Since we learned that between-arm imbalances portend bias in observational studies, we conclude that any marked between-arm imbalance in potentially prognostically important covariates that we see in an RCT must have similarly compromised its internal validity;

  • We write off the RCT result as biased and don’t apply its findings in our practice;

  • MDs can misinterpret data presented in Table 1 of RCTs in at least two ways:

  1. We are trained to search for imbalance in prognostically-important covariates. If we see imbalance (highlighted to us by Table 1 and further emphasized with p values), we conclude that the RCT result might be biased. These beliefs might reflect improper extrapolation of observational study critical appraisal practices to RCT appraisal (?); or

  2. We might assume that covariate imbalances signal randomization failure. But my understanding is that while a mechanical failure of the randomization process could indeed produce between-arm covariate imbalances and that such a failure would indeed compromise the assumption of exchangeability, it doesn’t follow that observed covariate imbalances necessarily indicate randomization failure. Covariate imbalances will always be expected, simply due to chance, even with an intact “randomization machine.” Therefore, examining Table 1 to look for imbalances doesn’t seem like a reliable way to check for randomization failure and cannot be used as a way to assess whether the exchangeability assumption is satisfied (?)
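The point about Table 1 p values can be checked with a small simulation. This is a sketch, not a recommended analysis: the sample sizes, the normally distributed covariate, the large-sample z-test and the function names are all assumptions made for illustration. Both arms are drawn from the same population, which is what a working randomization machine guarantees, so the null hypothesis being tested is true by design in every simulated trial.

```python
import random
from statistics import NormalDist

def mean_and_variance(x):
    """Sample mean and unbiased sample variance."""
    m = sum(x) / len(x)
    return m, sum((v - m) ** 2 for v in x) / (len(x) - 1)

def table1_p_value(arm_a, arm_b):
    """Two-sided p value from a large-sample z-test on the difference in means."""
    ma, va = mean_and_variance(arm_a)
    mb, vb = mean_and_variance(arm_b)
    z = (ma - mb) / (va / len(arm_a) + vb / len(arm_b)) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_p_values(trials=2000, n_per_arm=50, seed=2):
    """Baseline p values for one covariate across many intact randomizations."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(trials):
        # Working randomization: both arms are random draws from one population.
        arm_a = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        arm_b = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        p_values.append(table1_p_value(arm_a, arm_b))
    return p_values

p_values = simulate_p_values()
share_flagged = sum(p < 0.05 for p in p_values) / len(p_values)
share_below_half = sum(p < 0.5 for p in p_values) / len(p_values)
print(f"share of trials with p < 0.05: {share_flagged:.3f}")
print(f"share of trials with p < 0.50: {share_below_half:.3f}")
```

Roughly one trial in twenty is flagged as "significantly imbalanced" even though nothing went wrong with the allocation in any of them. A small p value in Table 1 therefore cannot distinguish a broken randomization machine from an intact one that happened to produce an unlucky draw.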

Any corrections to the above interpretation would be welcome. It seems important not just to highlight what MDs do wrong during critical appraisal, but also to get to the root of how our misguided practices arise. Maybe this approach will allow EBM teachers to nip them in the bud…


You might find this article by Sander Greenland helpful. The interesting portion for me is his discussion of the subtle distinction between the statistical concept of bias and the epidemiological concept of confounding.

Addendum: I added [2] after the reply below by @ESMD. I will add other ones as I come across them.

The initial post and the reply raise questions that are important to address correctly, in a way that is formally rigorous but also helpful to practitioners.

Your question deserves an answer. I’ve searched for a long time, but AFAIK, there isn’t a readily accessible one. There is a large gap in the literature on the meaning of the term “evidence” when used by statisticians and mathematicians vs. the health and medical authorities. They use the same word, but attach very different meanings to it.

I’ve been struggling with this issue you bring up in your reply. I confess to having made pretty much all of the common errors of interpretation until I started learning about Bayesian methods and their application to decision theory.

Correct reasoning in uncertain situations is modeled using relatively advanced mathematical concepts.
Those compelled to make quick decisions (as doctors are) will find these ways of thinking tedious and “impractical” initially. This leads to various mental shortcuts that eventually fall apart when attempts are made to scale them.

  1. Greenland, S. (1990). Randomization, Statistics, and Causal Inference. Epidemiology, 1 (6), 421-429. Retrieved May 24, 2020, from www.jstor.org/stable/25759845
    Here is a direct link to the (pdf).

  2. Greenland, S. and Morgenstern, H. (2001). Confounding in Health Research. Annual Review of Public Health, 22, 189-212. https://www.annualreviews.org/doi/10.1146/annurev.publhealth.22.1.189

  3. Greenland, S., & Robins, J. M. (2009). Identifiability, exchangeability and confounding revisited. Epidemiologic Perspectives & Innovations, 6, 4. https://doi.org/10.1186/1742-5573-6-4


Thanks Robert

I tried my best to understand the article, but, like many stats/epi publications, it’s written at a level that I doubt most physicians taking introductory EBM courses would understand. This is a key issue in medical education- how to convey complicated concepts to novices in terms that are both understandable and accurate (“everything should be made as simple as possible but no simpler…”). Physicians need a functional grasp of stats/epi so we can decide which research findings to apply in clinical practice. But I think that what happens sometimes is that, in an effort to present material in a form that non-experts can understand, EBM teachers might inadvertently present certain concepts incorrectly to students (who then apply them incorrectly when they get out into practice).

It’s been many years since I looked at the JAMA series “Users’ Guide to the Medical Literature.” At the time it was published, this series was used to teach many medical trainees about critical appraisal. I’m sure that some of the content would generate robust arguments among epi/stats experts today, but it was considered a truly valuable resource to students at the time because it was written at a level that was understandable to its intended audience. Maybe a group of MDs/statisticians/epidemiologists will take the ball and update/revise it at some point…

New- So it turns out all the JAMA articles were eventually published in text format. You can see that the language is very accessible to non-experts (which is great). But if you look at page 89, under the description of “bias,” you’ll see an example of why physicians get confused when they’re told not to look for baseline covariate imbalances as possible sources of bias…




I wanted to bump this as I found a great blog post that provides an accessible, constructive proof of why attempting to look at an instance of covariate balance to judge the validity of a randomization procedure is mistaken.

Covariate-Based Diagnostics for Randomized Experiments are Often Misleading (link)

Assuming a homogeneous population, a large N, a true treatment effect, complete randomization, and K covariates, we can show that:

This leads to a strange disconnect: the outcome of our experiment is always a perfect estimate of the ATE, but we can always detect arbitrarily strong imbalances in covariates. This disconnect occurs for the simple reason that there is no meaningful connection between covariate imbalance and successful randomization: balancing covariates is neither necessary nor sufficient for an experiment’s inferences to be accurate. Moreover, checking for imbalance often leads to dangerous p-hacking because there are many ways in which imbalance could be defined and checked. In a high-dimensional setting like this one, there will be, by necessity, extreme imbalances along some dimensions. For me, the lesson from this example is simple: we must not go looking for imbalances, because we will always be able to find them.
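A rough sketch of the "we will always be able to find them" point follows. It is not the construction used in the blog post; it is a simplified stand-in in which the trial size `N`, the effect `TRUE_EFFECT`, the number of covariates `K_MAX` and the helper `diff_and_se` are all hypothetical. A single trial is randomized once, the covariates are pure noise unrelated to the outcome, and the only thing that varies is how many covariates are searched for imbalance.

```python
import random

TRUE_EFFECT = 1.0  # hypothetical treatment effect, for illustration
N = 1000           # subjects in the single simulated trial
K_MAX = 500        # baseline covariates recorded, all unrelated to the outcome

rng = random.Random(3)
order = list(range(N))
rng.shuffle(order)
treated = set(order[: N // 2])
is_treated = [i in treated for i in range(N)]

def diff_and_se(values):
    """Treated-minus-control difference in means and its standard error."""
    t = [v for v, flag in zip(values, is_treated) if flag]
    c = [v for v, flag in zip(values, is_treated) if not flag]
    mt, mc = sum(t) / len(t), sum(c) / len(c)
    vt = sum((v - mt) ** 2 for v in t) / (len(t) - 1)
    vc = sum((v - mc) ** 2 for v in c) / (len(c) - 1)
    return mt - mc, (vt / len(t) + vc / len(c)) ** 0.5

# The outcome depends only on treatment plus noise.
outcome = [TRUE_EFFECT * flag + rng.gauss(0, 1) for flag in is_treated]
estimate, _ = diff_and_se(outcome)

# Standardized imbalance |z| for each covariate in this one randomization.
abs_z = []
for _ in range(K_MAX):
    covariate = [rng.gauss(0, 1) for _ in range(N)]
    d, se = diff_and_se(covariate)
    abs_z.append(abs(d / se))

# Largest imbalance found when searching the first k covariates.
worst = {k: max(abs_z[:k]) for k in (5, 50, 500)}
print(f"effect estimate: {estimate:.3f} (truth {TRUE_EFFECT})")
for k, z in worst.items():
    print(f"largest |z| among first {k:>3} covariates: {z:.2f}")
```

The effect estimate is computed once and does not change, yet the most extreme imbalance that can be reported grows as more covariates are searched. The randomization and the estimate are identical in all three rows; only the effort spent looking for imbalance differs.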

The epistemic state of an individual involved in a randomized experiment is distinct from someone who is reading the results of one, not having been involved in the planning.

Would a rational, Bayesian scientist who managed to convince another agent to set up a randomized experiment to decide a factual question accept these post hoc criticisms, especially when the other agent was permitted to decide and validate whatever randomization procedure was actually used?

Of course, the reader is not in the position to judge the quality of the randomization. He or she might discount the report based on prior knowledge, the adequacy of the report of the randomization procedure, etc.

But acceptance of post hoc arguments on imbalance would mean that no RCT would ever be able to decide a question of fact, since we can always retrospectively find “imbalance” if we look hard enough.


Hi Robert

It’s a shame that medical trainees don’t get more training in statistics- without this, it’s very hard to understand explanations of important critical appraisal topics that are math-based. This is why the Users’ Guide series (referenced above) was so popular- concepts were explained simply and generally in a narrative way. The downside of trying to teach a non-expert audience, of course, is that important concepts get twisted in a way that obscures their true complexity, in order to make them easier to digest. Twenty five years after I first read the JAMA series, I am learning from this site that at least some important concepts were presented in a less-than-accurate way (e.g., bias, confidence intervals).

I feel strongly that medical trainees should be taught about certain key critical appraisal topics by statisticians. Unless our teachers show us how difficult it can be to draw valid inferences from medical research, there’s a risk that physicians will embark on their careers confident in their skills yet often wrong. We will remain stuck on the initial peak of the Dunning-Kruger curve, whereas we should probably finish our training somewhere deep in its valley.

For physician audiences, simple simulations would probably work best to convey complicated ideas. Here’s an example that someone linked to recently on Twitter that helped me to get my head around the meaning of confidence intervals:


Erin and Robert, thank you both for raising this question. Your discussion sent me back to Pearl’s (2009) Causality, 2ed. §6.5.3 “Exchangeability versus Structural Analysis of Confounding,” and from there to Greenland & Robins (1986) — the prequel to Robert’s citation [3] above.

The latter piece opens up with the observation,

To our knowledge, no one has presented a theory of epidemiological confounding
based on a model for individual effects—a somewhat surprising state of affairs,
since a theory of synergy and antagonism has been derived from such models.

Greenland & Robins then proceed to develop a concrete exploration of identifiability, exchangeability and confounding, initially in the context of a single person. (Only after the first 1.5 pages does the setting expand to the population level.) I would hope clinicians could find this type of development immediately accessible by mapping it to the type of thinking (e.g., diagnostic) they employ in the patient encounter.

Incidentally, you may be interested to learn—as I did only recently—that James Robins’s inspiration came out of his experience as a physician (Richardson & Rotnitzky 2014).


Hi David

Thanks for the feedback and the links. Maybe I’m just unusually dense, but I’d wager that very few MDs have enough training in stats/epi to understand articles written at this level. I’d go even further to speculate that even a large majority of our EBM teachers might not understand them. This is not at all a criticism of the article, which is clearly intended for a sophisticated audience and sounds like it speaks to really fundamental epi concepts.

At the end of the day, most physicians will probably have to be satisfied with a crib-sheet approach to critical appraisal without necessarily understanding the complex underlying mathematical/statistical proofs of many concepts. This approach isn’t very satisfying for people who like to understand everything at a “first principles” level (including many physicians). But the unfortunate reality is that there just isn’t enough time in a medical school curriculum to teach the background that’s needed to understand the origin of the rules.

Of course, this high-level approach to learning becomes problematic if EBM teachers don’t have a firm enough understanding of the origin of the rules to provide students with a correct crib sheet. What might be missing is skilled translators who have sufficient mastery of underlying first principles that they can speak simply, yet accurately, to a novice audience about core critical appraisal concepts. Right now, I suspect many physicians get simple but incorrect teaching. And because concepts we’re taught sound simple, we become overly confident about the depth of our understanding and become defensive when that understanding is questioned. Maybe the only way to get simple AND accurate teaching is to have certain portions of EBM training outsourced to practising statisticians/epidemiologists.


I can understand this desire for heuristics to assess studies. But the big point that scholars like @Sander and Dr. Michael Rawlins have tried to make is that “evidence” is context-sensitive. [1]

The notion that evidence can be reliably or usefully placed in ‘hierarchies’ is illusory. Rather, decision makers need to exercise judgement about whether (and when) evidence gathered from experimental or observational sources is fit for purpose.

The following quote from [2] strikes me as the ideal of “evidence based medicine”

It’s about integrating individual clinical expertise and the best external evidence

I think a reasonable question to ask is: How can an individual defend an action as “rational” and justified? This very question is asked in the interdisciplinary field of normative decision theory.

Discussions about “evidence” take place in the context of making a decision. Without knowledge of possible courses of action, costs of various actions, and the rankings of all stakeholders involved in the decision, ‘evidence’ is a very ambiguous concept that gets thrown around a lot, but without much content.

Decision theory is where I started my journey on understanding these issues from the perspective of cognitive ergonomics. I think the Stanford Encyclopedia of Philosophy entry on Decision Theory is a great place to start to understand some of the complexities in this area. [3]

  1. Rawlins, M. (2008) De testimonio : on the evidence for decisions about the use of therapeutic interventions The Lancet https://doi.org/10.1016/S0140-6736(08)61930-3

  2. Sackett, D et al. (1996). Evidence based medicine: what it is and what it isn’t BMJ 1996;312:71

  3. Decision Theory (2015) Stanford Encyclopedia of Philosophy

Erin, my first instinct on seeing your original post was that ‘exchangeability’ was too esoteric a concept, much like ‘strong ignorability’ as criticized by Pearl in §11.2.3 of Causality 2ed. (2009). Thus, just as Pearl proposes more concrete and accessible replacements to “demystify” strong ignorability, I was thinking maybe the same would best be done for exchangeability. Having returned to that view after our short exchange above, I now find myself having second thoughts after encountering a meaningful application of exchangeability in Hobbs et al. (2018).

Hobbs BP, Kane MJ, Hong DS, Landin R. Statistical challenges posed by uncontrolled master protocols: sensitivity analysis of the vemurafenib study. Ann Oncol. 2018;29(12):2296-2301. doi:10.1093/annonc/mdy457 [Open access]