Introductory epidemiology courses that physicians take as part of their EBM training don’t always explicitly discuss exchangeability. Yet it seems like this concept might be important for us to understand (in at least a crude way) when reading medical studies. Could misunderstanding and/or poor/inconsistent teaching around this topic underlie some of our bad critical appraisal habits (e.g., comparing columns in Table 1 of RCTs)?
Written explanations of exchangeability become complicated very quickly; it seems that advanced math/stats/epi training is needed for a deep understanding. I could only find a few articles on this subject that explained things in terms I could understand. One example:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.974.6381&rep=rep1&type=pdf
Is the following high-level, non-expert interpretation correct?
- Exchangeability is an assumption that's needed for valid causal inference;
- The assumption is that the treated/untreated groups in an RCT, or the exposed/unexposed groups in an observational study, would have experienced the same distribution of the outcome of interest if we had simply followed them over time without treating/exposing either group (?)
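To check my reading, here is a toy potential-outcomes simulation (all numbers and scenarios hypothetical, just a sketch): each subject carries an outcome under treatment and an outcome under no treatment, and exchangeability means treatment assignment is independent of those potential outcomes. When assignment depends on prognosis, the naive comparison no longer recovers the true effect.

```python
import random
import statistics

random.seed(1)

# Hypothetical setup: each subject has two potential outcomes,
# y0 (if untreated) and y1 (if treated). True treatment effect = 2.0.
n = 10_000
y0 = [random.gauss(10, 3) for _ in range(n)]
y1 = [v + 2.0 for v in y0]

# Exchangeable assignment: treatment is independent of (y0, y1).
treated = [random.random() < 0.5 for _ in range(n)]
est_exch = (statistics.mean(y1[i] for i in range(n) if treated[i])
            - statistics.mean(y0[i] for i in range(n) if not treated[i]))

# Non-exchangeable assignment: sicker subjects (low y0) tend to get treated.
treated_biased = [y0[i] < 10 or random.random() < 0.1 for i in range(n)]
est_biased = (statistics.mean(y1[i] for i in range(n) if treated_biased[i])
              - statistics.mean(y0[i] for i in range(n) if not treated_biased[i]))

print(f"exchangeable estimate:     {est_exch:.2f}")   # lands near 2.0
print(f"non-exchangeable estimate: {est_biased:.2f}")  # pulled far from 2.0
```

The point of the sketch: the arithmetic of "mean outcome in treated minus mean outcome in untreated" is identical in both cases; only the assignment mechanism differs, and that alone determines whether the comparison is causally meaningful.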
Observational Studies
- In observational studies, exchangeability can't be assumed without an attempt to "balance" the exposed/unexposed groups with respect to covariates that are associated with both the likelihood of exposure and the outcome of interest. In other words, in the observational context, attention to balancing potential confounders is an important method to promote exchangeability;
- If there is no effort to balance confounders, the observational study will not measure what it's supposed to measure (i.e., its "internal validity" will be compromised). The study will be measuring more than just the effect of the exposure/intervention, and the resulting point estimate will be biased.
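A toy simulation of the observational case (entirely hypothetical numbers): a "severity" variable raises both the chance of exposure and the outcome, while the exposure itself does nothing. The crude comparison is badly biased; stratifying on the confounder recovers the true null effect.

```python
import random
import statistics

random.seed(2)

# Hypothetical confounder: "severity" raises both the chance of exposure
# and the outcome. The true effect of exposure on the outcome is 0 here.
n = 20_000
rows = []
for _ in range(n):
    severe = random.random() < 0.5
    exposed = random.random() < (0.8 if severe else 0.2)
    outcome = random.gauss(5.0 if severe else 1.0, 1.0)  # exposure adds nothing
    rows.append((severe, exposed, outcome))

def mean_outcome(is_severe, is_exposed):
    return statistics.mean(y for s, e, y in rows
                           if s == is_severe and e == is_exposed)

# Crude comparison ignores severity -> biased away from the true effect of 0.
crude = (statistics.mean(y for s, e, y in rows if e)
         - statistics.mean(y for s, e, y in rows if not e))

# Stratify on the confounder, then average the within-stratum differences.
adjusted = statistics.mean(
    mean_outcome(s, True) - mean_outcome(s, False) for s in (True, False)
)

print(f"crude difference:    {crude:.2f}")     # far from 0
print(f"adjusted difference: {adjusted:.2f}")  # near 0
```

Within each severity stratum the exposed and unexposed are (conditionally) exchangeable by construction, which is why the stratified contrast behaves while the crude one does not.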
RCTs
- A mechanically sound randomization process promotes exchangeability;
- As the number of randomized subjects increases, the theoretical outcome distributions of the two study arms, in the absence of treatment, are expected to become more similar (i.e., exchangeability increases with increasing n?);
- When n is small, the theoretical/expected distributions of outcomes for the two arms, in the absence of treatment, would be expected to be less similar (i.e., the two groups would be less "exchangeable");
- When a small study is actually run, this dissimilarity in theoretical outcome distributions between arms (suboptimal exchangeability) manifests as greater within-arm and between-arm variance/scatter in the actual outcomes than would have occurred in a larger study. This greater scatter translates into wider confidence intervals (greater "random error"), and the wider interval protects us from making overly confident inferences in this context of suboptimal exchangeability (??);
- Some epidemiologists use the term "random confounding," but it seems controversial (?) Do statisticians and epidemiologists agree on what this means? I never heard it in my training (many moons ago). When I hear the word "confounding," I think of something that can confer bias, i.e., move the point estimate. I can see how an RCT's point estimate could be affected if suboptimal exchangeability is the result of a broken randomization machine (see below). But assuming the randomization machine is working properly, I can't see how suboptimal exchangeability due to small n alone would be capable of moving/biasing a study's point estimate (?)
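A toy simulation of these last few points (hypothetical numbers; a sketch, not a proof): across many simulated RCTs with an intact randomization machine, baseline imbalance shrinks as n grows, while the effect estimate stays unbiased at every n — small n only widens the scatter (random error) of the estimate.

```python
import random
import statistics

random.seed(3)

TRUE_EFFECT = 2.0

def one_trial(n_per_arm):
    """Simulate one properly randomized trial.

    Returns (baseline covariate imbalance, treatment-effect estimate)."""
    base_t = [random.gauss(50, 10) for _ in range(n_per_arm)]  # baseline covariate
    base_c = [random.gauss(50, 10) for _ in range(n_per_arm)]
    # Outcome depends on the baseline covariate plus the true treatment effect.
    out_t = [b + TRUE_EFFECT + random.gauss(0, 5) for b in base_t]
    out_c = [b + random.gauss(0, 5) for b in base_c]
    imbalance = statistics.mean(base_t) - statistics.mean(base_c)
    estimate = statistics.mean(out_t) - statistics.mean(out_c)
    return imbalance, estimate

results = {}
for n in (20, 500):
    sims = [one_trial(n) for _ in range(400)]
    mean_abs_imb = statistics.mean(abs(i) for i, _ in sims)
    ests = [e for _, e in sims]
    results[n] = (mean_abs_imb, statistics.mean(ests), statistics.stdev(ests))
    print(f"n/arm={n:4d}  mean |imbalance|={mean_abs_imb:5.2f}  "
          f"mean estimate={statistics.mean(ests):5.2f}  "
          f"SD of estimate={statistics.stdev(ests):5.2f}")
```

In this sketch, the small trials show bigger baseline imbalances and a much larger SD of the estimate (wider confidence intervals), yet the estimates still average out to the true effect — consistent with the intuition above that small n inflates random error without, by itself, biasing the point estimate.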
Proposed genesis of bad MD critical appraisal habits:
- The term "internal validity" seems to be used, implicitly, as a synonym for "exchangeability" during medical training/introductory epi courses;
- When learning how to critically appraise observational studies, we are told that balancing potential confounders between study arms improves internal validity;
- We extrapolate this teaching and assume that balancing confounders must also improve the internal validity of RCTs (the study type that most influences medical practice);
- Our teaching and reading materials often state that the reason randomization promotes internal validity is that it "balances known and unknown confounders";
- Published RCTs reinforce this bad habit of hunting for between-arm covariate balance by including a prominent "Table 1," which shows how certain covariates are distributed between treatment arms;
- Our habit is further reinforced when we see p values in Table 1 highlighting covariates that appear to be unequally distributed between treatment arms. Many (most?) of us don't have a firm enough grasp of epi/stats to realize that these p values make no sense: assuming no mechanical failure of the randomization process, any imbalances will, by definition, have occurred by chance;
- We begin to look at Table 1 not as a way to gain a rough sense of the demographics/comorbidities of the study's subjects, but rather as a way to identify possible sources of "bias";
- Since we learned that between-arm imbalances portend bias in observational studies, we conclude that any marked between-arm imbalance in potentially prognostically important covariates that we see in an RCT must have similarly compromised its internal validity;
- We write off the RCT result as biased and don't apply its findings in our practice;
- MDs can misinterpret the data presented in Table 1 of RCTs in at least two ways:
  - We are trained to search for imbalance in prognostically important covariates. If we see imbalance (highlighted by Table 1 and further emphasized with p values), we conclude that the RCT result might be biased. This belief might reflect improper extrapolation of observational-study critical appraisal practices to RCT appraisal (?); or
  - We might assume that covariate imbalances signal randomization failure. My understanding is that while a mechanical failure of the randomization process could indeed produce between-arm covariate imbalances, and such a failure would indeed compromise the exchangeability assumption, it doesn't follow that observed covariate imbalances necessarily indicate randomization failure. Some covariate imbalance is always expected, simply due to chance, even with an intact "randomization machine." Therefore, examining Table 1 for imbalances doesn't seem like a reliable way to check for randomization failure and cannot be used to assess whether the exchangeability assumption is satisfied (?)
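A toy simulation of the Table 1 point (hypothetical covariate and numbers): simulate many properly randomized trials and run a significance test on a baseline covariate (say, age) between arms. Since any imbalance is chance by construction, the p values are uniform, and roughly 1 in 20 trials will flag a "significant" imbalance despite a perfectly intact randomization machine.

```python
import math
import random
import statistics

random.seed(4)

def two_sample_p(a, b):
    """Approximate two-sided p value for a difference in means (normal approx)."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Many trials drawing both arms from the SAME population: every observed
# imbalance in the baseline covariate is, by construction, pure chance.
n_trials, n_per_arm = 1000, 100
pvals = []
for _ in range(n_trials):
    arm_a = [random.gauss(60, 12) for _ in range(n_per_arm)]  # "age", arm A
    arm_b = [random.gauss(60, 12) for _ in range(n_per_arm)]  # same population
    pvals.append(two_sample_p(arm_a, arm_b))

frac_sig = sum(p < 0.05 for p in pvals) / n_trials
print(f"fraction of trials with a 'significant' Table 1 imbalance: {frac_sig:.3f}")
```

The fraction lands near 0.05, which is the whole problem: a Table 1 p value below 0.05 is exactly as common under flawless randomization as the significance threshold dictates, so it carries no information about whether the randomization machine was broken.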
Any corrections to the above interpretation would be welcome. It seems important not just to highlight what MDs do wrong during critical appraisal, but also to get to the root of how our misguided practices arise. Maybe this approach will allow EBM teachers to nip them in the bud…