Causal inferences from RCTs: could “toy” clinical examples promote understanding?

This recent publication discusses the pitfalls of “responder analysis” in the context of trials of anti-amyloid therapies for Alzheimer’s disease.

https://alz-journals.onlinelibrary.wiley.com/doi/10.1002/alz.14457

In general, I find statistics-based explanations of responder analysis pretty confusing. I know these analyses are bad practice and recognize them when I see them. But, as a physician, graphs aren’t the method I rely on to understand why they are bad. Maybe I’m thinking about things in an overly simplified way, but it feels like the whole issue boils down to a category error in causality assessment (if that’s the correct term) that many inexpert researchers/readers make when analyzing RCT results. Everyone has heard the expression “RCTs can show causality,” but I strongly suspect that many people misunderstand its meaning. In turn, I think that this fundamental misunderstanding fuels a lot of other bad practices, e.g., responder analysis.

If I had to summarize the key message I took from this paper, in layman’s terms (so as to make it understandable for the average clinician), I would say:

“RCT analysis is a comparison between groups. In nearly all cases (unless the trial involves repeated crossovers or an N-of-1 design), the “causality” that we infer from a trial result is causality at the level of groups, NOT at the level of individual patients.

The purpose of a trial is to find treatments that work. In order to approve a drug, regulators only need to be convinced that it has the potential to help people (NOT a specific person). For this reason, only very rarely will a trial be designed in such a way that we will be able to validly assess “responsiveness” of individual trial participants to the therapy being tested.

The vast majority of trials, given the way they are designed, are only capable of telling us, through between-group comparisons of outcome rates, how confident we can be that the therapy has intrinsic efficacy. If we have observed a reasonable number of outcomes of interest and if the between-group difference in outcome rates is sufficiently great, then we can infer that the drug being tested was responsible for that difference, i.e., that the drug “caused” one group to have better outcomes than the other group. But in the absence of a crossover or N-of-1 design, the trial result is not capable of telling us whether the therapy “caused” any particular outcome for any particular trial patient (with very rare exceptions, e.g., unanticipated immediate adverse events such as anaphylaxis soon after administration).

As soon as we start trying to examine RCT results in a more granular way than comparing where patients randomized to trial arm “A” ended up at the end of the trial, as compared with patients in trial arm “B,” we enter dangerous territory. If researchers are not being closely supervised by a well-trained statistician, they might naively start digging deeper into the trial results by analyzing what happened during the trial within each arm. And from there, it’s just a short leap to analyzing change scores for individual patients. And, once their statistical misadventure is this far advanced, it becomes very easy to succumb to the alluring (yet mistaken) idea that these “observed” changes in the scores of individual patients were “caused” by the treatment. Researchers/readers who are this far down the rabbit hole can then start believing the most extreme forms of nonsense (e.g., “We don’t even need a control group to get the answers we’re looking for!”). If researchers who make this error are also physicians, then we can conclude that they have not internalized the profound implications of a phenomenon that they witness every day in clinic, namely, that many disease states fluctuate in severity or change over short periods of time even in the absence of therapeutic changes. A patient does not need to be enrolled in an RCT in order for us to see his clinical score change from one day/week/month to the next.”
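Since the thread title asks whether “toy” examples could promote understanding, here is one possible toy simulation of that last point (Python; all numbers and the 3-point “responder” threshold are invented for illustration, not taken from the paper). It generates an untreated arm whose scores fluctuate from visit to visit and then counts how many patients a naive change-score rule would label as “responders” even though nobody received any treatment at all.

```python
# A minimal sketch (made-up numbers): an untreated arm whose disease scores
# fluctuate from visit to visit, scored against a naive "responder" definition
# (improvement of >= 3 points from baseline). Many patients qualify as
# "responders" despite receiving no treatment.
import numpy as np

rng = np.random.default_rng(0)

n_patients = 200
true_severity = rng.normal(loc=20, scale=4, size=n_patients)  # stable underlying disease
noise_sd = 3                                                  # visit-to-visit fluctuation

baseline = true_severity + rng.normal(0, noise_sd, n_patients)
followup = true_severity + rng.normal(0, noise_sd, n_patients)

change = baseline - followup        # positive = apparent "improvement"
responders = change >= 3            # naive responder threshold

print(f"'Responders' with no treatment at all: {responders.mean():.0%}")
# Roughly a quarter of untreated patients cross the threshold purely through
# measurement noise and natural fluctuation. Attributing the same change to the
# drug in an active arm is exactly the category error described above.
```

The point of the toy example is not the particular percentage; it is that a change-score “response” is something we would expect to see in plenty of patients even with no drug on board.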

This is my interpretation of what the authors refer to in their paper as “causal fraud.” I understand “responder analysis” as the endpoint of a cascade of statistical misadventure, culminating in a category error that confuses individual-level causality with group-level causality. I think many people observe that a change in score has occurred for an individual trial participant during a single period of exposure in a trial and assume, since the person was “randomized” to the therapy they received, that it was the therapy that caused the change. They confuse individual-level causality (NOT demonstrable in the vast majority of trials given the way they are designed) with group-level causality (which IS demonstrable given typical designs). As a result, they don’t appreciate that the act of analyzing, in great detail, how individual patients’ scores changed during the trial is an exercise in futility. In order to avoid falling into these inferential traps, it’s important for researchers to resist the urge to “go deep” when analyzing RCTs. Rather, they need to remain “zoomed out,” focusing their attention on the between-group comparison when making inferences about causality.
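In the same “toy example” spirit, here is a small potential-outcomes sketch (Python; every number is invented) of why the between-group comparison supports group-level causal inference while individual-level “response” stays hidden. Each simulated patient carries two potential outcomes, one with the drug and one without; randomization lets the difference in arm means recover the average effect, but only one of the two outcomes is ever observed for any given patient, so that patient’s own effect is unknowable from the trial data.

```python
# A minimal potential-outcomes sketch (all numbers invented). Lower scores are
# taken to mean better outcomes; the drug helps by 2 points on average, but the
# individual effect varies from patient to patient.
import numpy as np

rng = np.random.default_rng(1)

n = 1000
y_control = rng.normal(10, 3, n)             # outcome each patient WOULD have untreated
individual_effect = rng.normal(-2, 3, n)     # varies by patient; some are even harmed
y_treated = y_control + individual_effect    # outcome each patient WOULD have on the drug

assigned_drug = rng.random(n) < 0.5          # randomize 1:1
observed = np.where(assigned_drug, y_treated, y_control)  # only one outcome is ever seen

# Group-level inference: the between-arm difference recovers the average effect (~ -2).
arm_diff = observed[assigned_drug].mean() - observed[~assigned_drug].mean()
print(f"Between-arm difference in means: {arm_diff:.2f}  (true average effect: -2)")

# Individual-level inference: a given patient's personal effect is y_treated - y_control,
# and one of those two numbers is always missing for that patient, so nothing in the
# observed data identifies who "responded."
print(f"True individual effects range from {individual_effect.min():.1f} "
      f"to {individual_effect.max():.1f}, yet each patient contributes only one observed score.")
```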
