Generalizability vs. Transportability in Trials

I agree with your thoughts. Causal object explication clarifies “what exactly is the effect we are defining?” while external validity addresses a different problem: “what population is our causal object defined over?” As eligibility criteria change across trials, the trials report different causal objects because the population component differs, and that population specification is where external validity enters.

If our causal object is

$$E[Y(1)-Y(0) \mid S=1],$$

where $S=1$ denotes selection into the trial, then asking whether it generalizes to the broader population is asking about a different causal object:

$$E[Y(1)-Y(0)]$$

So external validity in analytical studies is essentially about the applicability of one causal object to another population. I propose we use the term applicability throughout, replacing transportability with causal applicability.
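For completeness, here is a sketch of one standard identification result connecting the two causal objects (assuming the covariates $X$ capture every effect modifier whose distribution differs between the trial and the target population, that every covariate pattern in the target population had a positive probability of entering the trial, and writing $A$ for assigned treatment):

$$E[Y(1)-Y(0)] = \sum_x \Big( E[Y \mid X=x, S=1, A=1] - E[Y \mid X=x, S=1, A=0] \Big)\,\Pr(X=x)$$

The trial supplies the covariate-specific effects; the target population supplies the covariate distribution over which they are averaged. Whichever label we use (generalizability, transportability, applicability), these are the assumptions doing the work.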

Came across a nice article from 2001, written by David Sackett:

Sackett DL. Why randomized controlled trials fail but needn’t: 2. Failure to employ physiological statistics, or the only formula a clinician-trialist is ever likely to need (or understand!). CMAJ. 2001 Oct 30;165(9):1226-37. PMID: 11706914; PMCID: PMC81587.

Reference #13 was a bit hard to find, but I managed to dig it up. It seems relevant to this thread:

Sackett DL. Pronouncements about the need for “generalizability” of randomized control trial results are humbug. Control Clin Trials 2000;21:82S

PRONOUNCEMENTS ABOUT THE NEED FOR “GENERALISABILITY” OF RANDOMISED CONTROL TRIAL RESULTS ARE HUMBUG

David L. Sackett

Trout Research & Education Centre at Irish Lake

Markdale, Ontario, Canada

When randomised control trials (RCTs) are used to inform clinicians’ decisions about which treatments to offer individual patients, cautionary pronouncements that their results must be “generalisable” (including those my colleagues and I proposed 2 decades ago) are humbug twice over. First, experiences on both my clinical service at the John Radcliffe Hospital in Oxford, UK and the >100 services I have visited at other hospitals have documented that front-line clinicians do not want to “generalise” an RCT’s results to all patients, but only to “particularise” its results to their individual patient. They do this by making bedside adjustments to the RCT’s results by using their clinical judgements about their patient’s unique risk, responsiveness, and values. Thus, “generalisability” to all patients is irrelevant to frontline clinicians. Second, cautionary pronouncements about generalisability should have credibility only if the failure to achieve it leads to QUALITATIVE differences (in KIND) of responses in which experimental therapy is, on average, unambiguously helpful inside the trial but equally unambiguously harmful or powerfully useless, on average, to similar patients outside it. QUANTITATIVE differences (in the DEGREE of help or harm) are to be expected, but are routinely handled by bedside adjustments using clinical judgement. Hence this challenge: a free dinner goes to the first person who provides the author with 6 convincing examples of qualitative differences in the average responses of randomised patients (in RCTs with, say, at least 100 events) and eligible-but-not-randomised patients outside it. Since there have probably been more than 250,000 RCTs in health care over the last half century, if generalisability is really so important, this dinner should be easy to win.

1 Like

Interesting. Hard to say whether he fully understood the role of randomization in RCTs (see the recent related three-part series on why the EBM movement missed this from its inception: part 1, part 2, part 3), but I would bet he did not. The clinical scenarios here include examples of patients that would have been eligible for the KEYNOTE-564 RCT (>100 events) and would not benefit from the therapy it tested versus placebo, as well as examples that would have been ineligible for the KEYNOTE-564 RCT and would benefit from the therapy it tested versus placebo.

1 Like

Thanks for the links to the Matthews articles. We can learn so much from studying the history of science. It’s particularly valuable to identify the historical roots of common scientific/statistical misconceptions.

There’s no question that Sackett understood the value of randomization. But, as Matthews pointed out, even prominent proponents of randomization (like Hill) didn’t necessarily appreciate all the reasons why it’s so valuable (specifically, the fact that it allows us to estimate the uncertainty that accompanies a study result).

I don’t see that any of the patient cases in your paper undermine Sackett’s main point. He seems to be saying that qualitative interactions tend to be such rare clinical phenomena that physicians, in general, shouldn’t let a nebulous fear of this type of interaction drive treatment decisions in the postmarket setting. Of course, physicians should always ensure that their patient’s disease shares the biological mechanism that will drive the therapeutic effect (Sackett makes this point nicely). Your cases mainly highlight the importance of considering the impact of quantitative interaction (risk magnification) when advising specific patients about whether to take a therapy.

A clinical scenario that would satisfy Dr. Sackett’s challenge is the patient with a rare/idiosyncratic physiologic feature (metabolic/immunologic/genetic) that renders him, and perhaps only a very small number of other patients in the world, susceptible to a particular adverse reaction (e.g., a severe hypersensitivity reaction) when exposed to a particular therapy. This type of patient-by-treatment qualitative interaction is often (?usually) so rare as to not be identifiable during typically-sized clinical trials. In the postmarket setting, this patient might appear “similar” to clinical trial subjects with regard to the biologic mechanism driving his disease. But, if offered the therapy, he will react adversely to it, even though the trial showed a beneficial effect, on average, at the group level. These types of qualitative interactions (idiosyncratic vulnerability of certain patients to ADRs) tend to be so rare, and their mechanisms so rarely understood, that they are almost never predictable in clinical settings. Therefore, clinical decision-making at the level of an individual patient will almost never be affected by knowledge that an adverse qualitative interaction like this can occur.

2 Likes

Yup, in the grand scheme of things I think that Sackett et al. were on the right side of history to advocate for RCTs, and there is little we disagree with. But I do suspect that if he had been aware of Fisher’s versus Neyman’s work on RCTs and the related nuances vis-à-vis how they were incorporated in medicine, he might have been a less enthusiastic proponent, e.g., of NNTs. A stronger emphasis on quantifying uncertainty versus point estimates of comparative metrics reveals the deficiencies of NNTs, at least during the time Sackett was advocating for them.

Perfection, of course, is the enemy of the good. But as evidenced by Matthews’s article series, at some point in the past few decades this distortion ultimately led to the rise of randomized non-comparative trials (RNCTs). Our group of clinicians just had a strong debate with the biostatisticians of an outside trial sponsor who insisted on using an RNCT design. This culminated in one of our clinicians (who helped lead the development of the related agent and study) resigning from participating in the RNCT portion of the study. There is something fundamentally wrong in medicine when we are forced to resort to such measures over a trial design flaw that should be obvious to all stakeholders.

2 Likes

Yes, I agree. I too noticed the mention of “number needed to treat” in the Sackett article. It’s always profoundly disconcerting when a well-respected person (as Sackett undeniably was) writes something like that. My response is always :woman_facepalming:, since I then start questioning what else the person might have misconstrued…

The concepts of “NNT,” “responder analysis,” and “RNCT” are clearly very closely intertwined. I’ve come to view these horribly flawed practices like diseases that will never be eradicated until their root cause is identified and the original conceptual errors openly acknowledged. Unfortunately, statistics seems to be a field that’s terribly resistant to course-correction…

I’m not sure that these bad practices are necessarily caused by a failure to understand the nuances of Fisher’s writings (though I’m sure those who have read Fisher deeply wouldn’t perpetuate them). Rather, at their root seems to be a failure to understand the type of evidence needed to identify causality at the level of groups versus individuals. This distinction should be easily discernible by any practising physician (as noted in the Causal inferences from RCTs- could “toy” clinical examples promote understanding? thread), but clearly it’s not. Even well-cited statisticians make the error (see the paper cited in post #236 of the Individual response thread). I’m convinced that this is the exact conceptual misunderstanding that needs to be aggressively targeted in order to eradicate these bad practices.

2 Likes

Yup, my coarse view is that there is a chain running from Hill to Cochrane to Sackett. Cochrane was deeply influenced by Bradford Hill’s teaching on epidemiology and RCTs at the London School of Hygiene and Tropical Medicine, and always acknowledged Hill’s important influence in introducing him to the principles of using RCTs to obtain unbiased estimates of treatment effects. Cochrane’s Effectiveness and Efficiency is essentially a philosophical and ideological foundation for the work of Sackett et al.

But Fisher and Hill valued randomization for fundamentally different epistemological reasons. Fisher saw randomization as necessary for the valid interpretation of statistical significance tests, while Hill saw randomization primarily as a means to prevent biased estimates of treatment effects. The EBM movement generally adopted a somewhat naive amalgam: randomization is good because it prevents bias (Hill) AND allows valid statistical testing (Fisher), without rigorously distinguishing these two claims or engaging with the tensions between them.

With that in mind, there is an underappreciated inversion in Sackett’s approach to generalizability that I very much conceptually sympathize with: it was distinctively patient-centered rather than population-centered. The fundamental question was not “Is my population the same as the trial population?” but rather “Is there a compelling reason why the results of this trial should not apply to my individual patient?” Many eligibility criteria are administrative or logistical rather than biological. Instead, the clinician should ask whether there is a biologically compelling reason why the treatment effect would differ in their patient. The emphasis on biology and pathophysiology here is notable. Rather than requiring that your patient be represented in the trial sample (which would make almost every trial inapplicable to almost every patient), the framework placed the burden on articulating a specific biological mechanism by which the treatment effect would differ. This is indeed very much in alignment with the current thread.

But this is also where an additional emphasis on the mathematical underpinnings of statistical inference can strengthen the framework: Sackett in his bet above implicitly assumes that if there is no qualitative interaction (e.g., treatment does not become harmful), we can apply the RCT result. But this ignores that a quantitative interaction could still mean the treatment effect is negligibly small in our patient’s subgroup (reference class) while the harms remain constant.
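A toy calculation makes this concrete (all numbers are hypothetical): suppose a trial odds ratio of 0.7 really does carry over unchanged. For a patient with a baseline risk of 30%,

$$\text{odds}_0 = \frac{0.30}{0.70} \approx 0.43, \qquad \text{odds}_1 = 0.7 \times 0.43 = 0.30, \qquad p_1 = \frac{0.30}{1.30} \approx 0.23,$$

an absolute benefit of about 7 percentage points. For a patient with a baseline risk of 2%, the same odds ratio yields $p_1 \approx 1.4\%$, an absolute benefit of only about 0.6 percentage points. If the therapy also carries, say, a fixed 1-percentage-point absolute harm, the benefit-harm balance flips for the low-risk patient even though nothing qualitative has happened on the odds scale.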

The time is now to more organically integrate the views of RCTs as bias-reducers and as inference engines. Stay tuned for more on this topic :slight_smile:

5 Likes

Fantastic discussion. I would prefer not to see the term quantitative interaction applied as it has been above. Risk magnification does not represent any kind of interaction unless people insist on analyzing interaction on the wrong scale.

A key problem with NNT can be reduced to misunderstandings of many statisticians on the simple $2\times 2$ table and what Pearson’s $\chi^2$ test was designed for. Pearson’s test (and all of its variants) starts with an assumption that every patient in a treatment group has the same probability of the event. The same assumption is made by the non-covariate-specific NNT. Likewise, the t test assumes that every patient within a treatment group has a measurement drawn from a normal distribution with the same mean and variance as all other patients in that group. These assumptions of outcome homogeneity make naive statistical calculations a lot less meaningful than researchers (including statisticians) assume, and invalidate uncertainty measures.
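To illustrate the non-covariate-specific-NNT part of that point, here is a minimal numerical sketch. Every input is invented: the two risk strata (5% and 50%), their 50/50 mix, and the odds ratio of 0.7 assumed constant across strata are purely for illustration.

```python
# Sketch: how within-arm outcome heterogeneity hides behind arm-level summaries.
# All inputs are made up for illustration; nothing here comes from a real trial.

def risk_under_treatment(p0: float, odds_ratio: float) -> float:
    """Risk under treatment implied by baseline risk p0 and an assumed-constant odds ratio."""
    odds0 = p0 / (1 - p0)
    odds1 = odds_ratio * odds0
    return odds1 / (1 + odds1)

# Hypothetical control arm: two equally sized risk strata, 5% and 50%.
baseline_risks = [0.05, 0.50]
weights = [0.5, 0.5]
assumed_or = 0.7  # assumed constant across strata, purely for illustration

# Arm-level ("marginal") risks: what a 2x2 table and a crude NNT are built from.
p_control = sum(w * p for w, p in zip(weights, baseline_risks))
p_treated = sum(w * risk_under_treatment(p, assumed_or)
                for w, p in zip(weights, baseline_risks))
marginal_rd = p_control - p_treated

print(f"Marginal control risk:    {p_control:.3f}")   # 0.275 -- describes no actual patient
print(f"Marginal risk difference: {marginal_rd:.3f}")
print(f"Crude NNT:                {1 / marginal_rd:.1f}")

# Patient-specific risk differences implied by the same odds ratio.
for p0 in baseline_risks:
    rd = p0 - risk_under_treatment(p0, assumed_or)
    print(f"Baseline risk {p0:.2f}: RD = {rd:.3f}, NNT = {1 / rd:.1f}")
```

The arm-level event proportion (about 27.5%) and the single crude NNT (about 19) describe no one in the arm; the patient-specific NNTs implied by the very same odds ratio range from roughly 11 to roughly 69.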

@Stephen has written about some of Sackett’s errors.

3 Likes

Indeed. I attended a talk by Julia Josse on this issue!

I do not find Sackett’s formulation particularly illuminating. A randomized trial estimates an average treatment effect, not the effect for any specific patient. This is true whether or not an individual patient participated in the trial. The relevant question is not trial participation, but whether the causal mechanism tested in the trial is operative in the patient under consideration.

Consider a trial of streptomycin for alveolar tuberculosis. A patient with a streptomycin-resistant strain may have been enrolled in the trial or not; what matters is resistance, because resistance removes the causal pathway by which streptomycin exerts benefit. Conversely, a physician may reasonably generalize results to chronic cavitary TB if the same antimicrobial mechanism is plausibly operative. Causal coherence, not cohort membership, is the decisive issue.

This is the modern pinch point. Care is increasingly protocol-driven rather than physician-specific, so trial results are rapidly generalized to entire populations. Generalization now sits high on the patient-risk hierarchy, and individual clinicians may not be in a position to protect patients from false transport.

A recent example illustrates the danger. Evidence-based guidelines recommended “high PEEP” ventilator strategies for severe COVID-19 pneumonia, “extrapolated” (generalized) from ARDS trials in which COVID pneumonia did not exist. These dangerous protocol-level generalizations were later abandoned, yet the methodological lesson was largely ignored.

In contemporary practice, RCT protocols move directly to bedside care through guideline committees. Trialists, statisticians, and consensus panels therefore occupy positions of substantial decision-making power. At that level of influence, semantics about “similar populations” and the difference between transport and generalization are no substitute for explicit causal modeling and transport analysis.

2 Likes

That makes sense, and I’ll add that a randomized trial provides the best available estimate of the likely benefit of therapy for an individual patient. That is, until more covariates come along that should have been included in the covariate-adjusted treatment comparison. For a normally distributed continuous outcome Y, the mean Y in a group of similar patients is the best available estimate of the outcome for an individual in the group, if low mean squared error is your goal.
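For anyone who wants the underlying identity spelled out: for any candidate constant prediction $c$ for patients in a given covariate group,

$$E[(Y-c)^2] = \operatorname{Var}(Y) + (E[Y]-c)^2,$$

which is minimized by taking $c = E[Y]$, the group mean. No other single number does better on mean squared error, even though individual outcomes still scatter around that mean with variance $\operatorname{Var}(Y)$.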

2 Likes

There is literally nothing you said that disagrees with Sackett’s approach…

2 Likes

The NEJM Special Article that introduced the concept of NNT in 1988 examined subgroups of a hypertension trial. The authors noted that patients with target organ damage appeared to derive more absolute benefit from antihypertensive drugs than those without target organ damage. They used these observations to argue that clinicians would get more “bang for their buck” when treating patients who already had target organ damage. They expressed this “bang for the buck” in terms of NNT:

Laupacis A, Sackett D, Roberts R. An Assessment of Clinically Useful Measures of the Consequences of Treatment. N Engl J Med 1988;318:1728-1733.

“The “number needed to be treated” is the number of patients who must be treated in order to prevent one adverse event. For example, in the Veterans Administration trial, if 100 control patients without target-organ damage had been followed for three years (risk of adverse event 0.098), 10 events should have been expected. If, however, 100 such patients had been treated with antihypertensive agents and followed for three years (risk of adverse event, 0.040), only four events would have been expected. Thus, on average, treating 100 such patients for three years would have prevented six (10-4) adverse events, meaning that 17 patients (100 divided by 6) would have had to be treated in order to prevent one event. However, similar calculations reveal that among patients with initial target-organ damage, only seven would have had to be treated for three years in order to prevent one event.

Mathematically, the number needed to be treated is equivalent to the reciprocal of the absolute risk reduction. The number needed to be treated has the same advantage over the relative risk reduction and odds ratio as the absolute risk reduction in that it expresses efficacy in a manner that incorporates both the baseline risk without therapy and the risk reduction with therapy…”

As a physician without expertise in statistics or epidemiology, I feel like I’ve never been able to internalize, deeply enough, statistical criticisms of NNT. The criticisms are multiple, but it feels like there must be a single “root” conceptual error, perpetrated by the developers of the technique, without which the technique would never have caught on.

Question: Is the most fundamental problem with NNT the fact that it relies on an assumption that the absolute event rate in a given RCT arm would remain constant from one trial to the next?

The second sentence of the first paragraph reads: “For example, in the Veterans Administration trial, if 100 control patients without target-organ damage had been followed for three years (risk of adverse event 0.098), 10 events should have been expected.” But later in their paper, the authors describe one of several “shortcomings” of the NNT as follows:

“…Fourth, any measure of the benefit of treatment may vary considerably in different trials of the same or similar therapy because of different patient populations, trial designs (e.g., whether the therapy is evaluated in a setting designed to maximize compliance or as part of routine patient care), or chance…”

Isn’t this “shortcoming” a fatal flaw of the NNT concept (??) Doesn’t this “shortcoming” completely undermine the purpose of calculating NNT (??) Am I missing something (?)

The authors clearly (but indirectly) acknowledge the fact that another trial, designed with the same eligibility criteria, and testing the same therapy, might have recorded quite a different event rate in the control arm. Yet, for some reason, they don’t consider the fact that the group-level ARR can vary from one trial to another to be a “deal-breaker” when they advise extrapolating it to decision-making at the level of an individual patient (?)…

Does a failure to recognize the profound implications of the non-transportability (?correct wording) of the risk difference also lie at the root of the RNCT phenomenon (?) If people believe that an event rate documented in a trial’s control arm is a “constant” (i.e., that the control arm event rate will be the same from one trial to the next), then they will fail to understand the purpose of concurrent control (?)

2 Likes

No, that is not a valid criticism. The RD can be computed from the OR and any baseline risk, so any particular baseline-risk scenario can be combined with the trial odds ratio to compute a causally applicable RD (or its reciprocal, the NNT).
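A short sketch of that computation. The control and treated risks (0.098 and 0.040) come from the Laupacis et al. excerpt quoted above and are used only to back-calculate an odds ratio; the 3% baseline risk for the second patient is invented, and the calculation takes for granted that the odds ratio is the quantity that carries over, which is itself an assumption rather than a theorem.

```python
# Sketch: derive a patient-specific risk difference (and NNT) from a trial
# odds ratio plus that patient's own estimated baseline risk.
# The 0.098 / 0.040 risks come from the Laupacis et al. excerpt above;
# the 0.03 baseline risk is a made-up example patient.

def odds(p: float) -> float:
    return p / (1 - p)

def risk_from_or(p0: float, odds_ratio: float) -> float:
    """Risk under treatment implied by a baseline risk and an odds ratio."""
    o1 = odds_ratio * odds(p0)
    return o1 / (1 + o1)

# Odds ratio implied by the quoted trial arms (patients without target-organ damage).
trial_or = odds(0.040) / odds(0.098)  # roughly 0.38

for p0 in (0.098, 0.03):
    p1 = risk_from_or(p0, trial_or)
    rd = p0 - p1
    print(f"baseline {p0:.3f}: treated risk {p1:.3f}, RD {rd:.3f}, NNT {1 / rd:.0f}")

# baseline 0.098 reproduces the quoted ~6-per-100 benefit (NNT ~17);
# baseline 0.030 gives an RD of ~0.018 (NNT ~55) from the very same trial result.
```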

Correct, the statistical limitations of NNTs are numerous but their fatal flaw in my practice is that they are unnecessary and cumbersome. Why would I need to invert the absolute risk reduction and navigate the conceptual maze of that inversion instead of simply spending time explaining the absolute risk reduction? I have never met a patient who is familiar with NNTs but not risks in their daily lives.

@Stephen has written an excellent post here on why the inferential gymnastics of NNTs are not worth the mental space for clinicians or patients.

2 Likes

Suhail, I’m not disputing the practice of combining an individual patient’s estimated baseline risk for an event with an RCT-derived OR in order to estimate the personal absolute risk reduction he might expect with therapy. Rather, I’m disputing the idea that a group-level ARR, as derived from an RCT, can be “translated/transported” (or whatever the right word is) to an individual patient in the postmarket setting to estimate his personal ARR.

Within-arm event rates will differ between trials, as will the between-arm difference in event rates, even when the trials are designed identically. Therefore, the NNT calculated from these identically-designed trials plausibly might end up differing substantially (!)

I’m highlighting the fact that the ARR derived from an RCT is a group-level assessment of absolute risk reduction that should NOT be conflated with the absolute risk reduction that an individual patient in the postmarket setting might experience. An RCT-derived ARR reflects the covariate distribution of its underlying convenience sample, and this distribution can change from trial to trial. Therefore, “NNT” is highly situational and dependent on features of the recruited convenience sample. As such, NNT should NOT be “transported” to individual patients in the postmarket setting for the purpose of therapeutic decision-making (?)

1 Like

Erin, I cannot see how there could be a group-level ARR when, by definition, it is derived from the difference between two groups. Perhaps you mean group-level risk?

1 Like

One other general problem is that NNT envisions a group of patients almost as if they are competing with each other to see who wins. I don’t like any measure that involves that. I’ve heard patients say, when told they have 5 chances out of 100 of surviving 2 months, “I know I’m one of the 5”. We need to stick with absolute risk in [0, 1] as well as life expectancies IMHO.

1 Like

Not quite sure what you’re getting at? There’s a difference in the event rate between arms of an RCT. The fact that this difference will change from trial to trial, even if the trials are designed identically, makes the notion that there’s a universal “NNT” for a given therapy very problematic (?) Contrast these trial-based risk differences with the absolute reduction in risk that we might estimate for an individual patient in our clinic if he were to be offered treatment in the post-trial setting. I’m just saying that these trial-level risk differences will not necessarily translate to the risk reduction that our individual patient might expect (yet this is what the “NNT” is suggesting we should do…).

Okay, I will try to be very clear. There is, on average, a risk in each arm and, on average, an RD in each trial. This RD will vary from trial to trial based on each trial’s (average) baseline risk, because the RD is variation-dependent. To avoid that, a trial must deliver a variation-independent effect (the OR), from which we derive the clinically relevant RD based on our estimated patient-related baseline risk. We are not talking about risk reduction here (which is a different ball game) but rather risk difference (RD). So I see no issues with absolute measures such as the RD so long as they are computed properly. No one would report an RD alone in an RCT; it will always be paired with a relative effect measure, and the RD, if reported, is applicable to patients like those in the trial only.

1 Like