Generalizability vs. Transportability in Trials

ESMD · March 2, 2026, 2:38pm

Suhail, I think you’re trying to create a disagreement where there is none. I agree with everything you’ve written here! But “NNT” = 1/ARR, and ARR will change from trial to trial, so I’m just saying that the notion that there’s “one true NNT to rule them all” for a given therapy is misguided. Therefore, we shouldn’t be using “NNT” to decide whether it’s worthwhile to apply a therapy to an individual patient…
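To make the arithmetic concrete, here is a minimal sketch of how NNT = 1/ARR moves with the control-group risk. The baseline risks and the fixed relative risk of 0.8 are hypothetical values chosen for illustration; holding the relative risk constant across settings is itself an assumption, not something established in this thread.

```python
def nnt(control_risk: float, relative_risk: float) -> float:
    """NNT = 1 / ARR, with ARR = control risk minus treated risk."""
    treated_risk = control_risk * relative_risk
    arr = control_risk - treated_risk
    if arr <= 0:
        raise ValueError("ARR must be positive for an NNT to be defined")
    return 1.0 / arr


# Hypothetical baseline risks; the relative risk of 0.8 is held fixed
# purely for illustration (an assumption, not a result from any trial).
nnt_by_baseline = {risk: nnt(risk, 0.8) for risk in (0.05, 0.20, 0.40)}
for risk, value in nnt_by_baseline.items():
    print(f"control risk {risk:.2f} -> NNT about {value:.1f}")
```

Under these assumed inputs the same therapy yields very different NNTs depending only on the baseline risk, which is the sense in which no single NNT belongs to the therapy itself.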

s_doi · March 2, 2026, 3:10pm

Okay, now I get your point: the RD (aka ARR) and its transforms are point-of-care measures of applicability. They are not considered trial-specific measures even when reported in trials, and are therefore not usually meta-analyzed except in error.

ESMD · May 30, 2026, 11:33pm

In a recent three-part series of blog posts, statistician Maggie Qian discusses the series of articles by Robert Matthews that you have cited:

https://evidenceinthewild.com/randomisation-origin-myth/

Her writing style is refreshingly clear and incisive.

Pavlos_Msaouel · May 31, 2026, 1:28pm

Nice summary indeed but oversimplifies since many statisticians (including in the DataMethods forum) would reasonably counterargue in favor of using parametric approaches in practice while keeping the randomization framework in mind (see also discussion here).

The crystal clear telltale sign, also noted by Robert Matthews but not in the blog posts, that Fisher’s framework is indeed too often misunderstood in medical RCTs is the emergence of randomized non-comparative trials (RNCTs) and their advocacy by a surprisingly large fraction of biostatisticians. While we can debate the merits of pure randomization-based inference, there is no convincing excuse in favor of RNCTs.

Pavlos_Msaouel · June 1, 2026, 5:56pm

Outstanding @Stephen paper on the topic of randomization versus parametric inference coincidentally just pointed out by @R_cubed in this thread.

evidenceinthewild · June 3, 2026, 7:28am

Thank you, Pavlos. The oversimplification point is fair, and I take it on board. The blog format does push toward cleaner narratives than the debate warrants, and I probably underplayed how reasonable the parametric counterargument is in practice.

The RNCT observation is sharp, and I wish I’d included it. You’re right that it’s noted by Matthews but I missed its value as a diagnostic. It cuts through a lot of the abstract framework debate by posing a concrete question: if randomization’s inferential purpose requires a comparison group, what exactly is a randomized non-comparative trial doing? The fact that this design exists, is advocated, and gets published reveals something about how the justification for randomization is being understood, or misunderstood.

My one hesitation about calling it a crystal clear sign is that the motivation behind RNCTs isn’t always straightforwardly inferential confusion. Your linked thread suggests they are often recommended at oncology workshops when there aren’t enough resources to power a comparative trial — randomization retained as a kind of procedural legitimacy device even when comparison is abandoned. That might be a different kind of error: conflating the procedural and inferential roles of randomization, rather than a direct misreading of Fisher. Though perhaps that distinction doesn’t rescue the practice much. If practitioners can’t articulate why they are randomizing when there is no comparison, that itself reflects exactly the misunderstanding of Fisher’s framework you’re describing, and may even strengthen rather than soften your point.

I’ll look more carefully at the RNCT thread; it seems like a natural extension of the series.

Pavlos_Msaouel · June 3, 2026, 2:20pm

Correct. See also related presentation here to the International Society for Biopharmaceutical Statistics (ISBS) with practical examples of how understanding Fisher’s inferential machinery can lead to practice-changing RCTs. Connection with information theory here (more extended discussion here). We have also shown here that the quasirandomness fallacy occurs in more than half of oncology phase 3 RCTs and that it is increasing over time. Concurrently, the incidence of RNCTs is also increasing.

R_cubed · June 3, 2026, 10:57pm

I find it strange, even suspicious, that balanced/minimized designs are not recommended by biostatisticians in these particular cases of scarce resources.

Fisher and Gosset debated the merits of balanced vs randomized designs nearly 100 years ago. They were more recently discussed in the CONSORT guidelines in 2010. The first minimization algorithms are over 50 years old, with improvements being made more recently. Even if one has a preference for large randomized trials, why don’t controlled allocation designs such as minimization get used more in these resource-constrained scenarios?

Related thread
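For readers unfamiliar with minimization, a minimal sketch of a Taves-style marginal-totals rule follows. The function name, the factor names, and the patient records are hypothetical, and breaking ties at random is one common choice rather than a required part of the method; production implementations often add a random element to every assignment.

```python
import random
from collections import defaultdict


def minimization_assign(patient, counts, rng, arms=("A", "B")):
    """Assign `patient` (dict of factor -> level) to the arm that currently
    holds the fewest enrolled patients sharing the patient's factor levels."""
    scores = {
        arm: sum(counts[arm][(factor, level)] for factor, level in patient.items())
        for arm in arms
    }
    best = min(scores.values())
    # Ties are broken at random (an illustrative choice).
    arm = rng.choice([a for a in arms if scores[a] == best])
    for factor, level in patient.items():
        counts[arm][(factor, level)] += 1
    return arm


# Hypothetical enrolment sequence with two prognostic factors.
rng = random.Random(1)
counts = {arm: defaultdict(int) for arm in ("A", "B")}
patients = [
    {"sex": sex, "stage": stage}
    for sex in ("F", "M")
    for stage in ("early", "late")
] * 3
assignments = [minimization_assign(p, counts, rng) for p in patients]
print(assignments)
```

The rule only looks at marginal totals per factor level, which is why it can balance several prognostic factors at once in samples too small for stratified randomization.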

simongates · June 9, 2026, 1:01pm

I’m involved in lots of rare disease trials and have had the same thought. I’m not aware of any studies of the potential benefits of balanced/deterministic allocation design versus randomised design in the context of small trials (apart from a trial referred to by Pavlos in a different thread on this forum).

simongates · June 9, 2026, 1:31pm

Disturbing. Anecdotally, I do hear oncologists talking about treatment effects in isolation (rather than in a comparative way) a lot. It seems worse in oncology than elsewhere, in my (biased and limited) experience.

Pavlos_Msaouel · June 9, 2026, 1:45pm

Yes, and unfortunately it mainly originates from oncology biostatisticians. They argue that everything discussed here is just philosophy…

R_cubed · June 9, 2026, 2:00pm

Donald Taves has a number of relatively recent papers on minimization techniques that discuss addressing selection bias.

Taves D (2012). Rank-Minimization with a two-step analysis should replace randomization in clinical trials. Journal of Clinical Epidemiology, 65, 3-6. (link)

One caution: I’m not convinced his discussion of the data analysis is completely correct, in that he suggests an analysis adjusting for all characteristics that were balanced can reduce power “if the variables are not related to the outcome”. He makes this argument based upon significance tests, i.e., he would condition only upon variables that are “significant” and drop the others. I doubt this is correct.

Nathan Kallus presented a paper to the Royal Statistical Society based on a new method he described as Kernel Allocation, which was his doctoral research project. To examine the data, he derived a bootstrap hypothesis test.

I haven’t explored the mathematical consequences myself, but I’d think either a bootstrap test or a permutation test would be valid for a minimized design, though I suspect the frequentists would object to the permutation method.

f2harrell · June 10, 2026, 11:24am

On a related note, power is reduced in ANCOVA when one maximizes the collinearity between treatment and a set of potential covariates that are indeed weak. In that setting the nonzero off-diagonal terms in the X'X matrix cause some variance inflation in the row and column of (X'X)^{-1} corresponding to treatment and this element is proportional to the variance of the treatment effect estimate.
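A small numerical sketch of this variance inflation, simplified to a single covariate: the design below is constructed (not sampled) so that the in-sample correlation between treatment and the covariate is exactly r. The sample size of 40 and the values of r are hypothetical, and the function name is illustrative.

```python
import numpy as np


def treatment_variance_factor(r: float, n: int = 40) -> float:
    """Return the (treatment, treatment) element of (X'X)^{-1} for
    X = [1, t, x], where corr(t, x) equals r exactly in-sample."""
    t = np.repeat([0.0, 1.0], n // 2)   # balanced two-arm indicator
    t_std = (t - t.mean()) / t.std()    # standardized treatment, values -1 / +1
    z = np.tile([-1.0, 1.0], n // 2)    # mean zero, unit variance, orthogonal to t
    x = r * t_std + np.sqrt(1.0 - r**2) * z
    X = np.column_stack([np.ones(n), t, x])
    return float(np.linalg.inv(X.T @ X)[1, 1])


base = treatment_variance_factor(0.0)      # covariate uncorrelated with treatment
inflated = treatment_variance_factor(0.6)  # covariate collinear with treatment
print(base, inflated, inflated / base)
```

In this one-covariate case the inflation factor is 1 / (1 - r^2), so the treatment-effect variance grows with collinearity regardless of whether the covariate predicts the outcome; if it is only weakly prognostic, there is little residual-variance reduction to offset that.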

R_cubed · June 10, 2026, 1:40pm

I think that suggests that matching is indeed powerful when done correctly, but one must carefully consider which variables to match on, and stick to them in the analysis after the data are collected.

Pavlos_Msaouel · June 10, 2026, 1:58pm

This is nicely codified via the causal diagram shown in Figure 6, Model 9 here.

f2harrell · June 10, 2026, 3:12pm

It depends on what you mean by matching. Most matching algorithms create the need for arbitrary statistical analysis choices and lose power by not accounting for outcome heterogeneity.

R_cubed · June 10, 2026, 9:26pm

I was thinking specifically of balanced allocation experimental designs such as minimization. The entire literature on this issue is rather frustrating to read, especially since the development of resampling methods.

AFAICT, the entire dispute since Fisher and Gosset has been about the validity of using a model such as Student’s t distribution to examine the data from a controlled but non-random experiment. The biggest complaint (which isn’t obvious) is that the data from such experiments have thinner tails than would be expected from a normal distribution.

The following discusses the Gosset vs Fisher design of experiments debate from the perspective of Gosset. It is more of a historical narrative than a mathematical statistics paper, but it does touch upon the debate. Figure 1 (page 33 in the PDF) shows power curves for balanced vs. randomized designs.

For large (and economically worthwhile) effects, balanced experiments analyzed by the t-test had greater power to detect effects. Power to detect effects declined as the effect size got smaller.

Ziliak, S. T. (2014). Balanced versus randomized field experiments in economics: why WS Gosset aka “Student” matters. Review of Behavioral Economics, 1(1-2), 167-208. (PDF)

Correct me if I’m wrong, but to me, this debate is moot because we can simply resample from the data and compute a valid empirical null distribution, as Nathan Kallus discussed in his paper. Whether you want confidence distributions or likelihood functions, resampling techniques permit computation of statistical summaries that satisfy a wide audience.

The reduction in sample size for a single experiment that wishes to account for N prognostic factors permits the possibility of actually repeating the protocol as a form of cross validation and builds in redundancy and at least error detection, as the information theorists and communication engineers would think about it.

Essentially, I would budget for a valid randomized design, but conduct 3 smaller, balanced experiments, and then synthesize them using meta-analytic techniques.

This might not be possible for rare disease, but balanced designs still maximize the available information to the experimenter.

All the powerful tools that have been developed, such as bias and sensitivity analysis as advocated by @Sander and the use of permutation techniques to detect bias as described by Rosenbaum in this old article, are now available with the vast increases in computing power.

Rosenbaum, Paul R. (1989) “On Permutation Tests for Hidden Biases in Observational Studies: An Application of Holley’s Inequality to the Savage Lattice.” Ann. Statist. 17 (2) 643-653, June 1989. (link)

Further reading:

Saint-Mont calculates the sample sizes needed for randomization to balance factors (binary through continuous) with varying degrees of frequency. The more common a prognostic factor is, the larger the sample needs to be in order for randomization to account for it. Groups formed by randomization aren’t really exchangeable until hundreds of observations have been collected.

evidenceinthewild · June 10, 2026, 9:40pm

This connects directly to something I posted yesterday: Randomisation in the Wild Part 4: The Permutation Test Nobody Ran

The short version: I went back to the eteplirsen trial (12 patients, 3 arms) and ran the permutation test that Fisher’s framework prescribes, but that nobody ran at the time. The published MMRM found no significant 6MWT difference (p ≈ 0.56 for 50 mg/kg vs placebo). The exact permutation test over all C(8,4) = 70 possible allocations returns p = 0.029 for the same comparison, and p = 0.005 for the mITT population (excluding the two patients who lost ambulation).

Same data. Same patients. Same outcome. Same timepoint. A 20-fold difference in p-values, entirely attributable to whether the analysis uses the randomisation or a parametric model with more parameters than the data can support.

The sensitivity analysis is mixed, and I’m transparent about that in the post. The mITT finding is robust across 1,000 plausible individual patient allocations (90% below p = 0.05, median p = 0.019). The 50 mg/kg comparison is more nuanced (median p = 0.086, ~25% below 0.05 at the strict threshold). But in no plausible allocation does the permutation test return anything near 0.56. The MMRM result was an artefact of the parametric machinery, not a description of the data.

The broader point for this thread: the Gosset-Fisher dispute about the validity of t-tests for non-random (balanced/minimized) experiments may be less practically important than it appears, because wherever the randomisation was actually performed, the permutation test is available and assumption-free. The question isn’t whether the t-distribution is appropriate for thin-tailed data from balanced designs. It’s whether anyone reaches for the tool that requires no distributional assumptions at all.
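For anyone who wants to see the mechanics, an exact two-arm permutation test of the kind described above can be sketched as follows. The outcome values are hypothetical and are NOT the eteplirsen data; the difference in means as the test statistic and the function name are likewise illustrative assumptions rather than the analysis reported in the blog post.

```python
from itertools import combinations
from math import comb


def exact_permutation_pvalue(treated, control):
    """Two-sided exact p-value for the difference in means, enumerating
    every way the pooled outcomes could have been split between arms."""
    pooled = list(treated) + list(control)
    n, k = len(pooled), len(treated)
    total = sum(pooled)
    observed = abs(sum(treated) / k - sum(control) / (n - k))
    extreme = 0
    for idx in combinations(range(n), k):
        s = sum(pooled[i] for i in idx)
        diff = abs(s / k - (total - s) / (n - k))
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            extreme += 1
    return extreme / comb(n, k)


# Hypothetical changes in an outcome for 4 treated and 4 control patients
# (made-up numbers for illustration only).
treated = [12.0, 5.0, 20.0, 9.0]
control = [-30.0, -45.0, -18.0, -60.0]
p = exact_permutation_pvalue(treated, control)
print(comb(8, 4), round(p, 4))
```

With 4 versus 4 patients there are only 70 allocations, so the smallest two-sided p-value this statistic can attain is 2/70 ≈ 0.029, which occurs when the two arms are completely separated, as in these made-up numbers.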

Johannes_Schwenke · June 11, 2026, 7:48am

I really don’t mean to offend, as the content of the post seems great. But the way this post is written rang all my AI alarm bells, and pangram supports that. I totally understand if someone wants to check their post and language using AI, but at some point (e.g., a lot of the content on linkedin) it becomes dystopian.

So far datamethods seemed to be quite unaffected by this and I think it would be a shame if more and more of the replies are AI generated (even if the content is totally on point). It would make me visit less.

As a rule, I would appreciate if users declare openly whether a post has been written by AI. I think it’s totally fine to summarize one’s own work using AI, but I think it would be good practice to be transparent about it.

f2harrell · June 11, 2026, 11:21am

This relates to a broader question, assuming MMRM implies the use of random intercepts. For frequentist models, it may not be the case that being parametric is the problem; it may instead be the case that the sampling distributions being approximated by the frequentist approach are just not good enough, i.e., p-values and confidence intervals are not always accurate.