How do we mistake biological plausibility for clinical relevance?

Research in Critical Care Medicine is increasingly challenging. As an intensivist, I have grown used to seeing hopelessly weak hypotheses receive complicated statistical treatments. The field has focused more on better data analysis than on better hypotheses.

I am studying the pitfalls of translating biological plausibility into clinical relevance, and ultimately into the estimated treatment effect in an RCT. Unfortunately, academia has little or no space for such a topic, so I decided to discuss it on Substack, in short and provocative posts.

The problem has two parts. The first is the causal chain supporting the RCT’s causal assumption. Here I examine studies with confused hypotheses that unsurprisingly yielded “negative” results.

The Land of Irrelevance #1 - The PReVENT Study

The Land of Irrelevance #2 - The MENDS2 Study

The second part is the marginal utility of the hypothesis given the clinical scenario, especially in light of what I call the Additive Paradigm. In this post, I lay out a framework:

How do we mistake biological plausibility for clinical relevance?

One good example is the “Chloride Case”, where I discuss how a fragile hypothesis was tested in large RCTs and even summarized in a meta-analysis. This weak hypothesis received disproportionate attention.

Finally, there is an example of how the misuse of statistics can produce a “significant” result for a terrible hypothesis.

Irrelevant Yet Significant: The DEFENDER study (part 1/2)

I think this is a major problem in biomedical research, and I would truly appreciate input from this respected community of biostatisticians. I believe this is a problem people couldn’t see coming. How can we solve it?


I’m always glad to see problems in critical care medical research discussed and I hope you get comments from several experts. One minor comment is that hypotheses per se are not as useful as they seem. I’d rather have good questions, or better still, physiologic or patient-status measurements that are worth measuring and worth acting on. Estimation is often more important than hypothesis formulation.


Can you elaborate? Specifically, can you point to an RCT that showed therapeutic efficacy, where selection of the therapy being tested wasn’t founded in a well-formulated hypothesis about mechanism of disease development?

I can see the importance of asking good questions as a prerequisite for designing an RCT. But it seems like a solid hypothesis is also a prerequisite…

For example:

Questions: “What is the distribution of clinical trajectories among patients with pneumococcal pneumonia who are sick enough to be admitted to the ICU?”; “Among patients who die in the ICU following a diagnosis of pneumococcal pneumonia, what are the major proximate causes of death?” Examples might include inability to wean off mechanical ventilation, intractable hemodynamic instability…

Hypothesis: “Given that X has been identified as the main proximate cause of death among patients with pneumococcal pneumonia, and considering the fact that treatment Y addresses this mechanism, we propose that treatment Y, applied to patients with pneumococcal pneumonia, might reduce the risk of death.”


The trial objective could be reframed as follows: the primary objective of this RCT is to estimate the risk of the outcome under treatment Y relative to comparator treatment Z for the treatment of pneumococcal pneumonia. Z could be a placebo, an active control, etc. (Frank and others might prefer the odds ratio to the relative risk, but that is a peripheral technical point.)

A real-life example was the FDA’s criteria for emergency use authorization of the original covid-19 vaccines back in the fall of 2020. The pre-specified efficacy criterion stated in the guidance for industry “Development and Licensure of Vaccines to Prevent COVID-19” was “a point estimate for a placebo-controlled efficacy trial of at least 50%, with a lower bound of the appropriately alpha-adjusted confidence interval around the primary efficacy endpoint point estimate of >30%”. This is an estimation-based, rather than null hypothesis-based, framework for evaluating medical products.
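To make that criterion concrete, here is a rough sketch of how the two thresholds could be checked from trial counts. All counts below are hypothetical, and the Wald interval on log(RR) is a textbook approximation; the actual regulatory analyses used more careful methods and alpha adjustment for interim looks.

```python
import math

def vaccine_efficacy_ci(cases_vax, n_vax, cases_placebo, n_placebo, z=1.96):
    """Estimate vaccine efficacy VE = 1 - RR with a Wald CI on log(RR).

    Illustrative sketch only: real submissions use exact or Bayesian
    methods and alpha-adjusted intervals, not a plain Wald interval.
    """
    rr = (cases_vax / n_vax) / (cases_placebo / n_placebo)
    # Delta-method standard error of log(RR)
    se = math.sqrt(1 / cases_vax - 1 / n_vax + 1 / cases_placebo - 1 / n_placebo)
    rr_lo = math.exp(math.log(rr) - z * se)
    rr_hi = math.exp(math.log(rr) + z * se)
    # VE bounds flip: the lower VE bound comes from the upper RR bound
    ve, ve_lo, ve_hi = 1 - rr, 1 - rr_hi, 1 - rr_lo
    meets_criterion = ve >= 0.50 and ve_lo > 0.30
    return ve, ve_lo, ve_hi, meets_criterion

# Hypothetical counts: 10 cases among 20,000 vaccinated,
# 100 among 20,000 placebo recipients
ve, lo, hi, ok = vaccine_efficacy_ci(10, 20000, 100, 20000)
```

The point is that the decision rule is stated entirely in terms of an estimate and its interval; no point-null hypothesis appears anywhere.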

This thought is related to John Tukey’s flippant comment (Stat. Sci., 1991): “All we know about the world teaches us that the effects of A and B are always different–in some decimal place–for any A and B. Thus asking ‘Are the effects different?’ is foolish.” I suspect that Frank is advocating for us to change our question to “how different are they, and how precisely do we know this”? Answering this question would provide a richer set of information for decision makers than the mere acceptance or rejection of a point-null hypothesis. This argument generalizes to other kinds of hypotheses, such as equivalence, non-inferiority, etc., which can often be re-cast in an estimation framework.

I hesitate to make blanket statements. There are well-known examples in physics where Tukey’s comment is decisively wrong (e.g., neutrino mass).

I couldn’t find the original 2020 FDA document, so the quote is taken from

The Tukey quote is from


Thanks. I wasn’t advocating for any particular type of experimental design (e.g., null hypothesis testing). Rather, I was just saying that RCTs that have been able, historically, to identify therapeutic efficacy, have usually been grounded in a solid theory of disease mechanism/causation (e.g., plaque rupture causes STEMI).

In situations where disease mechanism and/or trajectory are poorly understood or highly complex, it becomes very challenging to design an experiment that will be capable of discerning the efficacy of any therapeutic intervention (separating signal from noise). I don’t know anything about economics, but, given the complexity of economies, RCTs in this field probably face similar challenges…

As noted in the original post, critical care research doesn’t usually deal with simple/linear causal pathways from “root” disease to outcome. Rather, it deals with complex webs/chain reactions of potentially life-threatening events that were triggered by the root disease. It’s challenging to figure out where in the web to intervene in order to discernibly impact important outcomes. Asking granular questions about disease mechanisms/trajectory will be important in mapping the web; a lot of important work has likely already been done in this regard.

I think the author of the original post is saying that, once the web is mapped, the “structural” threads will need to be distinguished from the extraneous threads (e.g., by asking “What are the most common proximate causes of death?”) and to focus therapeutic development on those threads (?) And only then, after all this preliminary work, will it be possible to design RCTs that stand any chance of separating therapeutic efficacy signals from noise.


This is a very important question. The issue of marginal utility.

These are the fundamental overlapping problems which have caused “critical care RCT nihilism”.

  1. Marginal utility
  2. Too many patients who will die regardless, making N insufficient.
  3. Legacy lumped critical care “Synthetic Syndromes”, such as ARDS and sepsis, that contain a mix of different diseases, many of which lack the targeted driver.

As an advocate participant in oncology trials in the National Clinical Trials Network (NCTN), I often wondered if the instinct to be a good colleague held back frank negative feedback on proposed trial concepts.


Simulating studies before running them might help a lot. I’m looking at an $11M trial right now that (judging from its ClinicalTrials.gov entry) probably ought to spend several percent of its budget on up-front simulation studies. Isn’t there a @f2harrell quote (adjusted for inflation) along the lines of,

A $110 analysis can make an $11M trial worth $1,100.
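For what it’s worth, even a crude Monte Carlo sketch makes the case for pre-trial simulation. The design below (a two-arm trial with a binary outcome, analyzed with a simple two-proportion z-test) and all the event rates in the usage example are hypothetical; a real pre-trial simulation would also model dropout, interim analyses, covariate adjustment, and a range of plausible effects.

```python
import math
import random

def simulate_power(p_control, p_treat, n_per_arm, n_sims=2000, seed=1):
    """Monte Carlo power of a two-arm trial with a binary outcome,
    analyzed with a two-proportion z-test at the two-sided 5% level.
    A sketch under simplified assumptions, not a full trial simulation."""
    random.seed(seed)
    z_crit = 1.96  # two-sided 5% critical value
    rejections = 0
    for _ in range(n_sims):
        # Simulate event counts in each arm
        x1 = sum(random.random() < p_control for _ in range(n_per_arm))
        x2 = sum(random.random() < p_treat for _ in range(n_per_arm))
        p1, p2 = x1 / n_per_arm, x2 / n_per_arm
        # Pooled-proportion standard error for the difference
        p_pool = (x1 + x2) / (2 * n_per_arm)
        se = math.sqrt(2 * p_pool * (1 - p_pool) / n_per_arm)
        if se > 0 and abs(p1 - p2) / se > z_crit:
            rejections += 1
    return rejections / n_sims

# Hypothetical scenario: 30% vs 35% mortality, 500 patients per arm
power = simulate_power(0.30, 0.35, 500)
```

Running a few dozen such scenarios before committing to a sample size costs essentially nothing next to an $11M budget, and it exposes designs whose N is insufficient (point 2 in the list above) before the money is spent.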

I read the Chloride Case with some interest, as a longtime fan of Stewart. But what I’m missing there and in your post above here is the alternative. Can you point to some pressing intensivist questions that do need further investigation in RCTs? Why do these not excite the community?
