Research in the field of Critical Care Medicine is increasingly challenging. As an intensivist, I have grown used to seeing hopelessly weak hypotheses receive sophisticated statistical treatment. The field has focused more on better data analysis than on better hypotheses.
I am studying the pitfalls in translating biological plausibility into clinical relevance, and ultimately into the treatment effect estimated in an RCT. Unfortunately, academia has little or no space for such a topic, so I decided to discuss it on Substack, in short and provocative posts.
The problem has two parts. The first is the causal chain supporting the RCT's causal assumption. Here I depict studies with confused hypotheses that unsurprisingly yielded "negative" results.
The second part is the marginal utility of the hypothesis in its clinical scenario, especially in light of what I call the Additive Paradigm. In this post, I lay out a framework:
One good example is the "Chloride Case", where I discuss how a fragile hypothesis got tested in large RCTs and even earned a meta-analysis. This bad hypothesis received disproportionate attention.
Finally, there is an example of how the misuse of statistics can produce a "significant" result for a terrible hypothesis.
I think this is a major problem in biomedical research, and I would truly appreciate any input from this respected community of biostatisticians. I believe it is a problem people didn't see coming. How do we solve it?
I'm always glad to see problems in critical care medical research discussed, and I hope you get comments from several experts. One minor comment is that hypotheses per se are not as useful as they seem. I'd rather have good questions, or better still, physiologic or patient-status measurements that are worth measuring and worth doing something about. Estimation is often more important than hypothesis formulation.
Can you elaborate? Specifically, can you point to an RCT that showed therapeutic efficacy where selection of the therapy being tested wasn't founded on a well-formulated hypothesis about the mechanism of disease development?
I can see the importance of asking good questions as a prerequisite for designing an RCT. But it seems like a solid hypothesis is also a prerequisite…
For example:
Questions: "What is the distribution of clinical trajectories among patients with pneumococcal pneumonia who are sick enough to be admitted to the ICU?"; "Among patients who die in the ICU following a diagnosis of pneumococcal pneumonia, what are the major proximate causes of death?" Examples might include inability to wean off mechanical ventilation, intractable hemodynamic instability…
Hypothesis: "Given that X has been identified as the main proximate cause of death among patients with pneumococcal pneumonia, and considering the fact that treatment Y addresses this mechanism, we propose that treatment Y, applied to patients with pneumococcal pneumonia, might reduce the risk of death."
The trial objective could be reframed as follows: the primary objective of this RCT is to estimate the relative risk of treatment Y relative to comparator treatment Z for the treatment of pneumococcal pneumonia, where Z could be a placebo, an active control, etc. (Frank and others might prefer the odds ratio to the relative risk, but that is a peripheral technical point.)
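To make the estimation framing concrete, here is a minimal sketch (the 2×2 counts are hypothetical, and the hand-rolled Wald intervals merely stand in for whatever interval method a real protocol would pre-specify):

```python
import numpy as np
from scipy.stats import norm

def rr_or_with_ci(events_y, n_y, events_z, n_z, alpha=0.05):
    """Point estimates and Wald CIs for the risk ratio and odds ratio
    of treatment Y versus comparator Z (hypothetical counts)."""
    z = norm.ppf(1 - alpha / 2)
    a, b = events_y, n_y - events_y
    c, d = events_z, n_z - events_z
    rr = (a / n_y) / (c / n_z)
    se_log_rr = np.sqrt(1 / a - 1 / n_y + 1 / c - 1 / n_z)
    odds = (a * d) / (b * c)
    se_log_or = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ci = lambda est, se: (est * np.exp(-z * se), est * np.exp(z * se))
    return (rr, *ci(rr, se_log_rr)), (odds, *ci(odds, se_log_or))

# hypothetical: 45/200 deaths on Y versus 60/200 on Z
(rr, rr_lo, rr_hi), (orr, or_lo, or_hi) = rr_or_with_ci(45, 200, 60, 200)
print(f"RR {rr:.2f} (95% CI {rr_lo:.2f}-{rr_hi:.2f}); "
      f"OR {orr:.2f} (95% CI {or_lo:.2f}-{or_hi:.2f})")
```

Either interval answers "how much better, and how precisely do we know it?" rather than a yes/no significance question.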
A real-life example was the FDA's criteria for emergency use authorization of the original COVID-19 vaccines back in the fall of 2020. The pre-specified efficacy criterion stated in the guidance for industry "Development and Licensure of Vaccines to Prevent COVID-19" was "a point estimate for a placebo-controlled efficacy trial of at least 50%, with a lower bound of the appropriately alpha-adjusted confidence interval around the primary efficacy endpoint point estimate of >30%". This is an estimation-based, rather than null-hypothesis-based, framework for evaluating medical products.
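Here is a quick sketch of how a trial's case split can be checked against that criterion, using an exact (Clopper-Pearson) interval under the simplifying assumptions of 1:1 randomization and equal follow-up; this is not the exact method any sponsor used, and the 8-vs-162 split below is just an illustration:

```python
from scipy.stats import beta

def vaccine_efficacy_ci(cases_vax, cases_placebo, alpha=0.05):
    """VE and exact CI from the case split, assuming 1:1 randomization
    and equal follow-up, so that VE = 1 - IRR with IRR = p / (1 - p),
    where p is the share of all cases that occurred in the vaccine arm."""
    x, n = cases_vax, cases_vax + cases_placebo
    p_lo = beta.ppf(alpha / 2, x, n - x + 1) if x > 0 else 0.0
    p_hi = beta.ppf(1 - alpha / 2, x + 1, n - x) if x < n else 1.0
    ve = lambda p: 1 - p / (1 - p)
    return ve(x / n), ve(p_hi), ve(p_lo)  # VE decreases as p increases

point, lower, upper = vaccine_efficacy_ci(cases_vax=8, cases_placebo=162)
meets_bar = point >= 0.50 and lower > 0.30  # the guidance's two conditions
print(f"VE {point:.1%} (95% CI {lower:.1%} to {upper:.1%}); meets bar: {meets_bar}")
```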
This thought is related to John Tukey's flippant comment (Stat. Sci., 1991): "All we know about the world teaches us that the effects of A and B are always different – in some decimal place – for any A and B. Thus asking 'Are the effects different?' is foolish." I suspect that Frank is advocating for us to change our question to "how different are they, and how precisely do we know this?" Answering that question would provide a richer set of information for decision makers than the mere acceptance or rejection of a point-null hypothesis. The argument generalizes to other kinds of hypotheses, such as equivalence, non-inferiority, etc., which can often be recast in an estimation framework.
I hesitate to make blanket statements. There are well-known examples in physics where Tukey's comment is decisively wrong (e.g., neutrino mass).
Thanks. I wasn't advocating for any particular type of experimental design (e.g., null hypothesis testing). Rather, I was just saying that RCTs that have been able, historically, to identify therapeutic efficacy have usually been grounded in a solid theory of disease mechanism/causation (e.g., plaque rupture causes STEMI).
In situations where disease mechanism and/or trajectory are poorly understood or highly complex, it becomes very challenging to design an experiment capable of discerning the efficacy of any therapeutic intervention (separating signal from noise). I don't know anything about economics, but, given the complexity of economies, RCTs in that field probably face similar challenges…
As noted in the original post, critical care research doesn't usually deal with simple/linear causal pathways from "root" disease to outcome. Rather, it deals with complex webs/chain reactions of potentially life-threatening events triggered by the root disease. It's challenging to figure out where in the web to intervene in order to discernibly impact important outcomes. Asking granular questions about disease mechanisms/trajectory will be important in mapping the web; a lot of important work has likely already been done in this regard.
I think the author of the original post is saying that, once the web is mapped, the "structural" threads will need to be distinguished from the extraneous threads (e.g., by asking "What are the most common proximate causes of death?") so that therapeutic development can focus on those threads (?) And only then, after all this preliminary work, will it be possible to design RCTs that stand any chance of separating therapeutic efficacy signals from noise.
This is a very important question: the issue of marginal utility.
These are the fundamental overlapping problems that have caused "critical care RCT nihilism":
1. Marginal utility.
2. Too many patients going to die regardless, making N insufficient.
3. Legacy lumped critical care "synthetic syndromes", like ARDS and sepsis, that contain a mix of different diseases, many of which do not have the targeted driver.
As an advocate participant in oncology trials in the National Clinical Trials Network (NCTN), I often wondered if the instinct to be a good colleague held back frank negative feedback on proposed trial concepts.
Simulating studies before running them might help a lot. I'm looking at an $11M trial right now that (judging from its ClinicalTrials.gov entry) probably ought to spend several percent of its budget on up-front simulation studies. Isn't there a @f2harrell quote (adjusted for inflation) along the lines of:
"A $110 analysis can make an $11M trial worth $1,100."
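In that spirit, here is a minimal sketch of the kind of up-front simulation I have in mind, with every number invented: a binary mortality endpoint and a treatment that only helps the fraction of enrollees who actually carry the targeted driver:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def simulated_power(n_per_arm, base_mort, arr_in_responders, frac_responders,
                    n_sims=5000, alpha=0.05):
    """Power of a two-arm mortality trial when only a fraction of the
    enrolled 'syndrome' actually has the driver the treatment targets."""
    hits = 0
    for _ in range(n_sims):
        # control arm: everyone at baseline mortality
        ctrl = rng.random(n_per_arm) < base_mort
        # treated arm: responders get the absolute risk reduction, others don't
        responder = rng.random(n_per_arm) < frac_responders
        p_treat = np.where(responder, base_mort - arr_in_responders, base_mort)
        trt = rng.random(n_per_arm) < p_treat
        # two-proportion z-test on the observed mortality rates
        p_pool = (ctrl.sum() + trt.sum()) / (2 * n_per_arm)
        se = np.sqrt(2 * p_pool * (1 - p_pool) / n_per_arm)
        if se > 0 and abs(ctrl.mean() - trt.mean()) / se > norm.ppf(1 - alpha / 2):
            hits += 1
    return hits / n_sims

# a 10-point ARR, but only 30% of enrollees have the targeted driver
print(simulated_power(n_per_arm=500, base_mort=0.35,
                      arr_in_responders=0.10, frac_responders=0.30))
```

With these made-up inputs the power comes out somewhere around 15-20%: the trial is far more likely to read "negative" than to detect a drug that genuinely works in its responders.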
I read the Chloride Case with some interest, as a longtime fan of Stewart. But what I'm missing, both there and in your post above, is the alternative. Can you point to some pressing intensivist questions that do need further investigation in RCTs? Why do these not excite the community?
Thank you for sharing very interesting insights.
I want to ask your opinion regarding the use of inotropes as part of goal-oriented therapy in patients with acute heart failure and cardiogenic shock. More specifically, isn't the current approach to using inotropes "too liberal"?
Sorry for the late answer; I lost sight of the app, and I have notifications off.
I think we don't need more RCTs in critical care medicine.
We need more observational studies to describe the syndromes we are treating. As commented above, once we have causality models as strong as the coronary-obstruction model of myocardial infarction, we can seek differential treatment effects. We are wasting money and careers on half-baked hypotheses that serve not to advance knowledge but to keep the business going.
Thank you. I think there is a confusion here: we intensivists mistake supportive measures for treatments.
Inotropes are mere supportive measures to keep the patient alive while we treat the cause of the shock. That's why the effect of inotrope choice is marginal, as is the effect of using an intra-aortic balloon pump, ECMO, etc.
I don't think the indication is too liberal, because it is only a supportive measure; it should be used as needed. Provided you keep the patient lucid, warm, and urinating, any supportive approach will do. The same goes for ECMO. I'd only use fewer inotropes in a context of plentiful ECMO access.
This has been the view of anesthesiologists, who in the operating room routinely deal with hypotension using a range of inotropes, alpha agonists, fluids, and I:E ventilation adjustments, guided by heuristics and without RCT data.
We are beginning to understand that profound heterogeneity of the drivers (the targets of the treatment) renders broadly applied RCTs negative or, if positive, nonreproducible.
This is especially true when an unknown subset may be harmed.
Other than narrowing the set under test, I'm not sure how this problem with RCTs for broad conditions can be solved.
It would be great to see more input about this, as we face a real dilemma in the profoundly heterogeneous environment of critical care, where, for an unknown portion of patients, nothing that is done will save them. However, for others, there must be a best treatment definable by RCT, if we can figure out how to do that in this environment.
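A back-of-the-envelope illustration of that dilemma, with invented proportions: if a treatment helps one hidden subset and harms another, the arm-level averages an RCT observes can be nearly indistinguishable:

```python
# Hypothetical mixture: 30% of enrollees benefit (10-point absolute risk
# reduction), 20% are harmed (10-point increase), 50% are unaffected.
base_mortality = 0.35
treated_mortality = (0.30 * (base_mortality - 0.10)
                     + 0.20 * (base_mortality + 0.10)
                     + 0.50 * base_mortality)
print(f"{treated_mortality:.3f} treated vs {base_mortality:.3f} control")
# -> 0.340 vs 0.350: a 1-point net "effect" most realistic trials will call null
```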
An example of this is the large RCTs of different arterial oxygen targets (again supportive, not targeted, treatment). Their results have been all over the map, and the latest suggests the target does not matter. Of course, it might matter in a subset, but who knows. Another massive multicenter study, like the one presently in progress, is not going to provide information useful for the instant patient under care.
We really have to address this. RCT nihilism in the ICU is something that evolved over the past 15 years; before that, we all trusted the critical care RCT.
The trite saying that "Given a grant, you can do an RCT on a ham sandwich!" would have seemed silly 20 years ago.
Now, in the ICU, it has proven to be a useful metaphor.
The authors suggest that quality control and consistency of ICU care need to be addressed before there can be any reasonable hope of improving the track record of critical care RCTs:
"…reliable clinical practice and meaningful outcome assessments are also necessary prerequisites to perform thoughtful experiments (RCTs) to determine causality and evaluate the effects of novel interventions."
Since it has been difficult to show, using RCTs, that intervening on one thread within a complex ICU "web" of care can improve outcomes, maybe ICU RCTs would do better to focus on testing the efficacy of quality-control-oriented packages of supportive care, i.e., interventions that target multiple threads in the web (?)
If you reflect on your years of ICU experience, I bet that you could identify some cardinal sins committed routinely by suboptimally experienced physicians. Maybe some residents on overnight ICU shifts tend to react the wrong way to certain changes in a patient's ventilatory status, misjudge a patient's volume status, miss an important clinical sign of shock, fail to appreciate the importance of a worsening lab value, fail to recognize the early signs of an important adverse drug reaction, attempt a new procedure with insufficient oversight (resulting in iatrogenic harm), etc. The impact of any one of these types of mistakes on a patient's outcome might not be discernible, but, if they were to occur in clusters during the care of a given patient, they could plausibly lead to a worse outcome than might otherwise have occurred if some type of preventative quality-control algorithm had been in place.
I expect that there are lots of studies looking at quality control within ICUs, but I'm not sure how many published RCTs have examined the efficacy of specific quality-control protocols/packages (?)
I think it is a pertinent point. I have been arguing that treatment effects are marginal in nature, but one should be aware that unintended effects and complications are not: their effects are additive. We have learned to reduce mortality by avoiding complications; I think it's the major evolution of the last 20 years of critical care, and the only way to move forward in the absence of well-described syndromes.
I can't help but think back to my own ICU exposure in the 1990s. Each night, there was a single medical resident (often second-year) manning a 25-30 bed ICU, plus taking ICU admissions through the ER, plus responding to pre-arrests on the hospital wards. A critical care fellow was accessible by phone overnight and would come into the hospital if needed. Needless to say, this arrangement was absolutely mortifying and completely inadequate. It was common for the hapless overnight ICU resident to have to manage several cardiac arrests simultaneously. It's not a stretch to imagine that the overnight care of other ICU patients might have suffered in this environment…
The question of whether overnight in-house availability of intensivists might improve patient outcomes has been asked before. For example, this 2013 NEJM publication tried to assess whether patients who were monitored overnight by an intensivist on their night of admission fared better than those who were cared for by medical residents on their night of admission [note that the ICU in this study had three medical residents each night (not one) overseeing a 24-bed ICU]:
The authors didn't identify a meaningful improvement in outcomes for the patients who were admitted and monitored on their first night by an intensivist. However, this study didn't address the very important question of whether consistent night-time intensivist coverage would, over time, improve the outcomes of all ICU patients.
Unless we start training a lot more intensivists in a hurry, proving that ICU patients do better, on the whole, when intensivists are available in person 24/7 might just be an exercise in frustration. But, as Stephen Senn noted (https://onlinelibrary.wiley.com/doi/full/10.1002/sim.6739):
"One of Deming's important lessons to managers was that they had to understand the origins of variation in a system in order to be able to intervene effectively to improve its quality."
Could suboptimal in-person access to intensivists, if pervasive, be such a huge source of care-quality variability, and therefore statistical "noise", within ICU settings that it effectively dwarfs any signals of efficacy in critical care therapeutic trials (?)
Interesting point. I've never thought of how that might affect an RCT (except in the abstract).
Unlike many more straightforward settings, an RCT in the ICU with, for example, survival as an endpoint is subject to outside factors that go well beyond compliance.
Care of the critically ill is complex, and everyone in the hospital knows that survival in critical care is also a direct function of the competence, experience, and commitment of the team.
Although no one likes to admit it, in the hospital there are the A physicians and the B physicians.
I'm not sure how that could be randomized or rendered reproducible.
It's great to look at an example of how biologic plausibility is studied in sepsis. Here they randomized 20 cases meeting the SOFA thresholds first published in 1996 (now part of the Sepsis-3 definition).
Note that 40% of the placebo cases had bacteremia (a predictor of death) versus 10% of the treated cases. Obviously, randomization of 20 cases has its limits!
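Those limits are easy to quantify. If 5 of the 20 patients were bacteremic (the 4 + 1 implied by those percentages), simple 10:10 randomization produces a split at least that lopsided surprisingly often; a sketch using the hypergeometric distribution:

```python
from scipy.stats import hypergeom

# 20 patients, 5 bacteremic, randomized 10:10.
# X = number of bacteremic patients landing in one given arm.
X = hypergeom(M=20, n=5, N=10)
p_lopsided = X.cdf(1) + X.sf(3)  # <=1 in one arm, i.e. >=4 in the other
print(f"P(4-1 or 5-0 split) = {p_lopsided:.2f}")  # about 0.30
```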
Hospital mortality was 40% in both the treated and control arms: 8 of the 20 patients in the RCT died. This is an abysmal death rate for septic shock.
Yet this is the type of article that gets published couched as a successful pilot.
The point here is that this is an example of biologic plausibility tested by an RCT applied to a cognitive bucket containing many complex diseases caused by infection.
Almost half died, but the trialists reported a statistically shorter duration of norepinephrine administration in the vitamin B12-treated population.
Does anyone have a recommendation on how the "vitamin B12 for sepsis vasoplegia" hypothesis might be better tested?
I think this is another instance of confusing supportive measures with actual treatments for diseases. The patient is not dying of a lack of cyanocobalamin, hence there is no sense in testing for mortality. They should test another outcome, for example a vasoplegia surrogate, and avoid leaping from less vasoplegia to less mortality.
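To put rough numbers on that point, here is a sample-size sketch comparing a mortality endpoint with a continuous vasoplegia surrogate; the 40%-to-35% mortality difference and the 0.5-SD surrogate effect are both assumptions invented for illustration:

```python
from statsmodels.stats.power import NormalIndPower, TTestIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Mortality endpoint: detect 40% -> 35% (already generous for a support measure)
h = proportion_effectsize(0.40, 0.35)  # Cohen's h
n_mortality = NormalIndPower().solve_power(effect_size=h, power=0.80, alpha=0.05)

# Surrogate endpoint (e.g., vasopressor-free hours), assumed 0.5-SD effect
n_surrogate = TTestIndPower().solve_power(effect_size=0.5, power=0.80, alpha=0.05)

print(f"per-arm N, mortality endpoint:  {n_mortality:.0f}")  # roughly 700+
print(f"per-arm N, surrogate endpoint: {n_surrogate:.0f}")   # roughly 60-70
```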
Moreover, studies like this are contaminated by cognitive bias from their inception. Authors can't accept that there isn't a single disease called sepsis; it's an umbrella term for an unknown number of maladaptive responses to infection. No one knows what the dominant cause of death is, but researchers keep acting as if they do.
Finally, I think there is a third, unappreciated form of cognitive bias: believing that the rigor and precision of the statistical analysis can compensate for the lack of rigor and precision in the study's conception.
Critical care research is hopelessly contaminated by these biases.