ANDROMEDA-SHOCK (or, how to interpret HR 0.76, 95% CI 0.55-1.02, p=0.06)


I’m starting this with intent to have a good scrap/thread about how to interpret the ANDROMEDA-SHOCK trial:

We’ve seen this basic problem before: most likely for budget/logistical reasons, the study size was capped at something that was only likely to reject the null if there is a very large effect (power calculation was based on capillary refill time [vs lactate] reducing mortality from 45% to 30%).
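As a sanity check on that design assumption, the standard two-proportion normal-approximation formula gives roughly the following (a rough sketch of my own; the trial's actual calculation may have differed, e.g. it may have been log-rank based):

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for comparing two proportions."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided alpha
    z_b = NormalDist().inv_cdf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_a + z_b) ** 2 * var / (p1 - p2) ** 2)

print(n_per_arm(0.45, 0.30))  # ~160 per arm for the 15-point design assumption
print(n_per_arm(0.45, 0.40))  # ~1530 per arm for a 5-point difference
```

The point being: assuming a 15-point absolute reduction keeps the trial an order of magnitude smaller than a trial sized to detect a more modest, but still clinically meaningful, difference.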

Actual results: 28-day mortality was lower with CRT (34%) than lactate (43%) but the final Cox model shows HR=0.76, 95% CI 0.55-1.02, p=0.06. As is customary, the conclusion was forced to state something like “no significant difference” about this trial or that use of CRT “did not reduce” mortality versus lactate.

I’ll need some clinical input from critical care folks, but it’s my understanding that capillary refill time is an easier thing to do than lactate, which is going to be germane in this discussion as well. If that is true, it’s arguable that even a conclusion of non-inferiority would support use of CRT rather than routine lactate measurement, but the trial was not designed as non-inferiority.

I have plenty more to say but that ought to be enough that we can open the floor for discussion.


I’ll pitch in straight away: that JAMA headline is hugely misleading. “Didn’t reduce all-cause 28-d mortality” is just wrong.


Lactate is often measured by blood-gas machines present in every western ICU. It provides “instant gratification” and “objective” data - possibly the main reason why decommissioning many harmful therapies (e.g. large tidal-volume ventilation) is so hard to do. Capillary refill time is likely to be perceived as less “objective” and harder to document. Therefore I’m a little pessimistic that it will replace lactate measurement in well-equipped ICUs. The study is very good news for those working in less well-endowed ICUs, however.


How is an inconclusive study good news for anyone? JAMA and the study authors are making a classic “absence of evidence is not evidence of absence” error.

There are other issues, such as whether the most accurate confidence interval (a log-likelihood-based one, from the likelihood profile) would make a difference, and whether even with the data in hand the study is “negative”. A Bayesian posterior probability that the HR < 1.0 is likely to be large even with a skeptical prior. A bettor would be unwise to bet on the HR being >= 1.0.
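To make that concrete, here is a back-of-the-envelope version of such a posterior (my own sketch, not anything from the trial: a normal approximation on the log-HR scale, with a skeptical prior centered on HR = 1 whose 95% mass between HR 0.5 and 2 is my assumption):

```python
from math import log, sqrt
from statistics import NormalDist

z = NormalDist()

# Reported Cox model result: HR 0.76, 95% CI 0.55-1.02
est = log(0.76)
se = (log(1.02) - log(0.55)) / (2 * z.inv_cdf(0.975))

# Skeptical prior on log(HR): centered on no effect (HR = 1),
# with 95% of its mass between HR 0.5 and HR 2 (assumed)
prior_sd = log(2) / z.inv_cdf(0.975)

# Conjugate normal-normal update: precision-weighted average
w_prior, w_data = 1 / prior_sd ** 2, 1 / se ** 2
post_mean = (w_data * est) / (w_prior + w_data)  # prior mean is 0
post_sd = sqrt(1 / (w_prior + w_data))

p_benefit = z.cdf((0 - post_mean) / post_sd)  # Pr(HR < 1 | data, prior)
print(round(p_benefit, 2))  # about 0.94 under these assumptions
```

So even after shrinking toward “no effect”, the posterior probability that CRT is better comes out around 94% in this crude approximation - hardly the picture painted by “did not reduce mortality”.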


If your argument is that a Bayesian analysis would likely support capillary refill over lactate, then that is certainly good news for those who cannot afford a blood gas machine.


Thanks very much for the additional clinical context. I would certainly say that this data (at a bare minimum) supports the idea that a strategy based on capillary refill time is a more-than-acceptable substitute for lactate measurement in ICUs where lactate machines are not available. As a few folks have already commented, if there was money to be made from capillary refill time, the post-trial spin would likely be quite different…


My guess, and probably yours as well, is that the study authors did not prefer this wording/conclusion, but were railroaded into this as a condition of publishing in JAMA.

An interesting question: suppose that the authors had decided to withdraw the study from JAMA to protest being forced to use this language, and instead sought to publish in a less-prestigious journal. What are the relative ramifications for the authors, the trial, and medicine at large?

Sure, it would be nice for the authors to take a principled stand against this language, but at the cost of publishing in a much less visible journal - perhaps Annals of the American Thoracic Society, which has quite progressive statistical guidelines and may have been more willing to report this result in a more sensible fashion. But Annals of the ATS is less visible than JAMA, and may not have reached the intended clinical audience as well. Plus, moving down the impact scale is no guarantee of better treatment - reviewers at progressively lower-tier journals are (sometimes) less methods-savvy, but also more dogmatic in applying the few statistical principles they do know.

(Not saying there’s a right or a wrong here, just talking through the relative consequences of authors taking a stand against this language. If it means getting shoved down a couple of journal-impact tiers before someone finally does not make you use the phrase “no difference” or “did not reduce mortality”, the cost is not just personal prestige for the authors - the impact/visibility of the findings also takes a big hit. In the case of ANDROMEDA-SHOCK, despite the “non-significant” result, it looks like this may have a meaningful impact on clinical practice anyway, as discussed above.)


Worth noting: one of the study authors responded to this tweet to briefly discuss their MCID / power calculation. I will invite him to post here, if he is willing.

Interesting fact that I saw on Twitter: this study received no funding, but was run entirely on the goodwill of practitioners who thought this was an important question. So I’m apt to go easy on the trialists themselves, since they did something pretty incredible to run a trial of this scale with no money.


The twitter replies seem to conflate MCID with ‘best guess at what the true effect might be’. I see this a lot, but the two things seem to me to have very little in common (conceptually). Which is the most appropriate basis for a sample size calculation?


That’s a good point, and indeed the two are often discussed interchangeably when they are not the same. I think MCID is the better choice for a sample size calculation, ideally paired with an interim monitoring strategy that allows for early stopping if the effect is so large that it’s no longer ethical to randomize because the data already are overwhelmingly in favor of one therapy.

Using the “best guess at what the true effect might be” is prone to leaving studies underpowered if there is a “real” effect but it’s smaller than the “best guess” that was used (ANDROMEDA-SHOCK is a perfect example; EOLIA is another). Using the MCID is closer to the purpose of doing the trial, IMO - let’s figure out if the treatment works well enough that it merits use in clinical practice.
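To put a number on that with ANDROMEDA-SHOCK-like figures (a hypothetical normal-approximation sketch, not the trial's actual design calculation):

```python
from math import ceil, sqrt
from statistics import NormalDist

z = NormalDist()
za, zb = z.inv_cdf(0.975), z.inv_cdf(0.80)

# n per arm when the trial is sized for the optimistic "best guess" of 45% vs 30%
n = ceil((za + zb) ** 2 * (0.45 * 0.55 + 0.30 * 0.70) / 0.15 ** 2)

# Power of that same n if the true effect is a smaller 45% vs 40%
se = sqrt((0.45 * 0.55 + 0.40 * 0.60) / n)
pw = z.cdf(0.05 / se - za)
print(n, round(pw, 2))  # ~160 per arm, but only ~15% power for a 5-point effect
```

A trial sized for the best guess has almost no chance of reaching p < 0.05 when the true effect is smaller but still well worth having - which is exactly the trap a “negative” headline then falls into.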


I believe asking clinicians to define an MCID will often return a blank stare. The truth is that we take whatever we can get. This is why oncologists offer therapies with little demonstrable benefit and intensivists find it hard to terminate treatment of patients who still have a heartbeat. In the absence of strong leadership, only monetary constraints, demonstrable harm, or lack of capacity will prevent us from grasping at straws…


As someone who was aware of the study protocol, I have some comments to make:

  1. As Andrew mentioned, the study was conducted with no funding, only the willingness of the centers to help, so we should probably go easy on harsh comments about sample size, etc.
  2. I believe that in this specific scenario a non-inferiority trial could be helpful. Simply showing that CRT is non-inferior to lactate-clearance-guided therapy could be very helpful in resource-constrained settings.
  3. I am pretty sure the authors are aware of the limitations of the conclusion. Both Alexandre Biasi and Lucas Damiani are experienced trialists, as are most of the steering committee. Andrew made an excellent point in stating that sometimes journals require more direct conclusions. JAMA recently published an awesome Bayesian analysis of the EOLIA trial, which signals, in my opinion, an important shift.


Now that you mention it, ANDROMEDA-SHOCK would be very well suited to the same sort of follow-up publication as the EOLIA example…


The bigger shift I believe is necessary is not just from frequentist to Bayesian: the glamor journals need to change their perspective on RCTs. There is zero doubt that RCTs are the best way of gathering medical evidence of therapeutic efficacy. However, they must be designed and implemented well if a result worthy of generalization to clinical practice is desired. An RCT so underpowered that a quite large reduction in death (an 8.5% absolute difference at 28 days!) is not statistically significant has a major design flaw.

It’s important to publish both positive and negative trials, but it is also important not to take flawed “negative” trials and present potentially strong signals with unwavering, overly rigid interpretations. The formulaic interpretation - RCTs are infallible, and p>0.05 means the intervention is useless - is deeply flawed.


This is a good example where a conditional equivalence test might have been helpful.
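For what it’s worth, here is a crude TOST-style sketch of that idea on the reported numbers (my own stand-in, not the conditional equivalence procedure itself; the HR 0.8-1.25 margin is entirely my assumption, nothing was pre-specified by the trial):

```python
from math import log
from statistics import NormalDist

z = NormalDist()

# Reported result: HR 0.76, 95% CI 0.55-1.02, normal approximation on log scale
est = log(0.76)
se = (log(1.02) - log(0.55)) / (2 * z.inv_cdf(0.975))

# Hypothetical margin (assumed): call HRs between 0.8 and 1.25 "practically equivalent"
margin = log(1.25)

p_worse = z.cdf((est - margin) / se)       # H0: HR >= 1.25 (CRT meaningfully worse)
p_better = 1 - z.cdf((est + margin) / se)  # H0: HR <= 0.80 (CRT meaningfully better)
print(round(p_worse, 4), round(p_better, 2))
```

Under this assumed margin the data comfortably reject “CRT meaningfully worse” (a non-inferiority-style conclusion), while strict two-sided equivalence fails only because a meaningful benefit of CRT cannot be ruled out - quite a different story from “did not reduce mortality”.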


Agreed, with one caveat: given that this trial was entirely unfunded, run on the goodwill of the participating physicians and centers alone - I am inclined to go easy on the ANDROMEDA-SHOCK trialists for this. Yes, it’s a design flaw, and one of the trial investigators who spoke to me on Twitter admitted as much, but this trial seems to have some value as-is (this isn’t a drug/device approval on the line, but probably suggestive that at a bare minimum capillary refill time is an acceptable alternative to lactate monitoring where lactate monitoring is not available).

But I fully agree with this:

The rote use of phrases like “did not reduce…” and “no significant difference” is extremely frustrating, and something that journals should do better with.


I don’t disagree, but I wasn’t aware that top journals gave extra credit for running a trial on too few resources rather than for the clinical implications of the study (unless the use of fewer resources is itself innovative, e.g. registry RCTs).

Here they elevate a trial which deserves credit but is no more than hypothesis-generating in its results (as analyzed). In doing so, they subject it to a false rigor in interpretation/reporting that was developed for trials with designs and power sufficient to be definitive.


I’ve enjoyed scrolling through this discussion. I share Andrew’s frustration. When I first “got into” trials I did not find it easy (and sometimes still don’t) to find accurate alternative wording. My preference is to let the numbers speak for themselves and simply say “the difference was X (CI …)” or “the change was …”. But others may have better ways of expressing these things - I’d love to hear them. In particular, I’m interested in how to express a result when, for example, the lower bound of the CI is more important than the point estimate.


An expensive invasive test vs something you can do yourself, establishing rapport and actually justifying your clinical salary.

So yes, the study killed the lactate, even if one wanted to state “no difference” based on p=0.06.

Nothing to discuss - no lactate for you


Agree with @chrisarg that the capillary refill method is easier and cheaper than lactate measurement. So, whether you use frequentist or Bayesian methods, capillary refill wins.
@ADAlthousePhD: I did an amateur Bayesian calculation using the study results, following the Kaul & Diamond paper/calculator and the Wijeysundera paper.
@f2harrell or @Anupam_singh may help with detailed code.