ANDROMEDA-SHOCK (or, how to interpret HR 0.76, 95% CI 0.55-1.02, p=0.06)


I think we still disagree. You seem to be basing your statements on acceptance of the Neyman-Pearson approach, which has largely been abandoned by statisticians, and the entire idea of using a cutoff for statistical significance has been deemed not good statistical practice by the American Statistical Association.

To recap and to make sure my biases are out in the open (which is how I get maximum input from persons such as yourself):

  • I believe that hypothesis testing is a very bad idea. I want to “play the odds” in making decisions.
  • I believe that even if you want to do hypothesis testing, you should not make the absence-of-evidence-is-not-evidence-of-absence error. Thus null hypothesis testing is useful only if you “reject” the null. If you don’t “reject”, you are left in limbo, and the main reason you didn’t reject is often a sample-size limitation. When RA Fisher was asked what action to take after getting a large p-value he said “get more data.”
  • Use the compatibility interval more than the p-value. The trial’s data are compatible with a huge mortality reduction.
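The reported interval itself carries enough information to make this concrete. Under a normal approximation on the log hazard-ratio scale (an assumption here, purely for illustration; it will not exactly reproduce the Cox-model p-value the paper reports), the 95% CI pins down the standard error of log(HR), and from that one can check how the headline numbers hang together:

```python
import math

hr, lo, hi = 0.76, 0.55, 1.02  # reported point estimate and 95% CI

# On the log scale the 95% CI is log(hr) +/- 1.96*SE, so the SE is
# recoverable from the interval width.
se = (math.log(hi) - math.log(lo)) / (2 * 1.96)

# Two-sided p-value for the null HR = 1 under the normal approximation
# (close to, but not identical with, the p = 0.06 from the Cox model).
z = math.log(hr) / se
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

print(f"SE(log HR) = {se:.3f}, z = {z:.2f}, p = {p:.3f}")
```

The lower end of the interval (HR 0.55) shows the data are indeed compatible with a large mortality reduction, while the upper end (1.02) is what drives the "non-significant" label.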


I do indeed base my statements on the Neyman-Pearson approach. The topic was “power”, and there is nothing like “power” in Fisher’s approach. So when authors state that they have planned a study with this power and that size, I must assume that it’s about Neyman-Pearson, not Fisher, and that the collected data are thus tailored to allow a decision to either “accept A” or “accept B”.

Bayesian approaches were also long ago abandoned by statisticians… :wink: But if this is a case of acceptance testing, why is all the Neyman-Pearson terminology used? Why is the significance level at which the null hypothesis is rejected not discussed (why is p < 0.05 taken as a sensible choice, even in large studies)? The way it is presented, it looks like a pre-set, per-experiment level, just as required in Neyman’s approach - but almost entirely meaningless in Fisher’s approach.

To your recap:

  • I agree with your first point. That’s obvious. But that was not under discussion.

  • For your second point, it’s not clear to me what you mean. You write about “hypothesis testing” and about “null hypothesis testing”. To the best of my knowledge (see e.g. S. N. Goodman in Ann Intern Med. 1999;130:995-1004, E. L. Lehmann in JASA 88, 1242-1249), the term “hypothesis test” or “test of hypotheses” was coined by Neyman and means an “acceptance test”, in contrast to Fisher’s “significance tests” or “tests of significance” (“rejection tests”). Hence you don’t “reject” a null in a hypothesis test. You would actually not even compute a p-value, because it suffices to know in which acceptance region the test statistic falls. I assume this confusion is a legacy of NHST, the strange mixture of both philosophical concepts into something illogical, but widely taught. Now, given you mean “significance test”, I absolutely agree. A significance test serves only to see whether the data provide sufficient information to show (clearly enough) that the statistical model fails to explain the data.

  • Here I am a bit undecided. Should we encourage interpreting the data just because one side of the interval extends into interesting regions? The other side shows that the data are compatible (or not too incompatible) with even a slight increase in the HR. If, for instance, an HR < 0.8 would justify the effort of changing a standard procedure, shouldn’t the entire interval be below that limit? Or a different example: if the interval is so wide that it extends into interesting regions on both sides (say from 0.2 to 4), the conclusion is obviously different and no longer inclines one to “hope for some decrease in the HR” (although the interval still lies more on that side). After all, this interval is, just like the point estimate, “only” a sample statistic. Only if we have prior (:shushing_face:) grounds to believe that considerable changes in the HR in either direction are as likely as small to no changes may we believe (after seeing the data, with an interval extending relevantly into one side) that one treatment has a relevantly favourable outcome.


I think this harks back to the always-contentious discussion about the foundations of statistical inference, and the fact that reasonable people have very different world views about these issues is related to the difficulties inherent in frequentist statistical inference and the avoidance of optimal Bayes decision making. So let’s leave it at that. What I would always seek in situations such as this is a series of Bayesian posterior probabilities that the hazard ratio is less than r, for r = 1, .9, .8, .7. Once you get past the (important) arguments about the choice of a prior, playing the odds in a predictive decision-making mode is so much simpler to me.
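The kind of summary described above can be sketched with a normal approximation to the log hazard ratio. This is only an illustration under stated assumptions: a flat prior on log(HR) (so the posterior is driven entirely by the reported HR and CI), not the carefully argued prior a real reanalysis would use:

```python
import math

hr, lo, hi = 0.76, 0.55, 1.02  # reported HR and 95% CI
# SE of log(HR) implied by the interval width.
se = (math.log(hi) - math.log(lo)) / (2 * 1.96)

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Under a flat prior on log(HR), the posterior is approximately
# Normal(log(hr), se^2), so P(HR < r) is a normal tail area.
for r in (1.0, 0.9, 0.8, 0.7):
    p_less = phi((math.log(r) - math.log(hr)) / se)
    print(f"P(HR < {r}) ≈ {p_less:.2f}")
```

This prints probabilities of roughly 0.96, 0.86, 0.63, and 0.30 for r = 1, 0.9, 0.8, and 0.7 respectively: a "play the odds" summary that reads very differently from "p = 0.06, not significant".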


Can I suggest that this discussion be taken back to the journal? A letter of concern from this group to the editors could point out that using null hypothesis statistical tests to publish black-and-white conclusions about what is and isn’t so is unhelpful and indeed misleading to the average clinician.

Aside from everything else, there is the very important point that the ‘negative’ conclusion from the study is actually clinically important: that capillary refill time, which is a much lower-tech sign, shows no signs of being inferior. And this is good news for settings in which lactate measurement is not available.

I think that this would be an excellent platform to illustrate Bayesian approaches and their relevance to clinical decision making. The worked scenarios, including graphics, deserve a wider readership to say the least.

The discussion here has been really fascinating. The only disadvantage is that we’re preaching to the choir. We really need to get back to JAMA and get clinicians onside.


JAMA has (generally) shown reluctance to accept letters, but there is some hope: their series on Statistics and Methods indicates that at least someone there cares, Roger Lewis (brilliant clinician-researcher and Bayesian expert) is heavily involved with the journal, and I think they published the Bayesian reanalysis of the EOLIA data:

They may be amenable to something similar for ANDROMEDA-SHOCK, and it may even be in the works already - has anyone who commented on this thread had substantive discussions, or started working on a Bayesian reanalysis of ANDROMEDA-SHOCK similar to the EOLIA follow-up publication?


Oh, no. It’s fine, I don’t feel that way at all. All the comments were very pertinent and respectful. I’m glad we had this discussion.


YES! That is the point. In order to make a clear and simple report, we tied our hands. This trial taught me a lot. My Statistical Analysis Plans will never be the same…