ANDROMEDA-SHOCK (or, how to interpret HR 0.76, 95% CI 0.55-1.02, p=0.06)

I think we still disagree. You seem to be basing your statements on acceptance of the Neyman-Pearson approach, which has largely been abandoned by statisticians, and the entire idea of using a cutoff for statistical significance has been deemed poor statistical practice by the American Statistical Association.

To recap and to make sure my biases are out in the open (which is how I get maximum input from persons such as yourself):

  • I believe that hypothesis testing is a very bad idea. I want to “play the odds” in making decisions.
  • I believe that even if you want to do hypothesis testing, you should not make the error of treating absence of evidence as evidence of absence. Thus null hypothesis testing is useful only if you “reject” the null. If you don’t “reject”, you are left in limbo, and the main reason you didn’t reject is often a sample size limitation. When R. A. Fisher was asked what action to take after getting a large p-value, he said “get more data.”
  • Use the compatibility interval more than the p-value. The trial’s data are compatible with a huge mortality reduction.
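
As a rough illustration of that last bullet (a sketch using only the numbers in the thread title, not any trial data), the interval bounds translate into relative effects like this:

```r
# Reported hazard ratio and 95% compatibility interval from the thread title
hr    <- 0.76
lower <- 0.55
upper <- 1.02

(1 - lower) * 100  # ~45% reduction in hazard: compatible with a large benefit
(upper - 1) * 100  # ~2% increase in hazard: also compatible with a small harm
```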
3 Likes

I do indeed base my statements on the Neyman-Pearson approach. The topic was “power”, and there is nothing like “power” in Fisher’s approach. So when authors state that they have planned a study with this power and that sample size, I must assume it is about Neyman-Pearson, not Fisher, and that the collected data are thus tailored to allow a decision to either “accept A” or “accept B”.

Bayesian approaches were also abandoned by statisticians for a long time… :wink: But if this is the case for acceptance tests, why is all the Neyman-Pearson terminology used? Why is the significance level at which a null hypothesis is rejected not discussed (why is p < 0.05 taken as a sensible choice, even in large studies)? The way it is presented looks like a pre-set (per-experiment) level, just as required in Neyman’s approach - but almost entirely meaningless in Fisher’s approach.

To your recap:

  • I agree with your first point. That’s obvious. But that was not under discussion.

  • For your second point, it’s not clear to me what you mean. You write about “hypothesis testing” and about “null hypothesis testing”. To the best of my knowledge (see e.g. S. N. Goodman in Ann Intern Med. 1999;130:995-1004, E. L. Lehmann in JASA 88, 1242-1249), the term “hypothesis test” or “test of hypotheses” was coined by Neyman and means an “acceptance test”, in contrast to Fisher’s tests, which were called “significance tests” or “tests of significance” (“rejection tests”). Hence you don’t “reject” a null in a hypothesis test. You would actually not even compute a p-value, because it suffices to know in which acceptance region the test statistic falls. I assume this is a legacy of NHST, the strange mixture of both philosophical concepts into something illogical, but widely taught. Now, given you mean “significance test”, I absolutely agree. A significance test is just there to see whether the data provide sufficient information to show (clearly enough) that the statistical model fails to explain the data.

  • Here I am a bit undecided. Should we encourage interpreting the data just because one side of the interval extends into interesting regions? The other side shows that the data are compatible (or not too incompatible) with even a slight increase in the HR. If an HR < 0.8, for instance, would justify the effort of changing a standard procedure, shouldn’t the entire interval be below that limit? Or a different example: if the interval is simply so wide that it extends into interesting regions on both sides (say from 0.2 to 4), the conclusion is obviously different and no longer inclines us to “hope for some decrease in the HR” (although the interval still lies mostly on that side). After all, this interval, just like the point estimate, is “only” a sample statistic. Only if we have prior (:shushing_face:) grounds to believe that considerable changes in the HR in either direction are as likely as small to no changes may we believe (after seeing the data, with an interval extending relevantly to one side) that one treatment has a relevantly favourable outcome.

I think this harks back to the always contentious discussion about the foundations of statistical inference, and the fact that reasonable people hold very different world views about these issues is related to the difficulties inherent in frequentist statistical inference and the avoidance of optimal Bayes decision making. So let’s leave it at that. What I would always seek in situations such as this is a series of Bayesian posterior probabilities that the hazard ratio is less than r, where r = 1, 0.9, 0.8, 0.7. Once you get past the (important) arguments about the choice of prior, playing the odds in a predictive decision-making mode is so much simpler to me.
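
A minimal sketch of what such a summary could look like, assuming only the published estimate (HR 0.76, 95% CI 0.55-1.02), a flat prior, and a normal approximation to the log hazard ratio (so a crude stand-in, not a substitute for the formal reanalysis discussed later in this thread):

```r
# Back out an approximate standard error for the log hazard ratio from the
# reported 95% interval, then treat the normal approximation as a posterior
# under a flat prior.
log_hr <- log(0.76)
se     <- (log(1.02) - log(0.55)) / (2 * qnorm(0.975))

# Posterior probability that the true HR is below each threshold r
r <- c(1, 0.9, 0.8, 0.7)
p <- pnorm(log(r), mean = log_hr, sd = se)
setNames(round(p, 2), paste0("P(HR < ", r, ")"))
```

Under these rough assumptions the probabilities come out around 0.96, 0.86, 0.63 and 0.30 - the kind of graded statement a clinician can weigh, rather than a single yes/no verdict.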

1 Like

Can I suggest that this discussion be taken back to the journal? A letter of concern from this group to the editors could point out that using null hypothesis statistical tests to publish black-and-white conclusions about what is and isn’t so is unhelpful and indeed misleading to the average clinician.

Aside from everything else, there is the very important point that the ‘negative’ conclusion from the study is actually clinically important: capillary refill time, which is a much lower-tech sign, shows no sign of being inferior. And this is good news for settings in which lactate is not available.

I think that this would be an excellent platform to illustrate Bayesian approaches and their relevance to clinical decision making. The worked scenarios, including graphics, deserve a wider readership to say the least.

The discussion here has been really fascinating. The only disadvantage is that we’re preaching to the choir here. We really need to get back to JAMA and get clinicians onside.

3 Likes

JAMA has (generally) shown reluctance to accept letters, but there is some hope - their series on Statistics and Methods indicates that at least someone there cares, Roger Lewis (brilliant clinician-researcher and Bayesian expert) is heavily involved with the journal, and I think they published the Bayesian reanalysis of the EOLIA data.

They may be amenable to something similar for ANDROMEDA-SHOCK, and it may even be in the works already - has anyone that commented on this thread had substantive discussions or started working on a Bayesian reanalysis of ANDROMEDA-SHOCK similar to the EOLIA follow-up publication?

2 Likes

Oh, no, it’s fine. I don’t feel that way at all. All the comments were very pertinent and respectful. I’m glad we are having this discussion.

YES! That is the point. In order to make a clear and simple report, we tied our hands. This trial taught me a lot. My Statistical Analysis Plans will never be the same…

2 Likes

We have submitted a Bayesian reanalysis of this trial. It’s been under review for 5 months, as I type. I will share the results here if the paper is eventually published.

4 Likes

Now that I have had time to read this thread, I found it very interesting and enlightening about the advantages of Bayesian models. I have learned a lot. Thank you very much to all.

1 Like

Hi all,
The Bayesian reanalysis of ANDROMEDA is now published in the Blue Journal at: https://www.atsjournals.org/doi/abs/10.1164/rccm.201905-0968OC
The full R script used is available in the appendix.
Analyses were done using brms and rstanarm. Unlike the main paper, we used binary mortality at 28 and 90 days as the outcome and compared the results with the frequentist approach.
Both were mixed (or hierarchical, as you please) models with the same adjustments.
We used 4 different sets of priors for the Bayesian analysis (optimistic, neutral, null and pessimistic), all of which were somewhat weak (not really strong priors); a minimal sketch of the model setup is at the end of this post.
We also reported SOFA clearance in an attempt to use Bayesian networks, which ultimately required some categorization of SOFA (sorry, Frank Harrell =).
We did not use survival models for two reasons:

  1. It is hard to fit a parametric survival model in this scenario (a Weibull did not fit well), and the semi-parametric Bayesian approaches I found were hard to compare with the traditional Cox model.
  2. Reviewers and editors wanted both methods (Bayesian and frequentist) to be compared using the same approach for the reader, and it was easier to do so with binomial models.

The discussion tries to be balanced and does not favor either approach. We tried not to blame p-values as a curse on mankind, but we do mention how p-value dichotomization may be bad.
We really hope you like it, and we are all willing to discuss it.
Data sharing is definitely an option if anyone is interested, within reasonable limits and under a drafted protocol.
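
For readers who want to see the shape of the model described above, here is a minimal brms sketch (hypothetical variable names and illustrative prior scales, not the published appendix script) of one arm of the analysis: 28-day mortality as a Bernoulli outcome with a site-level random intercept and a weakly informative prior on the treatment effect.

```r
library(brms)

# "Neutral" weakly informative priors on the log-odds scale; the scales are
# illustrative and are not the values used in the published reanalysis.
priors_neutral <- c(
  prior(normal(0, 1),   class = "b"),         # treatment log odds ratio, centred on no effect
  prior(normal(0, 2.5), class = "Intercept")
)

# Hierarchical logistic model: death_28d is 0/1, treatment is coded 0/1,
# and 'andromeda' is a hypothetical data frame with one row per patient.
fit <- brm(
  death_28d ~ treatment + (1 | site),   # plus the trial's baseline adjustments
  data   = andromeda,
  family = bernoulli(link = "logit"),
  prior  = priors_neutral,
  chains = 4, iter = 2000, seed = 1234
)

# Posterior probability of benefit, P(odds ratio < 1)
draws <- as_draws_df(fit)
mean(draws$b_treatment < 0)
```

The optimistic, null and pessimistic prior sets would simply shift or tighten the normal prior on the treatment coefficient; everything else stays the same.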
9 Likes

I’m a clinician early in a research career. I’ve always been fascinated by statistics but had stopped studying it for lack of time. This thread, with the reanalysis and how it all came alive on Twitter, has inspired me to start studying again. Thank you all! This discussion is great!

2 Likes

New comments from David Spiegelhalter on this study, and how it represents the broader issue of drawing conclusions from statistical significance testing.

5 Likes

I fully agree with this article and also did not answer the questionnaire by John Ioannidis and Tom Hardwicke as the options given were unsatisfactory. I do not use power calculations to interpret clinical trials after they are completed and find it largely inappropriate to do so regardless of whether the analysis is more “frequentist” or “Bayesian”.

2 Likes

Agree. The question was poorly written and IMO the entire questionnaire was somewhat of a baiting exercise. None of those choices reflected how I would interpret the results.

2 Likes