A growing number of research papers propose that the “Fragility Index” be reported alongside RCT results. Unsurprisingly, when this has been investigated in surgery, critical care (PubMed 26963326) and more recently anaesthesia (PubMed 31025706), most studies have been found to be “fragile”.

I have numerous problems with the fragility index. For starters, RCTs are generally powered to detect a minimum clinically important difference. So a positive study should, almost by design, be “fragile” under this definition; if it is not, the study recruited more patients than was necessary to answer the question at hand (and hence, one could argue, was unethical).

This method appears to be gaining momentum, particularly in anaesthesia and critical care. I would be very interested to hear what this community thinks of the fragility index. Of course, if I am wrong, I would be happy to be educated and change my position.

thank you for saying this. i’ve been saying it too, but i think views range from indifference to mild support. i think it’s a classic case of clinicians with poor instincts for stats promoting an idea at statisticians, and the statos’ ready acquiescence is puzzling to me (although @f2harrell pointed out on here that the fragility index goes back decades, attributable to an epidemiologist). A stato shouldn’t hesitate to describe to their boss (a clinician) the tainted motivation behind this idea. i did, and if i had a spare moment i’d find a way to illustrate it with data … it’s the only way to persuade. It shows such poor instinct and timing, when the dichotomising of p-values is under incessant attack.

It took me a while to understand what you meant, and why this is such a bad idea.

I hope to make the idea explicit, if only to possibly correct gaps in my own understanding, but hopefully to give people who are somewhat confused a bit better intuition.

If we were to translate p values to an additive evidence metric, there is very little difference (in terms of evidence) between a p value slightly above a “significance” level, and slightly below.
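To put a rough number on this, one candidate for such an additive metric is the S-value, -log2(p), which measures surprise in bits (it comes up again further down the thread). A minimal Python sketch, with the function name `s_value` chosen here purely for illustration:

```python
from math import log2


def s_value(p):
    """Shannon surprisal of a p value, in bits: -log2(p)."""
    return -log2(p)


just_above = s_value(0.051)  # "not significant" at the 0.05 level
just_below = s_value(0.049)  # "significant" at the 0.05 level

print(f"p = 0.051: {just_above:.2f} bits")
print(f"p = 0.049: {just_below:.2f} bits")
print(f"difference: {just_below - just_above:.3f} bits")
```

The two values differ by well under a tenth of a bit, whereas a single fair coin toss (p = 0.5) already carries one full bit.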

The p value is itself a random variable; worrying about what it would take to get a “not significant” result is worrying about a random number of literally no significance (pun intended).

As indirect measures of evidence, exact p values should be reported for a data set. As a statistic for a decision rule, the rejection level is set a priori, and you are violating the spirit of the framework by worrying about sensitivity to results just above your threshold.

i wouldn’t know where to begin with the FI, but there is something immediately perverse about contemplating: “how much would i need to change these data to change the result?” It has the basic hallmarks of a corrupt idea: it is unsophisticated, and it gives the false impression that we have discerned something new about the data.

The fastest way to get the Fragility Index rejected would be to suggest that drug regulators take it into consideration by asking how many patients would it take to change a “not significant” finding to a “significant” one in terms of detecting adverse effects in drug studies.

Then, the weight of industry and academic opinion would be brought to bear on why this is a wrongheaded idea.

A few days ago, I just happened to dig out my copy of The Cult of Statistical Significance by academic economists Stephen T. Ziliak and Deirdre McCloskey.

According to the authors, a Vioxx clinical trial in 2000 was reported as having 5 heart attacks in the Vioxx group vs. only 1 in the control group. Based on the sample size, this difference in rate was “not statistically significant.”

Apparently, it later was discovered that the data was misrepresented, and there were actually 8 adverse events in the Vioxx group. Re-calculating the results made this a “statistically significant” increase. (pages 28-29).

A variation of the “Fragility Index” seems defensible to me when the analyst is examining data known to have elements missing not at random. Specifically, I’m thinking of the metric known as the fail-safe N for publication bias in a meta-analysis.

Does anyone think even in this scenario, the metric is flawed?

i think that’s ok because it’s in response to publication bias, similar to a worst-case analysis for missing data. Both are considered sensitivity analyses, maybe confined to a supplementary doc. But the FI is treated as supplementary to a p-value, as if it aids interpretation, and it is not an attempt to counteract bias or encourage scepticism where it might be warranted. I honestly believe it’s a manifestation of the feverish desire to publish; it’s a very easy thing to think about and write about, and papers on the FI usually amount to nothing more than promotional material.

I, also, am very much not a fan of the fragility index, and have jousted with folks about this on Twitter before. I think this short piece in JAMA Surgery covers my objections fairly well.

For one, the FI is just another flipped-around version of a p-value.

For two, rejection (or skepticism) of results based on the FI is at odds with the balance any RCT must strike: recruiting the minimal number of patients required to answer the study question (with whatever operating characteristics the trial is designed to have). I said this once on Twitter: if people have a problem with “fragile” results based on the p<0.05 cutoff, they’re basically saying that results at p<0.05 aren’t good enough (which might be a fair discussion!). But in that case they ought to be advocating for lower alpha levels (or Bayesian approaches with a very high threshold of certainty) rather than calculating the FI and pointing out the “fragility” of published results.

As for Paul’s point, I’m not sure that statisticians “readily acquiesce” to things like this so much as we aren’t even at the table when they’re conceived or published (same thing as our recent post-hoc power debacle shows). As far as I can tell from my discussions on Twitter, this has become popular with a certain profile of clinician: the hardcore skeptic, whose default position is basically “trust nothing” and therefore loves to find any reason possible to reject nearly any positive trial’s results. Plus, now it’s an easy way to get papers published by doing quick “systematic reviews” and shoving out a paper about how ‘fragile’ most published trials are. All we can do is write letters like Acuna et al did, but there’s only so much excess energy I have for fights about these things (see: post-hoc power debate) when I also have real teaching and work to do.

i think they always want a statistician among the authors, for completeness or credibility, or maybe they think they’re doing some junior stat a favour, but that’s the statistician’s opportunity to remove the fragility index, or remove their name. that’s what i mean when i say “readily acquiesce”, i.e. rolling over because maybe we think of ourselves as a service-provider.

I was first introduced to the idea of the FI as a way for laypeople to test (in an exceptionally crude way) the potential impact of missing data or measurement error. I wouldn’t argue it’s perfect for this, but I don’t think it’s a flipped-around version of a p-value in those cases. I think of it as just another sensitivity analysis. I have used it in client calls in the past when people were focused on whether a small potential misclassification in a publication could be driving results (it wasn’t). Yes, a measurement error model or PSA would be more robust, but this was a nice quick way to move focus to more relevant issues.

I should have clarified in my initial post that my issue tends to be more with the FI the way I’ve seen it discussed and used, because I agree that it can be (kind of) interesting to know, in the context of NHST, that the study was “1 person away” from the opposite conclusion, for the reasons that you’ve discussed here.

The reason I’ve grown so cranky about FI in practice is that a) it’s become trendy in some areas of skeptical-medicine-Twitter to dismiss or otherwise malign the results of well-done, well-reported trials because tHe rEsULtS aRe FrAgiLe and b) it’s become a cottage industry of sorts to write papers showing that trial results are fragile, which…well, referring to my above comment, if people don’t think results “just under” p<0.05 are strong evidence, then they should insist on a field-wide shift to lower alpha levels or Bayesian probabilities with high thresholds for approval / “positive” trials. Along with meta-analyses, this has become the go-to for anyone that wants to write an “easy” paper - hurriedly toss together a review of some trials, calculate the FI for each, write a paper saying that too many trials are fragile.

we should be clear though that that’s not what we’re objecting to. I have never seen the FI used that way, or described for that purpose; it doesn’t fit the definition of the FI, and it wouldn’t be recognised as the FI.

i anticipate that something like that re-definition may occur though; i’ve seen it before: when a proposed statistic starts to lose favour, its proponents broaden the definition until it encapsulates a more widely accepted statistic, making the two indistinguishable, as if that was intended all along. Famous cardiologist Milton Packer, i believe, did this with his personal ‘clinical composite’, which he later described as a special case of the global rank.

I think the interpretation in terms of patients lost to follow-up has been part of the definition at least since this pub: https://www.jclinepi.com/article/S0895-4356(13)00466-6/fulltext. They cite two earlier papers from the 90s, but I always assumed this paper was the one that kicked off the current craze (and I don’t have access to the earlier pubs!); it certainly dominates in terms of citations.

The person who introduced it to me must have come up with the misclassification application.

i didn’t read the full pdf, just their definition: “The Fragility Index was calculated by adding an event from the group with the smaller number of events (and subtracting a nonevent from the same group to keep the total number of patients constant) and recalculating the two-sided P-value for Fisher’s exact test.”

and that’s the definition i’m familiar with; i didn’t see anything about loss to follow-up. But if they did replace drop-outs with an event, then it’s just a worst-case-scenario imputation, or a perverse LOCF. Are they blending and confusing all these ideas together? sorry, maybe it deserves more attention, but my guess is i shouldn’t waste my time with these papers…
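For anyone who wants to see the quoted definition in action, here is a minimal Python sketch using only the standard library. It is one reading of the quoted sentence, not the paper’s own code: the function names, the convention of returning 0 for a result that is not significant to begin with, and the example trial are all assumptions made for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p value for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Number of nonevent-to-event switches, in the group with fewer events,
    needed to push the Fisher exact p value to alpha or above.

    Returns 0 if the result is not significant to begin with, and None if
    no number of switches reaches alpha.
    """
    # Work on the group with fewer events, as in the quoted definition.
    if events_a > events_b:
        events_a, n_a, events_b, n_b = events_b, n_b, events_a, n_a

    flips = 0
    while events_a + flips <= n_a:
        e = events_a + flips
        p = fisher_exact_two_sided(e, n_a - e, events_b, n_b - events_b)
        if p >= alpha:
            return flips
        flips += 1
    return None


# Hypothetical trial: 10/100 events in one arm versus 25/100 in the other.
print(fragility_index(10, 100, 25, 100))
```

Note that nothing about drop-outs enters the calculation; the only inputs are the four cell counts and the significance threshold.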

The lost to follow-up bit isn’t included in the math, just how they interpret it. For example from the abstract:

Results

The 399 eligible trials had a median sample size of 682 patients (range: 15–112,604) and a median of 112 events (range: 8–5,142); 53% reported a P-value <0.01. The median Fragility Index was 8 (range: 0–109); 25% had a Fragility Index of 3 or less. In 53% of trials, the Fragility Index was less than the number of patients lost to follow-up.

They don’t impute events for the drop-outs, but they do interpret the results in terms of drop-outs, e.g. could those patients account for the flip? It is essentially worst-case-scenario imputation, as far as I’ve always understood it.

I am surprised by some strong responses to the use of the “Fragility Index”. The non-identifiable intention of the proponents or users of the FI should have no bearing on one’s judgement of its worthiness. To be honest, I’d never heard of the FI before, and thought this was about some way to measure clinical prognosis. I believe the FI is both sound, because it points to an important issue in causal inference (more below), and at the same time flawed, because relying on p-values results in a circular argument (no more on this).

The FI points to the issue of how to make causal inferences in a context of limited data. An example of inference in the context of limited data is the classic 2x2 table with an empty cell or with a very small number of observations in a cell. This is relevant for randomized clinical trials, because randomization allows us to simplify the analysis to a 2x2 table. An estimate of the effect of the treatment, and p-values, could be obtained from a table with sparse data. However, the estimate of the effect would be extremely large and would have a very wide confidence interval. In consequence, most people would give little weight to the findings of such a study. Thus, it seems the FI tells us little we cannot infer from the point estimate and the confidence interval for the effect measure (risk ratio or rate ratio). Moreover, the FI does not provide a way to address the problem and, therefore, contributes little to the advancement of science. Fortunately, simple Bayesian approaches have been available for decades (such as the Haldane correction) and could easily address the problem of “fragility”. Basically, if one is concerned about the “fragility” of a trial (I guess they meant the “fragility of the findings of a trial”), one could pool them with findings from similar trials or with our educated guesses to make better (i.e. more robust) inferences about the effect of the intervention. Briefly, the FI does not identify a new problem, nor does it provide guidance on how to address it.
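As a rough illustration of the correction mentioned above, the sketch below adds 0.5 to every cell of a 2x2 table (often called the Haldane-Anscombe correction) before computing an odds ratio and a Wald interval on the log scale. The function name, the choice of the odds ratio rather than a risk ratio, and the example table are assumptions made for illustration only.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def haldane_odds_ratio(a, b, c, d, level=0.95):
    """Odds ratio for the table [[a, b], [c, d]] after adding 0.5 to each cell.

    Returns (estimate, lower, upper), where the interval is a Wald interval
    computed on the log odds ratio scale.
    """
    a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    log_or = log((a * d) / (b * c))
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return exp(log_or), exp(log_or - z * se), exp(log_or + z * se)


# Hypothetical sparse table: 0/10 events on treatment, 5/10 on control.
estimate, lower, upper = haldane_odds_ratio(0, 10, 5, 5)
print(f"OR = {estimate:.3f}, 95% interval {lower:.3f} to {upper:.3f}")
```

Without the correction the empty cell would give an odds ratio of exactly zero and an undefined standard error; with it, the estimate is finite and the wide interval makes the lack of information visible.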

So long as we are going to remain in the framework of Neyman, Pearson, and Fisher, why not simply perform some p value transform to get a direct measure of evidence for the alternative, such as \Phi^{-1}(1 - p), where p is one-tailed?

Another alternative (one that Sander Greenland proposes) is -log_{2}(p), which provides “bits of surprise.”

Each of these measures has the benefit that multiple studies can be added together into an overall evidential summary, at least in terms of the direction of the effect (if one interprets it as a measure of evidence in favor of the same direction in all studies), or existence (there was at least 1 “real” effect in N studies).
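To make the “added together” part concrete, here is a small Python sketch of two classical ways of combining independent one-sided p values: Stouffer’s method, which sums the probit-transformed values, and Fisher’s method, which sums log p values (equivalent, up to a constant factor, to summing the bits of surprise). The function names and the three example p values are hypothetical, and independence of the studies is assumed.

```python
from math import exp, factorial, log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def z_from_p(p):
    """Probit transform of a one-sided p value: z = Phi^{-1}(1 - p)."""
    return _STD_NORMAL.inv_cdf(1.0 - p)


def stouffer_combined_p(p_values):
    """Stouffer's method: sum the z scores and rescale by sqrt(k)."""
    z = sum(z_from_p(p) for p in p_values) / sqrt(len(p_values))
    return 1.0 - _STD_NORMAL.cdf(z)


def fisher_combined_p(p_values):
    """Fisher's method: -2 * sum(ln p) is chi-square with 2k df under the
    global null. For even df the upper tail has a closed form."""
    k = len(p_values)
    half_stat = -sum(log(p) for p in p_values)  # chi-square statistic / 2
    return exp(-half_stat) * sum(half_stat**i / factorial(i) for i in range(k))


# Hypothetical one-sided p values from three independent studies.
p_values = [0.08, 0.11, 0.06]
print(f"Stouffer: {stouffer_combined_p(p_values):.4f}")
print(f"Fisher:   {fisher_combined_p(p_values):.4f}")
```

In this made-up example no single study crosses 0.05, yet both combined p values are smaller than any of the individual ones, which is the sense in which the evidence accumulates.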

The Fragility Index draws attention to the arbitrary “significance” level. The significance level is defined by the utility of the experimenter. It is not a mathematically correct (indirect) evidence metric as a function of the data, as Fisher intended the p value to be.

References:
Greenland, S. (2017). Invited Commentary: The Need for Cognitive Science in Methodology. American Journal of Epidemiology, 186(6), 639–645. https://doi.org/10.1093/aje/kwx259

Futschik, A., Taus, T., & Zehetmayer, S. (2018). An omnibus test for the global null hypothesis. Statistical Methods in Medical Research. https://doi.org/10.1177/0962280218768326

I hoped not to get into a discussion about p-values. I still do, though it may be fun and instructive (for myself at least). Φ⁻¹(1 − p) and −log2(p) may help in the interpretation of p-values, at least by avoiding arbitrary cut-points. However, I believe that if we suddenly changed from p-values to −log2(p), people would start asking for the value of −log2(p) that makes something “surprising”. We may end up with a cut-point for “surprised” replacing the cut-point for p-values. Another limitation of these approaches is that they do not incorporate current knowledge in the process of inference. Well, to be fair, that’s not a problem with Φ⁻¹(1 − p) and −log2(p), but with us as users. In contrast, a Bayesian approach allows the explicit incorporation of previous knowledge. A −log2(p) of 8, for instance, may surprise some researchers, but not others. This would greatly depend on their substantive knowledge. Many researchers were surprised by a high prevalence of microcephaly in Brazil during 2016. Had they been asked to consider (explicitly) what was known about microcephaly when judging the current prevalence, they wouldn’t have been surprised or alarmed (I guess …).

You make valid points. I have no problem with the Bayesian approach; but users in my field (allied health), and many others, aren’t as mathematically prepared as they would need to be to adopt Bayesian methods, unfortunately.

It is the mathematical maturity needed that slows the adoption of Bayesian and Likelihood methods. Much research in Bayesian and Likelihood theory involves computation of integrals with no analytic solution AFAICT.

With classical methods, you simply need to compute tail probabilities of a known distribution, bootstrap for a confidence/compatibility interval, or do a permutation test if you have a randomized experiment.
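As an example of how little machinery the last of these needs, here is a minimal Monte Carlo permutation test in plain Python. It is only a sketch: the function name, the choice of the difference in means as the test statistic, the number of permutations, and the example data are all assumptions made for illustration.

```python
import random


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Monte Carlo two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return sum(a) / len(a) - sum(b) / len(b)

    observed = abs(mean_diff(pooled))
    hits = 0
    for _ in range(n_permutations):
        # Re-randomize the arm labels and recompute the statistic.
        rng.shuffle(pooled)
        if abs(mean_diff(pooled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical binary outcomes (1 = event) in two randomized arms.
treated = [1] * 3 + [0] * 17
control = [1] * 9 + [0] * 11
print(permutation_test(treated, control))
```

The only probabilistic ingredient is the random assignment itself, which is why the test is justified by the randomization and requires no integration.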

Like many others, they don’t even get NHST correct. The p value transforms at least get them started on the right path for better methods (Bayesian and Likelihood) without sacrificing rigor.