Fragility Index

We should be clear, though, that that’s not what we’re objecting to. I have never seen the FI used that way or described for that purpose; it doesn’t fit the definition of the FI, it wouldn’t be recognised as the FI, and so on.

I anticipate that something like that re-definition may occur, though; I’ve seen it before: when a proposed statistic starts to lose favour, its proponents broaden the definition until it encapsulates a more widely accepted statistic, thus making it indistinguishable, as if that had been intended all along. The famous cardiologist Milton Packer, I believe, did this with his personal ‘clinical composite’, which he later described as a special case of the global rank.

I think the interpretation in terms of patients lost to follow-up has been part of the definition at least since this publication: https://www.jclinepi.com/article/S0895-4356(13)00466-6/fulltext. They cite two earlier papers from the 90s, but I always assumed this paper was the one that kicked off the current craze (and I don’t have access to the earlier publications!); it certainly dominates in terms of citations.

The person who introduced it to me must have come up with the misclassification application themselves.

I didn’t read the full PDF, just their definition: “The Fragility Index was calculated by adding an event from the group with the smaller number of events (and subtracting a nonevent from the same group to keep the total number of patients constant) and recalculating the two-sided P-value for Fisher’s exact test.”

And that’s the definition I’m familiar with; I didn’t see anything about loss to follow-up. But if they did replace dropouts with an event, then it’s just a worst-case-scenario imputation, or a perverse LOCF. They’re blending and confusing all these ideas together? Sorry, maybe it deserves more attention, but my guess is I shouldn’t waste my time with these papers…
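For what it’s worth, that quoted definition is mechanical enough to sketch in a few lines. Here is a minimal sketch in Python using scipy’s fisher_exact; the function name, the 0.05 threshold, and the example numbers are my choices, not from the paper:

```python
from scipy.stats import fisher_exact

def fragility_index(e1, n1, e2, n2, alpha=0.05):
    """Events e and sample sizes n per arm; counts the event-status changes
    needed to push the two-sided Fisher exact p-value above alpha."""
    if e2 < e1:  # work on the arm with the smaller number of events
        e1, n1, e2, n2 = e2, n2, e1, n1
    fi = 0
    _, p = fisher_exact([[e1, n1 - e1], [e2, n2 - e2]])
    while p < alpha and e1 < n1:
        e1 += 1  # add an event; nonevents (n1 - e1) shrink, so n1 stays fixed
        _, p = fisher_exact([[e1, n1 - e1], [e2, n2 - e2]])
        fi += 1
    return fi

print(fragility_index(10, 100, 25, 100))  # hypothetical 10/100 vs 25/100 trial
```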

The lost-to-follow-up bit isn’t included in the math, just in how they interpret it. For example, from the abstract:

Results

The 399 eligible trials had a median sample size of 682 patients (range: 15–112,604) and a median of 112 events (range: 8–5,142); 53% reported a P-value <0.01. The median Fragility Index was 8 (range: 0–109); 25% had a Fragility Index of 3 or less. In 53% of trials, the Fragility Index was less than the number of patients lost to follow-up.

They don’t do this, but they do interpret the results in terms of dropouts, e.g., could those patients account for the flip. It is essentially worst-case-scenario imputation, as far as I’ve always understood it.

1 Like

I am surprised by some of the strong responses to the use of the “Fragility Index”. The unknowable intentions of the proponents or users of the FI should have no bearing on one’s judgement of its worthiness. To be honest, I had never heard of the FI before, and thought this was about some way to measure clinical prognosis. I believe the FI is both sound, because it points to an important issue in causal inference (more below), and at the same time flawed, because relying on p-values results in a circular argument (no more on this).
The FI points to the issue of how to make causal inferences in a context of limited data. A classic example of inference with limited data is the 2x2 table with an empty cell, or with a very small number of observations in a cell. This is relevant for randomized clinical trials, because randomization allows us to simplify the analysis to a 2x2 table. An estimate of the effect of the treatment, and p-values, can be obtained from a table with sparse data. However, the estimate of the effect will be extremely large and will have a very wide confidence interval. In consequence, most people would give little weight to the findings of such a study. Thus, it seems the FI tells us little that we cannot infer from the point estimate and the confidence interval for the effect measure (risk ratio or rate ratio). Moreover, the FI does not provide a way to address the problem and, therefore, contributes little to the advancement of science.

Fortunately, simple Bayesian approaches have been available for decades (such as the Haldane correction) and could easily address the problem of “fragility”. Basically, if one is concerned about the “fragility” of a trial (I guess they mean the “fragility of the findings of a trial”), one could pool its findings with those of similar trials, or with our educated guesses, to make better (i.e., more robust) inferences about the effect of the intervention. Briefly, the FI neither identifies a new problem nor provides guidance on how to address one.
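To make the Haldane suggestion concrete, here is a minimal sketch: add 0.5 to every cell of the sparse 2x2 table before estimating the odds ratio (which amounts to a weak Jeffreys-type prior on each cell). The function name and example table are mine:

```python
import math

def odds_ratio_haldane(a, b, c, d):
    """a, b = events/nonevents in arm 1; c, d = events/nonevents in arm 2."""
    a, b, c, d = (x + 0.5 for x in (a, b, c, d))  # Haldane-Anscombe correction
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)     # Woolf SE on the log scale
    lo, hi = (math.exp(math.log(or_) + s * 1.96 * se_log) for s in (-1, 1))
    return or_, (lo, hi)

# A table with an empty cell: 0/20 events vs 5/20 events no longer blows up
print(odds_ratio_haldane(0, 20, 5, 15))
```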

So long as we are going to remain in the framework of Neyman, Pearson, and Fisher, why not simply apply a p-value transform that gives a direct measure of evidence for the alternative, such as \Phi(p) where p is one-tailed?

Another alternative (that Sander Greenland proposes) is -\log_{2}(p), which provides “bits of surprise.”

Each of these measures has the benefit that multiple studies can be added together into an overall evidential summary, at least in terms of the direction of the effect (if one interprets it as a measure of evidence in favor of a common direction across studies) or of existence (there was at least one “real” effect among the N studies).
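A quick sketch of both transforms, plus one concrete way of “adding studies together” (Stouffer-style summation of z-scores; the p-values below are hypothetical):

```python
import math
from scipy.stats import norm

def z_from_p(p_one_tailed):
    """z-score as a directional evidence measure from a one-tailed p."""
    return norm.ppf(1 - p_one_tailed)

def s_value(p):
    """Greenland's surprisal: bits of information against the null."""
    return -math.log2(p)

p_values = [0.04, 0.20, 0.01]                 # hypothetical one-tailed p's
z = [z_from_p(p) for p in p_values]
z_combined = sum(z) / math.sqrt(len(z))       # Stouffer's combined z
print([round(s_value(p), 1) for p in p_values], round(z_combined, 2))
```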

The Fragility Index draws attention to the arbitrariness of the “significance” level. The significance level is set by the utilities of the experimenter; it is not a mathematically correct (indirect) evidence metric computed as a function of the data, as Fisher intended the p-value to be.

References:
Greenland, S. (2017). Invited Commentary: The Need for Cognitive Science in Methodology. American Journal of Epidemiology, 186(6), 639–645. https://doi.org/10.1093/aje/kwx259

Futschik, A., Taus, T., & Zehetmayer, S. (2018). An omnibus test for the global null hypothesis. Statistical Methods in Medical Research. https://doi.org/10.1177/0962280218768326

I hoped not to get into a discussion about p-values, and I still do, though it may be fun and instructive (for myself at least). Φ(p) and −log2(p) may help in the interpretation of p-values, at least to avoid arbitrary cut-points. However, I believe that if we suddenly changed from p-values to −log2(p), people would start asking for the value of −log2(p) that makes something “surprising”. We may end up with a cut-point for “surprising” replacing the cut-point for p-values. Another limitation of these approaches is that they do not incorporate current knowledge into the process of inference. Well, to be fair, that’s not a problem with Φ(p) and −log2(p), but with us as users. In contrast, a Bayesian approach allows the explicit incorporation of previous knowledge. A −log2(p) of 8, for instance, may surprise some researchers but not others; this would greatly depend on their substantive knowledge. Many researchers were surprised by the high prevalence of microcephaly in Brazil during 2016. Had they been asked to consider (explicitly) what was known about microcephaly when judging the current prevalence, they wouldn’t have been surprised or alarmed (I guess…).

1 Like

@lbautista

You make valid points. I have no problem with the Bayesian approach; but users in my field (allied health), and many others, aren’t as mathematically prepared as they would need to be to adopt Bayesian methods, unfortunately.

It is the mathematical maturity needed that slows the adoption of Bayesian and likelihood methods. Much research in Bayesian and likelihood theory involves the computation of integrals with no analytic solution, AFAICT.

With classical methods, you simply need to compute tail probabilities of a known distribution, bootstrap for a confidence/compatibility interval, or do a permutation test if you have a randomized experiment.

Like many others, they don’t even get NHST correct. The p-value transforms at least get them started on the right path toward better methods (Bayesian and likelihood) without sacrificing rigor.

You have a point. However, our difficulties with the Bayesian approach also have a lot to do with their absence from our training.

Conceptually, the FI seems to resemble the E-value for observational studies. Do the objections and responses to the E-value help in this discussion, e.g., regarding the flipping of p-values?
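For readers unfamiliar with it, the E-value referenced above has a closed form (VanderWeele & Ding); a one-line sketch for a risk ratio:

```python
import math

def e_value(rr):
    """Minimum confounder strength needed to explain away a risk ratio rr."""
    rr = 1 / rr if rr < 1 else rr  # invert protective ratios first
    return rr + math.sqrt(rr * (rr - 1))

print(e_value(1.8))  # 3.0: a confounder would need RR >= 3 with both
                     # exposure and outcome to fully explain away RR = 1.8
```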

1 Like

Not sure about that. I think that analyzing data using Bayesian methods actually requires less math and simpler concepts. It does require more computation, with less availability of off-the-shelf software (which is rapidly changing thanks to things like the R brms package). The complex multidimensional integrals have been almost completely replaced by numerical solutions that the user doesn’t see, except that they need to run diagnostics on the simulated posterior distributions to check convergence.

2 Likes

Do you think Bayesian methods can be taught to students who have not taken calculus? The vast majority of people who take a statistics or research methods course do not take any significant amount of mathematics, as far as the allied health fields go. So the textbooks remain at the level of elementary algebra and table look-up, or the use of “point and click” analyses.

Doctoral level students may take a few research methods classes taught by psychologists or other practitioners.

A few books attempt to teach calculus and statistics side by side (e.g., Hamming’s Methods of Mathematics or Gemignani’s Calculus and Statistics).

An interesting approach was Edward Nelson’s Radically Elementary Probability Theory – written from the perspective of nonstandard analysis. (PDF)

Absolutely. The only integration that is central to Bayes is the one that normalizes the simple product of the prior and the data likelihood, to yield a posterior that integrates to one. MCMC made this unnecessary, since one can draw samples from the posterior without computing the messy integral in the denominator just described. The key to understanding Bayes is understanding conditional probability.
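A toy illustration of that point: on a grid, the posterior is just prior × likelihood rescaled so the “messy integral” becomes a numerical sum. The data (7 events in 20 patients) and the flat Beta(1, 1) prior are made up:

```python
import numpy as np
from scipy.stats import beta, binom

theta = np.linspace(0.001, 0.999, 999)   # grid over the event probability
prior = beta.pdf(theta, 1, 1)            # flat Beta(1, 1) prior
like = binom.pmf(7, 20, theta)           # likelihood of 7 events in 20 patients
post = prior * like                      # unnormalized posterior
post /= np.trapz(post, theta)            # the normalizing integral, done numerically

print(theta[np.argmax(post)])            # posterior mode, ~0.35
```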

2 Likes

Thanks all for this great discussion.
This recent article estimated the Fragility Index for phase 3 trials supporting drug approvals in oncology. In addition to all the concerns about the use of the FI mentioned in this discussion, I think that computing the FI with Fisher’s exact test (as initially done in Walsh et al.) will bias the fragility index towards 0, since by the end of the trial almost all patients will have experienced an outcome such as progression-free survival (PFS); hence the difference in the percentages of events will be small.

I share some of your concerns about the use of the fragility index, but if one were still to use it in the context of time-to-event outcomes, do you have ideas on how we could compute it? I guess the fact that the data are censored makes things much more complicated than for dichotomous outcomes.

1 Like

Yes, when you do it for survival times it becomes apparent just how tenuous it is, i.e., when you’re forced to decide: which patients who are alive shall we treat as deceased? I don’t mind scepticism, it’s crucial, but the FI will promote scepticism beyond anything that is reasonable:

“Many phase 3 randomised controlled trials supporting FDA-approved anticancer drugs have a low fragility index, challenging confidence for concluding their superiority over control treatments.”

I would reiterate that there are only MDs on that oncology paper. You don’t even need to look at the author list; it’s apparent as soon as you begin reading the findings, and I think this fact has some relevance to the spread of this idea. What could be simpler than calculating the FI for a bunch of studies in some disease area and concluding: we should all be pessimistic?

2 Likes

Right. It’s become a cheap way to get papers (similar to how people looking for a quick and easy paper often get the idea that they can pump out a meta-analysis by finding a recent one and updating it with the one study published since it came out).

I hesitate to entertain this discussion at all for the reasons discussed above, but if one were trying to create an equivalent of the fragility index that better accommodates time-to-event data, one possible option would be something like the following (note: I haven’t thought this all the way through):

Randomly re-assign the value of “treatment” across every combination of participants, compute the p-value for the time-to-event analysis in every scenario, then average the number of participants who were re-assigned from treatment to control over only the scenarios where the “significant” result became “non-significant”… now that I think about it, this can’t be quite right. Maybe someone else can pick up the idea and finish this train of thought, but I’m not sure it’s worth the time and effort, since I’m not really a fan of the FI in the first place.
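In that spirit, here is a rough sketch of the permutation machinery such an approach would need, with a hand-rolled two-group log-rank z-statistic (illustrative only; it implements a plain label-permutation p-value, not the re-assignment averaging described above, which I agree doesn’t quite work as stated; inputs are numpy arrays):

```python
import numpy as np

def logrank_z(time, event, group):
    """Two-group log-rank z-statistic; group is 0/1, event is 1 if observed."""
    num, var = 0.0, 0.0
    for t in np.unique(time[event == 1]):           # distinct event times
        at_risk = time >= t
        n = at_risk.sum()
        n1 = (at_risk & (group == 1)).sum()
        d = ((time == t) & (event == 1)).sum()      # events at t, both arms
        d1 = ((time == t) & (event == 1) & (group == 1)).sum()
        num += d1 - d * n1 / n                      # observed minus expected
        if n > 1:                                   # hypergeometric variance
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return num / np.sqrt(var)

def permutation_p(time, event, group, n_perm=2000, seed=0):
    """Two-sided permutation p-value from shuffling treatment labels."""
    rng = np.random.default_rng(seed)
    observed = abs(logrank_z(time, event, group))
    hits = sum(abs(logrank_z(time, event, rng.permutation(group))) >= observed
               for _ in range(n_perm))
    return hits / n_perm
```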

2 Likes

Sounds a bit like a permutation test. I guess a worst-case-scenario analysis might make sense in some circumstances, e.g., patients lost to follow-up early could be assumed to have died at the time of withdrawal, and that is reminiscent of the FI.

I’d like to propose a measure for those fond of the FI: take away their three most highly cited papers, re-calculate their h-index, and thence conclude that they are unproductive. Then we can dismiss them along with all the drugs they have insinuated are inefficacious according to the FI.

Yeah, the ‘permutation test’ was the parallel that came to mind and made me think something like that is potentially a solution to “how do we get an equivalent of the FI for time-to-event data”.

I’ll repeat myself from earlier in the thread: if people have a problem with “fragile” trial results, what they’re really saying (whether they realize it or not) is that a p=0.05ish result isn’t good enough for them, and rather than bitching about “fragility” they should be calling for more stringent alpha levels (or an equivalently stringent threshold for adoption from Bayesian trials, e.g. very high posterior probability of benefit > some clinically relevant threshold). Of course, this would turn trials that currently require a few hundred patients into trials that require a few thousand patients (or trials that require a few thousand patients into trials that require tens of thousands of patients) and may become cost prohibitive, so they may be inadvertently depriving future patients of potentially promising treatments that simply can’t make it through the development pipeline.
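To put rough numbers on that trade-off: per-arm sample size for a two-sample z-style comparison scales with (z_{1−α/2} + z_{1−β})², so tightening alpha inflates the trial. The effect size and power below are hypothetical:

```python
from scipy.stats import norm

def n_per_arm(delta, alpha, power=0.9):
    """delta = standardized effect size (mean difference / SD)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * (z / delta) ** 2

for a in (0.05, 0.005, 0.0005):
    print(a, round(n_per_arm(0.25, a)))   # ~336, ~535, ~726 per arm
```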

4 Likes

We had a recent journal club review of the FI in oncology RCTs. To nobody’s surprise, I was very skeptical of the FI, its use, and its interpretation.

My understanding is that the frequentist framework does not allow moving events between groups to support a different interpretation of the trial, and I had problems communicating this in the journal club. It felt as if they were mixing up effect sizes and p-values.

I also found this article in the JCE (1), where the authors extended the FI to meta-analyses. What I found particularly interesting was the evaluation of non-significant meta-analyses.

The median fragility index of statistically nonsignificant meta-analyses was 7 (Q1-Q3: 4-14, range 1-102). More than one-third (36.0%) of statistically nonsignificant meta-analyses had a fragility index of 5 or less…

Later they state:
In particular, nonsignificant meta-analyses with more than 1,000 events would need more than six event-status modifications in one or more specific trials to become statistically significant.

It would follow that non-significant results are also fragile, even more so than significant trials, if only six event-status changes are needed. I still have my doubts about whether the FI can be carried over to meta-analyses, given the heterogeneity of trials (patients, outcomes, measurements, etc.).

This would make the FI next to useless, and I would not be comfortable using it in any circumstances to draw any conclusions.

1 Like

:slight_smile: That’s an interesting point. There is another discussion on here about the potential for discarding useful drugs: Which error is more beneficial to the society?

The FI is fascinating to me because of how rapidly it has migrated across disease areas, i.e., within five years? Often statistical fashions are peculiar to a given disease area. I wonder if this is because MDs are pushing the FI and are more widely read than statisticians, who have a penchant for habits, or maybe ideas like the FI are hitchhiking on the reproducibility theme…

2 Likes