I am intrigued by a recent publication in JAMA titled "Evidence of Lack of Treatment Efficacy Derived From Statistically Nonsignificant Results of Randomized Clinical Trials" that proposes interpreting negative trials using likelihood ratios, finding that many negative trials provide substantial evidence for the null hypothesis of no treatment effect. Some of the nice properties of this approach are summarized in an editorial by Roger J. Lewis titled "Revisiting the Analogy Between Clinical Trials and Diagnostic Tests by Interpreting a Negative Trial as a Negative Test for Efficacy". As (primarily) a clinician, I find this approach very intuitive and relatively easy to understand. I would love to participate in a discussion regarding the merits of this approach and, specifically, its relationship to p-value functions and S-values.
@Robert_Matthews has already published a better method of doing this from the Bayesian perspective, using the reported confidence intervals. Matthews' Analysis of Credibility procedure can give a wide range of plausible values for an effect that could justify continued research into a topic, in spite of a failure to achieve "significance" in a set of studies.
From the original link:
Among 169 statistically nonsignificant primary outcome results of randomized trials published in 2021, the hypotheses of lack of effect (null hypothesis) and of clinically meaningful effectiveness (alternate hypothesis) were compared using a likelihood ratio to quantify the strength of support the observed trial findings provide for one hypothesis vs the other; about half (52.1%) yielded a likelihood ratio of more than 100 for the null hypothesis of lack of effect vs the alternate.
The premise of the article amounts to defending a logical fallacy with empirical data.
They just picked 2 arbitrary points on a likelihood function (each of which has precisely zero probability of being true from a strict axiomatic perspective) to defend the idea that nonsignificant studies provide evidence in favor of the null. I can't access the paper because of the paywall, but the use of the term "likelihood ratio" vs. "likelihood function" makes me suspicious this is what was done.
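For concreteness, here is a minimal sketch (in R, with invented numbers, not taken from the paper) of what a two-point likelihood ratio looks like under a normal approximation to a trial's reported estimate and standard error:

```r
## Two-point likelihood ratio under a normal approximation.
## Hypothetical inputs: an observed log hazard ratio of -0.05 (SE 0.10),
## a null of no effect, and an assumed "clinically meaningful" alternative.
theta_hat <- -0.05        # observed log hazard ratio (hypothetical)
se        <-  0.10        # its standard error (hypothetical)
theta_0   <-  0           # null hypothesis: no effect
theta_1   <-  log(0.75)   # alternative: assumed meaningful benefit

## Likelihood of each point hypothesis given the observed estimate
lik_null <- dnorm(theta_hat, mean = theta_0, sd = se)
lik_alt  <- dnorm(theta_hat, mean = theta_1, sd = se)

## A ratio > 1 favors the null over this particular alternative,
## but it is silent about every effect size in between.
lik_null / lik_alt
```

The whole exercise collapses the likelihood function to two points; nothing stops a reader from evaluating it everywhere else.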
It is flawed for the following reasons:
- Coming to conclusive results based on single studies directly contradicts the idea that meta-analysis can increase the power and precision of any test or estimate. Any experiment is a simultaneous test of the experimental procedure in addition to the scientific question. Neyman denied the possibility of drawing firm conclusions in individual cases, and interpreted outputs as decisions for behavior only. AFAIK, even Fisher insisted on replication of experiments that yielded positive tests of significance.
- Their meta-analysis is flawed for all of the reasons @Sander alluded to in that giga-thread on odds ratios. Using empirical data to justify a statistical procedure is, at best, exploratory.
Key quote:
Without some extraordinary considerations, we do not transport estimates for the effect of antihistamines on hives to project effects of chemotherapies on cancers, nor do we combine these disparate effects in meta-analyses.
Likewise, we don't transport likelihoods from heterogeneous studies either.
Related Reading
The discussion and some of the links are worth following up.
Here is an interesting discussion on Andrew Gelman’s blog on 4 interpretations of p-values that is related to this paper. @Sander_Greenland has a number of valuable comments in this thread as well.
The JAMA article has a flaw. To bring evidence for similarity we need a similarity zone for the treatment effect from each study. The authors took the effect size used in the power calculation as the similarity boundary. This is not correct in general; the similarity boundary needs to be smaller than the powered effect size. And the Bayesian approximation to all this works so much better than likelihood ratios.
There is also a key problem in using the two-discrete-point likelihood ratio. For a choice of two effect sizes, such likelihood ratios quantify the relative support for one effect size vs. the other. One effect size may appear to be much more likely, but it could be way off; it's merely better supported than the other point, which is even farther off. In other words, one can have false confidence in the more likely point (strictly, the point that makes the data more likely) when using a relative method such as likelihood ratios. The Bayesian approach does not have this problem and provides much more intuitive results. The paper should have IMHO used a simple Bayesian approximation throughout. Note that I was a reviewer on the paper and detailed all these issues in my review.
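To put numbers on that false-confidence problem, a quick sketch (R, invented numbers, not from the JAMA paper): the two-point ratio can be enormous even when the favored point is itself poorly supported, and plotting the whole likelihood function makes this obvious.

```r
## The two-point LR can hugely favor theta_1 over theta_0 ...
theta_hat <- 0.55; se <- 0.10       # hypothetical estimate and standard error
theta_0   <- 0;    theta_1 <- 0.80  # two candidate effect sizes
dnorm(theta_hat, theta_1, se) / dnorm(theta_hat, theta_0, se)    # ~1.6e5

## ... yet theta_1 sits 2.5 SEs from the estimate. Relative to the
## best-supported value (theta_hat itself), both points fare poorly:
dnorm(theta_hat, theta_1, se) / dnorm(theta_hat, theta_hat, se)  # ~0.04
dnorm(theta_hat, theta_0, se) / dnorm(theta_hat, theta_hat, se)  # ~3e-7

## The full likelihood function shows this at a glance
theta <- seq(-0.2, 1.0, length.out = 400)
plot(theta, dnorm(theta_hat, theta, se) / dnorm(theta_hat, theta_hat, se),
     type = "l", xlab = "treatment effect", ylab = "relative likelihood")
abline(v = c(theta_0, theta_1), lty = 2)
```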
Presumably, the likelihood function for any alternative hypothesis can be computed and statements made such as “The likelihood ratio supports the null hypothesis of no treatment effect compared to a range of plausible smallest meaningful clinical effect sizes”?
Yes, the key being a good choice of threshold for non-trivial clinical effects. But why use a method that is less intuitive and actionable than Bayes?
This subtle point you bring up reminded me of Jeffrey Blume's interval testing procedure, which he calls "second generation p-values". While the procedure deserves a more descriptive name ("p-*" is already overloaded in stats and math), I think it is another useful way to extract value from frequentist results.
Jeffrey D. Blume, Robert A. Greevy, Valerie F. Welty, Jeffrey R. Smith & William D. Dupont (2019) An Introduction to Second-Generation p-Values, The American Statistician, 73:sup1, 157-167 (link)
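For anyone who wants to see the mechanics, here is a minimal sketch of the second-generation p-value calculation as I understand it from the Blume et al. paper: the fraction of the interval estimate that lies inside a pre-specified indifference zone, with very wide intervals reported as inconclusive (1/2). The helper function and the numbers below are mine, for illustration only.

```r
## Second-generation p-value (my reading of Blume et al.):
## overlap fraction of the interval estimate I with the indifference
## zone H0, with a correction so that very wide intervals return 1/2.
sgpv <- function(est_lo, est_hi, null_lo, null_hi) {
  overlap <- max(0, min(est_hi, null_hi) - max(est_lo, null_lo))
  len_I   <- est_hi - est_lo
  len_H0  <- null_hi - null_lo
  (overlap / len_I) * max(len_I / (2 * len_H0), 1)
}

## Hypothetical 95% CIs for a log odds ratio, indifference zone (-0.10, 0.10)
sgpv(-0.08, 0.06, -0.10, 0.10)  # 1: only trivial effects are compatible
sgpv( 0.05, 0.45, -0.10, 0.10)  # 0.125: mostly non-trivial effects
sgpv(-0.60, 0.60, -0.10, 0.10)  # 0.5: interval too wide to say much
```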
I’d strongly prefer a nice clean separation of the actual data, whether it be interval estimates, likelihood functions, or “confidence” distributions, from the storytelling in the discussion and methods.
Related thread
Great points, very much agree (haven’t read the paper in detail yet though).
What's annoying/frustrating is that the habit of referring to trials with a "non-significant" primary outcome is impossible to kill. Here it is again in JAMA.
We once explored a similar likelihood-based approach (second generation p-values) to interpret the pivotal RCTs in kidney cancer. It really did not catch on for various reasons, one being the need to specify a null hypothesis interval. We will still sometimes toy around with similar concepts as needed (see “indifference zone” here).
Concurrently in 2019 we began exploring the use of S-values. This worked far better. Much more intuitive, actionable, and clinicians, biostatisticians, and other researchers love it.
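For readers who haven't run into them, the S-value is just a log transform of the p-value into bits of information against the test hypothesis; a one-line sketch with illustrative values (not numbers from our work):

```r
## S-value (surprisal): s = -log2(p), the bits of information against
## the test hypothesis carried by a p-value (illustrative values only)
p <- c(0.25, 0.05, 0.005)
-log2(p)   # 2, ~4.3, ~7.6 bits
```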
Devil's advocate: suppose that in the future I publish a pragmatic RCT, having a priori defined and justified a minimally clinically important treatment effect given cost and resource constraints, and the trial turns out negative. Why should I not present the trial using this framework?
I think this comment addresses your question:
Why can’t you just present the likelihood function or p value curve (which can be used to extract the likelihood function) for all \alpha levels itself? What is the attraction of dichotomizing the likelihood function? State your interpretation, but present all of the data so the reader can judge for him/herself.
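A p-value curve is cheap to produce from just the reported estimate and standard error; a minimal sketch with invented numbers (not from any particular trial):

```r
## Two-sided p-value function traced over a grid of hypothesized effects,
## computed from a reported estimate and SE (hypothetical numbers).
theta_hat <- 0.15; se <- 0.12                 # e.g., a log hazard ratio and its SE
theta     <- seq(-0.4, 0.7, length.out = 400) # candidate effect sizes
pfun      <- 2 * pnorm(-abs(theta_hat - theta) / se)

plot(theta, pfun, type = "l",
     xlab = "hypothesized treatment effect", ylab = "two-sided p-value")
abline(h = 0.05, lty = 3)   # the conventional cutoff, shown only for reference
```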
Thank you. I suppose my answer was just poorly worded; I personally would always present the likelihood function and describe my interpretation in the main text based on an already-defined minimally important effect size of interest. Having said this, I am trying to understand why the likelihood function in particular is better or worse than a p-value function or the Bayesian approach that you recommended in the first comment (which I am slowly working through; thank you for the suggestion).
There are 3 different issues involved here:
- In terms of the JAMA likelihood ratio approach, my assessment from the abstract is that the authors were attempting to justify the fallacy that a nonsignificant result, p(\hat\theta \mid \theta_0) > 0.05, implies no treatment effect. The reverse-Bayes approach described by Matthews illustrates that there is a distribution of plausible effects, and there usually isn't enough information in a single study to reach the definitive conclusion that this fallacious use of hypothesis testing implies.
- In terms of communicating the information from a single experiment, the likelihood function may be preferable in the sense that it is easy to use in a meta-analysis (see the sketch after this list). A complete p-value curve can be used to compute a likelihood function, and in some cases it can be preferable to combine entire confidence curves directly. I think likelihood functions and p-value curves (or the equivalent "confidence distributions") are complementary. Importantly, if readers disagree with your choice of MCID, they can redo the analysis from your summary statistics if these are presented as likelihood functions and confidence distributions.
- There are different perspectives on data analysis in statistics. While there are strong normative arguments in favor of a Bayesian approach, the point of science is to have data collection procedures that lead honest and principled scholars to converge on the true state of nature, at least temporarily ending dispute on a point of fact. Likelihood ratios and confidence distributions are a compromise that allows Bayesians and Frequentists to communicate productively.
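As promised above, a sketch of the meta-analytic combination (R, invented per-study estimates; a normal approximation to each study's likelihood is assumed, not something the studies report):

```r
## Combining normal-approximation log-likelihoods from several studies.
est <- c(-0.10, 0.05, -0.20)   # hypothetical per-study log odds ratios
se  <- c( 0.15, 0.20,  0.25)   # their standard errors

theta  <- seq(-0.8, 0.5, length.out = 400)
loglik <- sapply(theta, function(t) sum(dnorm(est, mean = t, sd = se, log = TRUE)))

## Pooled relative likelihood, scaled so its maximum is 1; under this
## approximation the maximizer is the inverse-variance weighted mean.
rel_lik <- exp(loglik - max(loglik))
plot(theta, rel_lik, type = "l",
     xlab = "treatment effect (log OR)", ylab = "pooled relative likelihood")
abline(v = sum(est / se^2) / sum(1 / se^2), lty = 2)
```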
Schweder, T., & Hjort, N. L. (2013, August). Integrating confidence intervals, likelihoods and confidence distributions. In Proceedings of the 59th World Statistics Congress, 25-30 August 2013, Hong Kong, Amsterdam International Statistical Institute (Vol. 1, pp. 277-282).
Take a look at the R package for meta-analysis based on confidence distributions:
To reiterate some of the thread, presentation of the full likelihood function (as a function of the unknown treatment difference) is a great idea and shows the relative compatibility of possible effects with the data. The likelihood function is equivalent to a Bayesian posterior probability distribution under a non-informative (flat) prior distribution on a specific parameter scale. Since all values of the effect are not equally likely, this is not so reasonable. For example, we know that therapies in chronic disease are not curative (except maybe in hepatitis C), so it is not reasonable to allow the odds ratio to be near zero, but it would be reasonable to put a normal prior on the log odds ratio that makes it unlikely for the OR to be < 0.25. So going full Bayes is still recommended, and is easier to interpret.
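A minimal conjugate-normal sketch of that kind of prior (numbers are mine, for illustration): a normal prior on the log odds ratio, centered at no effect and scaled so that Pr(OR < 0.25) is about 2.5%, combined with a normal approximation to one trial's likelihood.

```r
## Sceptical prior on the log odds ratio: mean 0, SD chosen so that
## Pr(OR < 0.25) is roughly 2.5% (illustrative choice).
prior_mean <- 0
prior_sd   <- log(4) / qnorm(0.975)           # ~0.71
pnorm(log(0.25), prior_mean, prior_sd)        # check: ~0.025

## One trial's result under a normal approximation (hypothetical numbers)
theta_hat <- -0.22; se <- 0.18                # log OR and its standard error

## Conjugate normal update: precision-weighted average
post_var  <- 1 / (1 / prior_sd^2 + 1 / se^2)
post_mean <- post_var * (prior_mean / prior_sd^2 + theta_hat / se^2)

## Quantities a clinician can act on:
pnorm(0,         post_mean, sqrt(post_var))   # Pr(OR < 1), any benefit
pnorm(log(0.80), post_mean, sqrt(post_var))   # Pr(OR < 0.8), non-trivial benefit
```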
Besides the JAMA article misusing likelihood ratios, the bigger problem is the failure of clinical trial designers to use MCIDs in power calculations, thereby frequently launching futile clinical trials. The JAMA article assumed that the effects used in power calculations are MCIDs, which is probably true about a third of the time.
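To see why this matters, a quick sketch with base R's power.t.test (invented standardized effect sizes; the point is only the ratio of sample sizes):

```r
## Per-group sample size at 90% power, two-arm t-test, outcome SD = 1:
## powering for an optimistic effect vs. a (smaller) MCID.
power.t.test(delta = 0.50, sd = 1, power = 0.90)$n  # optimistic effect
power.t.test(delta = 0.20, sd = 1, power = 0.90)$n  # MCID: roughly 6x more patients
```

A trial sized for the optimistic effect is badly underpowered for the MCID, which is exactly the situation that produces "negative" trials with wide intervals.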