Scott Berry examines a recent RCT that was controversial because of its adjustments for multiple tests, which likely resulted in the rejection of a potentially beneficial treatment.
I guess replication is still essential for confirming evidence.
Lovely demonstration of the mess that frequentist stats leads us into!
This sort of thing is all too common.
I don't think it's entirely fair to characterize all of "frequentist" stats like this. This is the result of institutionalizing certain interpretations of frequentist procedures, interpretations that have long since veered off into misinterpretations.
Now if by "frequentist" you mean "fixed-level \alpha testing", then I agree wholeheartedly.
Note: None of the following should be interpreted as a defense of p value methods in general. I just point out that
- they are reasonable shortcuts in certain areas and at certain stages of scientific research, and
- they are in widespread use, so one needs to know how to interpret them, and to recognize when the reports cannot possibly be correct or conflict with the narrative of the report.
The problem, as far as I can tell, is that all statistical theory recommends is that, in order to keep intact the relationship between the admissible frequentist procedure that reports a p value and the Bayesian posterior, p values from the frequentist method need to be adjusted for multiple looks at the data. I came to that conclusion after comparing Royall's discussion of the 1/k bound on observing a likelihood ratio favoring B over A when A is true with Tippett's minimum p value procedure:
\alpha^* = 1 - (1 - \alpha)^{1/k}
where \alpha^* is the per-study error (or assertion) probability, conditional on the reference null that all k studies came from the same distribution of no effect, and \alpha is the error (or assertion) probability one would use for a single study.
There are more formal, and probably better, ways of proving this. A fixed error rate over the entire procedure can be maintained in a multitude of ways.
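As a minimal numerical sketch of that adjustment (my own illustration; the 0.05 and k = 3 below are assumptions, not numbers from the post):

```python
# Tippett-style per-study threshold: alpha* = 1 - (1 - alpha)^(1/k).
# Rejecting when the minimum of k independent p values falls below alpha*
# keeps the overall error rate under the joint reference null at alpha.

def per_study_alpha(alpha: float, k: int) -> float:
    return 1 - (1 - alpha) ** (1 / k)

alpha, k = 0.05, 3
a_star = per_study_alpha(alpha, k)   # ~0.0170 per study
overall = 1 - (1 - a_star) ** k      # recovers 0.05 for the whole procedure
print(a_star, overall)
```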
There needs to be a movement that revisits current scientific norms in medical research in light of modern computing power, with a decision-theoretic approach that incorporates multiple stakeholders rather than a single one who gets to specify the cost function of a procedure, a choice that currently escapes examination.
References
Greenland, S. (2021). Analysis goals, error-cost sensitivity, and analysis hacking: Essential considerations in hypothesis testing and multiple comparisons. Paediatric and Perinatal Epidemiology, 35(1), 8-23.
Greenland, S., & Hofman, A. (2019). Multiple comparisons controversies are about context and costs, not frequentism versus Bayesianism. European Journal of Epidemiology, 34, 801-808.
Berry, D. A., & Hochberg, Y. (1999). Bayesian perspectives on multiple comparisons. Journal of Statistical Planning and Inference, 82(1-2), 215-227. https://www.sciencedirect.com/science/article/abs/pii/S0378375899000440?via=ihub
I really don't think it's fair to blame frequentist stats (or significance testing), as this is actually just an incorrect conclusion. If you're Bayesian, this would be like blaming Bayesian statistics when the same thing occurs using a Bayes factor.
I agree up to a point, but what would a correct conclusion look like? "The data were not sufficiently extreme to reject the null hypothesis of no difference, given the arbitrary criterion for rejection that we set."
That's not useful.
I'm personally a fan of the rejection ratio perspective taken by Berger, Bayarri, et al. in the paper cited below. Clearly, an estimation framework has advantages, but this is a good start to reforming practice.
Simply allowing people to recognize that \alpha is a subjective input, related to prior beliefs and to the power of the experiment, seems like it would be a big help.
Bayarri, M. J., Benjamin, D. J., Berger, J. O., & Sellke, T. M. (2016). Rejection odds and rejection ratios: A proposal for statistical practice in testing hypotheses. Journal of Mathematical Psychology, 72. link
Abstract: Much of science is (rightly or wrongly) driven by hypothesis testing. Even in situations where the hypothesis testing paradigm is correct, the common practice of basing inferences solely on p-values has been under intense criticism for over 50 years. We propose, as an alternative, the use of the odds of a correct rejection of the null hypothesis to incorrect rejection. Both pre-experimental versions (involving the power and Type I error) and post-experimental versions (depending on the actual data) are considered. Implementations are provided that range from depending only on the p-value to consideration of full Bayesian analysis. A surprise is that all implementations, even the full Bayesian analysis, have complete frequentist justification. Versions of our proposal can be implemented that require only minor modifications to existing practices yet overcome some of their most severe shortcomings.
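As I read the abstract, the pre-experimental version is simply the probability of a correct rejection relative to that of an incorrect one. Writing \pi_0 and \pi_1 for the prior probabilities of the null and the alternative (my notation, not necessarily the paper's), this works out to
O_{pre} = (\pi_1 / \pi_0) \times (1 - \beta) / \alpha, i.e., prior odds \times power / \alpha,
so the ratio of power to \alpha is what converts prior odds into rejection odds.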
This paper, along with Michael Evans's paper on relative belief below, teaches how to interpret frequentist reports from a Bayesian point of view.
Evans, M. (2016). Measuring statistical evidence using relative belief. Computational and Structural Biotechnology Journal, 14, 91-96. link
A fundamental concern of a theory of statistical inference is how one should measure statistical evidence. Certainly the words "statistical evidence," or perhaps just "evidence," are much used in statistical contexts. It is fair to say, however, that the precise characterization of this concept is somewhat elusive. Our goal here is to provide a definition of how to measure statistical evidence for any particular statistical problem. Since evidence is what causes beliefs to change, it is proposed to measure evidence by the amount beliefs change from a priori to a posteriori. As such, our definition involves prior beliefs and this raises issues of subjectivity versus objectivity in statistical analyses. This is dealt with through a principle requiring the falsifiability of any ingredients to a statistical analysis. These concerns lead to checking for prior-data conflict and measuring the a priori bias in a prior.
A bit of algebra shows a direct relationship between the rejection ratio of Berger et al. and the relative belief ratio of Evans: the frequentist interpretation of the likelihood ratio appears as a ratio of error probabilities. A little more algebra shows how to derive your personal \alpha given your priors and what you know about the data collection process.
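For a concrete sketch of "deriving your personal \alpha" under the pre-experimental rejection-odds reading above (the prior odds, power, and target odds below are assumed inputs for illustration, not values from any of the papers):

```python
def personal_alpha(prior_odds: float, power: float, target_rejection_odds: float) -> float:
    """Largest alpha at which prior_odds * power / alpha still meets the
    desired odds of a correct rejection to an incorrect rejection."""
    return prior_odds * power / target_rejection_odds

# Assumed illustration: even prior odds, 80% power, and a demand that
# rejections be 20 times more likely to be correct than incorrect.
print(personal_alpha(prior_odds=1.0, power=0.80, target_rejection_odds=20.0))  # 0.04
```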
If you add Sander Greenland's papers on cost functions and multiple comparisons, you don't even need to interpret the rejection ratio in terms of a prior. It can be derived purely from cost functions, which subsume any discussion of prior probability independent of cost, a separation that really can't be made anyway.
Bock, M. E. (2004). Conversations with Herman Rubin. Lecture Notes-Monograph Series, 408-417. (link)
Quoting Herman Rubin on Bayesian Robustness:
One of the difficulties of Bayesian analysis is coming up with a good prior and loss function. (I have been saying for years that the prior and the loss cannot be separated. ...)
You could say, following Neyman, that no evidence against the test hypothesis was found. I disagree that this isn't useful. It will be helpful for future research to know that this test failed. If there really was an effect, then we can try to figure out what was wrong with the experiment's design.
You mention an "arbitrary criterion", and I wonder whether you bring this up to highlight something negative. If you are a Bayesian, I would imagine you can acknowledge your priors are just as arbitrary.
This isn't even correct when you consider that "failures to reject" can be aggregated in a meta-analysis, even on the basis of p values alone. It is precisely this decision interpretation that has been causing the problems documented for 50+ years now.
To me, Greenland's response shows a preference against dichotomous interpretations (which is fine by me to have as a preference). But I disagree with you that what I said is incorrect.
You said above regarding studies that fail to reject the null hypothesis:
Fisher did not agree with that interpretation, and for good reason.
If you have 3 studies with p = 0.06 (two-sided), all with the same sign, do you still have no evidence or information against the hypothesis of 0?
Can you point me to where Fisher did not agree with that interpretation? Fisher's interpretation of p-values is different from N-P's, right? I wouldn't be surprised by his disagreement.
To your question, I think it's crucial to distinguish between the formal outcome of a frequentist hypothesis test and one's personal scientific assessment. I believe you're asking me about the latter. In this case of 3 studies failing to reject, I don't see a problem with thinking there is some credence toward that hypothesis.
Neyman advised setting different \alpha levels for different experiments. His test also formally used a likelihood ratio, not a p-value.
Neyman, J. (1977). Frequentist probability and frequentist statistics. Synthese, 36(1), 97-131. PDF
The elements common to all the situations typified by situation 5, will be: (1) a hypothesis Ht to be tested against an alternative Ha and (2) a subjective appraisal of the relative importance of the two kinds of error,
This gets overshadowed by the idea that 0.05 is some "objective" line of demarcation, which is total nonsense and irrational. Neyman was concerned with rules for rational behavior, not with reasoning under uncertainty as Fisher was.
Fisher's closely related, but distinct, procedure specifies only one hypothesis and a sample size. He noted that p-values could be combined to detect departures from the reference null even when there was no clear signal in any single study.
He was more willing to consider that the mathematical assumptions specifying the data generation process may not be entirely correct, reducing the ability to detect a signal at the level of an individual study but not eliminating the information contained in it.
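To make the combination point concrete with the running example of three independent studies, each reporting a two-sided p of 0.06, here is a minimal sketch of Fisher's combined test (my own illustration; it assumes the studies are independent):

```python
from math import log
from scipy.stats import chi2

p_values = [0.06, 0.06, 0.06]              # the thread's running example
stat = -2 * sum(log(p) for p in p_values)  # ~16.9; chi-squared with 2k df under the joint null
combined_p = chi2.sf(stat, df=2 * len(p_values))
print(combined_p)  # ~0.01: each study "fails to reject" at 0.05, yet the combination does not
```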
The following has a quote from Fisher on combined tests:
On information grounds alone, a two-sided p value ignores the sign; since all three estimates had the same sign, we can halve each reported p value to 0.03.
Based on the base-2 log transformation of p values and the assumption that the statistics from the reported studies came from a N(0,1) distribution, we have 3 \times -\log_2(0.03) \approx 15 bits of information against that hypothesis.
Alternatively, we could say 15 bits of information are lost by using the null reference model to compress the observed data. That is a huge amount of information loss. For comparison, the Higgs boson discovery in physics reported about 22 bits of information against its reference null.
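A quick check of that arithmetic (my own sketch; the 5-sigma figure used for the Higgs announcement is my assumption, included only to reproduce the roughly 22 bits quoted above):

```python
from math import log2
from scipy.stats import norm

bits_per_study = -log2(0.03)   # ~5.06 bits of information against the reference null
print(3 * bits_per_study)      # ~15.2 bits across the three studies

p_five_sigma = norm.sf(5)      # one-sided p for a 5-sigma result, ~2.9e-7
print(-log2(p_five_sigma))     # ~21.7, i.e. roughly 22 bits
```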
It is not correct to say that any study that "fails to reject" contains "no evidence" (i.e., no information).
I would imagine you can acknowledge your priors are just as arbitrary.
Er, absolutely not!
Regarding the last 10 or so posts, I shudder whenever I see hypotheses tested. I want evidence for clinically meaningful assertions, not tests of a straw man.
You can use a p-value instead of a CV (critical value) if you use it for N-P tests, where it serves as a proxy for the CV. That is different from how it's used with Fisherian tests. I think we are at an impasse here.
Here is my reference for non-rejection being taken as "no evidence against H found":
Neyman, J. (1976). Tests of statistical hypotheses and their use in studies of natural phenomena. Communications in Statistics - Theory and Methods, 5(8), 737-751.
"Fisher did not agree with that interpretation, and for good reason."
Fisher 1926
"...we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance. The very high odds sometimes claimed for experimental results should usually be discounted, for inaccurate methods of estimating error have far more influence than has the particular standard of significance chosen."
Do you have an argument for Bayesian priors not being arbitrary while frequentist alpha levels are?
N-P is a theory of decision: to act "as if" there were no evidence in the observed data. Using decision procedures as evidence metrics entails having the subjective evaluation of the two types of error decided by the experimenter. (From Neyman's own words.)
Why do you complain about a visible prior, when your decision procedure has a hidden utility function?
Using the numerical example (three two-sided p values of 0.06, or 0.03 one-sided), do you really think it is a good idea to act as if the 3 experiments came from the same distribution of zero effect and lose 15 bits of information? This denies the ability of meta-analyses to increase power to detect effects that might not have been evident in the individual studies.
Greenland, S. (2023). Divergence versus decision P-values: A distinction worth making in theory and keeping in practice: Or, how divergence P-values measure evidence even when decision P-values do not. Scandinavian Journal of Statistics, 50(1), 54-88. link
"N-P is a theory of decision: to act 'as if' there were no evidence in the observed data. Using decision procedures as evidence metrics entails having the subjective evaluation of the two types of error decided by the experimenter. (From Neyman's own words.)"
I have Neyman's book as a PDF. I don't feel like you need to keep adding information that isn't necessary, unless you're correcting me on something I've said, which isn't clear to me.
I'm not sure who complained about Bayesian priors, so I'm not sure who this is for. I don't recall giving you my decision procedure either.
I'm aware of Greenland's paper. Still haven't finished it yet.