A Statistician Reads JAMA

Scott Berry examines a recent RCT that was controversial due to its adjustments for multiple tests, which likely resulted in the rejection of a potentially beneficial treatment.

7 Likes

I guess replication is still essential for confirming evidence.

Lovely demonstration of the mess that frequentist stats leads us into!
This sort of thing is all too common.

2 Likes

I don’t think it is entirely fair to characterize all of “frequentist” stats like this. This is the result of institutionalizing certain interpretations of frequentist procedures that have long since veered off into misinterpretations.

Now if by “frequentist” you mean “fixed level \alpha testing”, then I agree wholeheartedly.

Note: None of the following should be interpreted as any defense of p value methods in general. I just point out that

  1. they are reasonable shortcuts in certain areas and certain stages of scientific research and
  2. they are in widespread use, so one needs to know how to interpret them, and to know when the reports cannot possibly be correct or conflict with the narrative of the report.

The problem, as far as I can tell, is this: to keep the relationship between an admissible frequentist procedure that reports a p value and the Bayesian posterior intact, statistical theory recommends that p values be adjusted for multiple looks at the data. I came to that conclusion after comparing Royall’s discussion of the 1/k bound on the probability of observing a likelihood ratio of k or more favoring B over A when A is true with Tippett’s minimum p value procedure:

\alpha^* = 1 - (1 - \alpha)^{1/k}

where \alpha^* is the per-study error (or assertion) probability, derived under the reference null that all k studies came from the same distribution of no effect, and \alpha is the error (or assertion) probability one would accept for a single study (which becomes the overall level of the combined procedure).

There are more formal, and probably better, ways of proving this; maintaining a fixed error rate for the entire procedure can be done in a multitude of ways.
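To make the scaling concrete, here is a minimal sketch (my own illustration, not taken from any of the references below) of how the per-study threshold shrinks as the number of looks k grows:

```python
# Per-study threshold alpha* for Tippett's minimum-p procedure:
# reject the reference null ("no effect in any of the k studies") when
# min(p_1, ..., p_k) < alpha*, keeping the overall error (assertion)
# probability at alpha.

def tippett_threshold(alpha: float, k: int) -> float:
    """Per-study threshold: alpha* = 1 - (1 - alpha)**(1/k)."""
    return 1 - (1 - alpha) ** (1 / k)

if __name__ == "__main__":
    alpha = 0.05  # level one would accept for a single study
    for k in (1, 2, 3, 5, 10):
        print(f"k = {k:2d}  alpha* = {tippett_threshold(alpha, k):.4f}")
    # k = 1 -> 0.0500, k = 3 -> 0.0170, k = 10 -> 0.0051
```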

There needs to be a movement that revisits current scientific norms in medical research in light of modern computing power, together with a decision-theoretic approach that incorporates multiple stakeholders rather than a single party who gets to specify the cost function of the procedure, a choice that currently escapes examination.

References

Greenland, S. (2021). Analysis goals, error‐cost sensitivity, and analysis hacking: Essential considerations in hypothesis testing and multiple comparisons. Paediatric and perinatal epidemiology, 35(1), 8-23.

Greenland, S., & Hofman, A. (2019). Multiple comparisons controversies are about context and costs, not frequentism versus Bayesianism. European journal of epidemiology, 34, 801-808.

Berry, D. A., & Hochberg, Y. (1999). Bayesian perspectives on multiple comparisons. Journal of Statistical Planning and Inference, 82(1-2), 215-227. https://www.sciencedirect.com/science/article/abs/pii/S0378375899000440?via=ihub

4 Likes

You’re right - I should have said “significance testing.”

3 Likes

I really don’t think it’s fair to blame Freq Stats (or Significance Testing), as the problem is actually just an incorrect conclusion drawn from the result. If you’re a Bayesian, this would be like blaming Bayesian Statistics when the same thing occurs using a Bayes Factor.

I agree to a point, but what would a correct conclusion look like? “The data were not sufficiently extreme to reject the null hypothesis of no difference, given the arbitrary criterion for rejection that we set.”
That’s not useful.

I’m personally a fan of the rejection ratio perspective taken by Berger, Bayarri, et al. in the paper below. Clearly, an estimation framework has advantages, but this is a good start to reforming practice.

Simply getting people to recognize that \alpha is a subjective input related to prior beliefs and the power of the experiment seems like it would be a big help.

Bayarri, M. J., Benjamin, D. J., Berger, J. O., & Sellke, T. M. (2016). Rejection odds and rejection ratios: A proposal for statistical practice in testing hypotheses. Journal of Mathematical Psychology, 72. link

Abstract: Much of science is (rightly or wrongly) driven by hypothesis testing. Even in situations where the hypothesis testing paradigm is correct, the common practice of basing inferences solely on p-values has been under intense criticism for over 50 years. We propose, as an alternative, the use of the odds of a correct rejection of the null hypothesis to incorrect rejection. Both pre-experimental versions (involving the power and Type I error) and post-experimental versions (depending on the actual data) are considered. Implementations are provided that range from depending only on the p-value to consideration of full Bayesian analysis. A surprise is that all implementations–even the full Bayesian analysis–have complete frequentist justification. Versions of our proposal can be implemented that require only minor modifications to existing practices yet overcome some of their most severe shortcomings.

This paper, along with Michael Evans’s paper on relative belief below, teaches how to interpret frequentist reports from a Bayesian point of view.

Evans, M. (2016). Measuring statistical evidence using relative belief. Computational and structural biotechnology journal, 14, 91-96. link

A fundamental concern of a theory of statistical inference is how one should measure statistical evidence. Certainly the words “statistical evidence,” or perhaps just “evidence,” are much used in statistical contexts. It is fair to say, however, that the precise characterization of this concept is somewhat elusive. Our goal here is to provide a definition of how to measure statistical evidence for any particular statistical problem. Since evidence is what causes beliefs to change, it is proposed to measure evidence by the amount beliefs change from a priori to a posteriori. As such, our definition involves prior beliefs and this raises issues of subjectivity versus objectivity in statistical analyses. This is dealt with through a principle requiring the falsifiability of any ingredients to a statistical analysis. These concerns lead to checking for prior-data conflict and measuring the a priori bias in a prior.

A bit of algebra shows a direct relationship between the Rejection Ratio by Berger, and the Relative Belief Ratio by Evans. The frequentist interpretation of the likelihood ratio appears as a ratio of error probabilities. A little bit more algebra shows how to derive your personal \alpha given your priors, and what you know about the data collection process.
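As a rough sketch of that algebra (my own notation; the prior odds, power, and target odds below are made-up inputs, not values from either paper): as I read Bayarri et al., the pre-experimental rejection odds are the prior odds of the alternative multiplied by power/\alpha, so fixing a target rejection odds and the design’s power pins down a “personal” \alpha.

```python
# Sketch of the pre-experimental rejection odds described in Bayarri et al. (2016):
#   O_pre = prior_odds * power / alpha
# (odds that a rejection is correct rather than a Type I error), and the
# "personal" alpha implied by a target value of O_pre. Inputs are illustrative.

def rejection_odds(prior_odds: float, power: float, alpha: float) -> float:
    """Pre-experimental odds of a correct versus an incorrect rejection."""
    return prior_odds * power / alpha

def personal_alpha(prior_odds: float, power: float, target_odds: float) -> float:
    """Alpha that delivers the target rejection odds for the given prior and power."""
    return prior_odds * power / target_odds

if __name__ == "__main__":
    prior_odds = 0.5  # alternative judged half as plausible as the null (hypothetical)
    power = 0.80      # design power against the alternative of interest (hypothetical)
    print(rejection_odds(prior_odds, power, alpha=0.05))      # 8.0 : 1
    print(personal_alpha(prior_odds, power, target_odds=20))  # 0.02
```

With a skeptical prior or a low-powered design, the implied \alpha drops well below 0.05, which is the sense in which \alpha is tied to prior belief and power.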

If you add Sander Greenland’s papers on cost functions and multiple comparisons, you don’t even need to interpret the rejection ratio as a prior. It can be derived purely in terms of cost functions, which subsume the prior; discussing prior probability independently of cost really can’t be done.

Bock, M. E. (2004). Conversations with Herman Rubin. Lecture Notes-Monograph Series, 408-417. (link)

Quoting Herman Rubin on Bayesian Robustness:

One of the difficulties of Bayesian analysis is coming up with a good prior and loss function. (I have been saying for years that the prior and the loss cannot be separated. …)

You could say, following Neyman, that no evidence against the test hypothesis was found. I disagree that this isn’t useful: it will be helpful for future research to know that this test failed, and if there really was an effect, we can try to figure out what was wrong with the experiment’s design.
You mention an “arbitrary criterion”, and I wonder if you bring this up to highlight something negative. If you are a Bayesian, I would imagine you can acknowledge your priors are just as arbitrary.

This isn’t even correct when you consider that “failures to reject” can be aggregated in a meta-analysis, even on the basis of p values. It is precisely this decision interpretation that has been causing the problems documented for 50+ years now.

To me, Greenland’s response is showing a preference against Dichotomous Interpretations (which is fine by me to have as a preference). But, I disagree with you that what I said is incorrect.

You said above regarding studies that fail to reject the null hypothesis:

Fisher did not agree with that interpretation, and for good reason.

If you have 3 studies with p = 0.06 (two sided), all with effects of the same sign, do you still have no evidence or information against the hypothesis of 0?

Can you point me to where Fisher disagreed with that interpretation? Fisher’s interpretation of p-values is different from N-P’s, right? I wouldn’t be surprised by his disagreement.
To your question, I think it’s crucial to distinguish between the formal outcome of a frequentist hypothesis test and one’s personal scientific assessment. I believe you’re asking me about the latter. In the case of 3 studies failing to reject, I don’t see a problem with thinking there is some credence toward that hypothesis.

Neyman advised setting different \alpha levels for different experiments. His test also formally used a likelihood ratio, not a p-value.

Neyman, J. (1977). Frequentist probability and frequentist statistics. Synthese, 36(1), 97-131. PDF

The elements common to all the situations typified by situation 5, will be: (1) a hypothesis Ht to be tested against an alternative Ha and (2) a subjective appraisal of the relative importance of the two kinds of error,

This gets overshadowed by the idea that 0.05 is some “objective” line of demarcation, which is total nonsense and irrational. He was concerned with rules for rational behavior, not with reasoning under uncertainty as Fisher was.

Fisher’s closely related but distinct procedure specifies only one hypothesis and a sample size. He noted that p-values could be combined to detect departures from the reference null, even if there was no clear signal in any single study.

He was more willing to consider that the mathematical assumptions that specify the data generation process may not be entirely correct, reducing the ability to detect a signal at the individual study level, but not eliminating the information contained in the study.
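As an illustration of the combining idea (my own sketch using Fisher’s combined probability test, not a quote or example from Fisher), three independent one-sided p values of 0.03 pool into strong evidence against the reference null:

```python
# Fisher's combined probability test for the running example: three
# (assumed independent) one-sided p values of 0.03. The statistic
# -2 * sum(ln p_i) is referred to a chi-square distribution with 2k df.
from math import log
from scipy.stats import chi2  # assumes SciPy is available

p_values = [0.03, 0.03, 0.03]
statistic = -2 * sum(log(p) for p in p_values)          # ~21.0
combined_p = chi2.sf(statistic, df=2 * len(p_values))   # ~0.0018

print(f"chi-square = {statistic:.2f}, combined p = {combined_p:.4f}")
# No single study is "significant" two-sided at 0.05, yet the combined
# departure from "no effect in any study" is clear.
```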

The following has a quote from Fisher on combined tests:

On information grounds alone, a two-sided p value ignores the sign; since all three results have the same sign, we can halve the reported p value to 0.03.

Based on the base-2 log transformation of p values and the assumption that the statistics from the reported studies came from a N(0,1) distribution, we have 3 \times -\log_2(0.03) \approx 15 bits of information against that hypothesis.

Alternatively, we could say 15 bits of information is lost by using the null reference model to compress the observed data. That is a huge amount of information loss. The Higgs Boson discovery in physics reported 22 bits of information against the reference null.
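A small sketch of that arithmetic; the Higgs figure assumes the conventional one-sided five-sigma p value of roughly 3 \times 10^{-7}, which is my added assumption rather than something stated above:

```python
# Shannon information ("S-value") carried by a p value against the reference
# model: s = -log2(p), in bits. Adding bits across studies assumes independence.
from math import log2

def s_value(p: float) -> float:
    """Bits of information against the reference (null) model."""
    return -log2(p)

one_sided_p = 0.03
print(f"{s_value(one_sided_p):.2f} bits per study")        # ~5.06
print(f"{3 * s_value(one_sided_p):.1f} bits in total")      # ~15.2

higgs_p = 2.9e-7  # approximate one-sided p for a 5-sigma result (assumption)
print(f"{s_value(higgs_p):.1f} bits for the Higgs result")  # ~21.7, i.e. about 22
```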

It is not correct to say that any study that “fails to reject” contains “no evidence” (i.e., no information).

I would imagine you can acknowledge your priors are just as arbitrary.

Er, absolutely not!

Regarding the last 10 or so posts, I shudder whenever I see hypotheses tested. I want evidence for clinically meaningful assertions, not tests of a straw man.

2 Likes

You can use a p-value instead of a critical value if you use it for N-P tests, where it serves as a proxy. That is different from how it’s used with Fisherian Tests. I think we are at an impasse here.
Here is my reference for non-rejection being taken as no evidence against H found:
Neyman J. Tests of statistical hypotheses and their use in studies of natural phenomena. Commun Stat Theory Methods. 1976;5(8):737-751.

“Fisher did not agree with that interpretation, and for good reason.”
Fisher (1926):
"…we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance. The very high odds sometimes claimed for experimental results should usually be discounted, for inaccurate methods of estimating error have far more influence than has the particular standard of significance chosen."

Do you have an argument for Bayesian priors not being arbitrary, while frequentist \alpha levels are?

NP is a theory of decision: to act “as if” there were no evidence against the hypothesis in the observed data. Using decision procedures as evidence metrics entails having the subjective evaluation of the two types of error decided by the experimenter. (From Neyman’s own words.)

Why do you complain about a visible prior, when your decision procedure has a hidden utility function?

Using the numerical example (three two-sided p values of 0.06, or 0.03 one-sided), do you really think it is a good idea to act as if the 3 experiments came from the same distribution of zero effect and lose 15 bits of information? This denies the ability of meta-analyses to increase power to detect things that might not have been evident in the individual studies.

Greenland, S. (2023). Divergence versus decision P‐values: A distinction worth making in theory and keeping in practice: Or, how divergence P‐values measure evidence even when decision P‐values do not. Scandinavian Journal of Statistics, 50(1), 54-88. link

" NP is a theory of decision – to act “as if” there were no evidence in the observed data. Using decision procedures as evidence metrics entails having the subjective evaluation of the two types of error decided by the experimenter. (From Neyman’s own words.)"

I have Neyman’s book as a PDF. I don’t feel like you need to keep adding information that isn’t necessary, unless you’re correcting me on something I’ve said, which isn’t clear to me.

I’m not sure who complained about Bayesian Priors, so I’m not sure who this is for. I don’t recall giving you my decision procedure either.

I’m aware of Greenland’s paper. Still haven’t finished it yet.