We make a distinction between Fisher's approach and the Neyman-Pearson (NP) approach (fix α, e.g., 0.05, control the type I error, and maximize power subject to that constraint). However, after reading the arguments between Fisher and NP, I tend to agree with Fisher: beyond sample size calculations and power, there is no real utility to NP's approach. All "statistical tests" are indeed Fisherian, and NP just gave us power, which is irrelevant to observational studies and at best may help improve efficiency in experimental studies.
In Ronald Fisher's framework, the p-value measures strength of divergence from the hypothesized null: there is no formally specified alternative hypothesis, no type II error, no explicit loss function, and no fixed long-run decision rule. The p-value answers P(data as or more extreme | H0). Fisher viewed this as a continuous measure of divergence, not a mechanical accept/reject device.
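For concreteness, here is a minimal Python sketch of that definition with made-up data, using a one-sample t-test as the divergence measure; the sample and the null value of 0 are purely illustrative:

```python
# Minimal sketch with made-up data: the p-value as
# P(statistic as or more extreme than observed | H0),
# reported as a continuous measure of divergence.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(loc=0.4, scale=1.0, size=30)   # hypothetical sample
res = stats.ttest_1samp(y, popmean=0.0)       # H0: population mean is 0

print(f"t = {res.statistic:.2f}, p = {res.pvalue:.3f}")  # report p itself, no accept/reject step
```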
Finally, decision-theoretic extensions of the NP framework are used in medicine mainly in clinical decision analysis, health economics, and policy modeling, not in everyday p-value reporting.
So why do we need any of these ideas in relation to p-values? Why not just agree with Fisher? Any thoughts would be of keen interest.
Well put except for "why not just agree with Fisher". Though I believe the Fisherian approach is better, I still think it's awful. Null hypotheses are artificial straw-man constructs that do not serve most research goals well. That's because answering questions is more suitable to scientific endeavor than testing hypotheses, other than when existence hypotheses are the central issue, as in particle physics or ESP research. A common type of question that is very relevant is "how much risk reduction will patients get if they take a statin rather than placebo?". This leads to estimation and Bayesian evidence quantification related to the veracity of every possible level of risk reduction.
The biggest defects in Fisher's approach are how poorly p-values deal with questions of real interest and how they must take investigator intentions into account, because p-values are probabilities about data and not about unknowns. For example, two investigators can analyze the same dataset and get different results when one analyzed the data only at the planned study end, while the other also did an interim look that was inconsequential.
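A small simulation sketch of that point (hypothetical numbers throughout): under the null, the investigator who also tests once at the halfway point, without adjustment, rejects more often than the nominal 5%, even though in most of those trials the interim look changed nothing:

```python
# Simulation sketch (hypothetical setup): under H0, a single planned test
# keeps close to the nominal 5% error rate, while adding an unadjusted
# interim look at n = 50 of 100 inflates it.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_final, n_interim, alpha = 20_000, 100, 50, 0.05
reject_final_only = 0
reject_with_interim = 0

for _ in range(n_sims):
    y = rng.normal(size=n_final)                             # data generated under H0
    p_final = stats.ttest_1samp(y, 0.0).pvalue
    p_interim = stats.ttest_1samp(y[:n_interim], 0.0).pvalue
    reject_final_only += p_final < alpha
    reject_with_interim += (p_interim < alpha) or (p_final < alpha)

print(f"final test only:             {reject_final_only / n_sims:.3f}")   # about 0.05
print(f"interim + final, unadjusted: {reject_with_interim / n_sims:.3f}")  # above 0.05
```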
By computing P(data more extreme than the observed data | H0), Fisher thought that the exercise was fully objective and scientific. For a moment he pushed a more relevant quantity, P(getting observed data | H0), but soon realized that all these probabilities are tiny or zero, and hence had to pool the question of interest with other possibilities to have any chance of getting a large p-value. The best contribution from Fisher at this point was his statement that a large p-value should only be interpreted as "get more data" and should not be used as evidence for H0.
I would agree completely with you on these interpretative issues and on the broader issue of utility. My point is that these questions nevertheless keep popping up, as p-values and CIs are not going away any time soon. If I am asked what utility type I and II errors and alternative hypotheses have over and above whatever we get from Fisherian p-values, then my answer is: none. I agree that p-values deal very poorly with questions of real interest to us, but if someone asks for an explanation of these concepts, I plan from now on to say that what NP proposed does not even have that nominal inferential value, because its purpose is not inference (at least from my viewpoint). And a final question regarding the NP approach: why would I want to know the probability that I would reject the null hypothesis at this sample size? It is not relevant; it just amplifies the misconception that rejection of a tested hypothesis implies something meaningful, which it does not.
Nice topic. I suspect that these historical arguments between Fisher and Neyman/Pearson might be taught very poorly (or not at all), even though students will never be able to understand the problems with hypothesis testing if they haven’t deeply understood the arguments.
A few weeks ago, I came across this great article written by Steven Goodman back in 1993:
He describes the divergent views of Fisher and Neyman/Pearson and the reasons for their disagreements. I read the article several times. It hurt my brain, but I hope I understand things a bit better now (?). I had to summarize the gist of the article for myself, for future reference, since the ideas are pretty abstract.
Side issue: The incredible amount of emphasis on controlling α when it has nothing to do with the probability of making a decision error has really hurt science. I guess this could be viewed as an NP thing, or perhaps a wish to check the calibration of Fisher's p-values.
In Ronald Fisher’s framework, the p-value measures strength of divergence from the hypothesized null…
Not quite. As Amrhein, Trafimow, and Greenland noted, the p-value also depends on model specification, which has many, many failure modes.
Yes, a small P-value may arise because the null hypothesis is false. But it can also mean that some mathematical aspect of the model was not correctly specified, that sampling was not a hundred percent random, that we accidentally switched the names of some factor levels, that we unintentionally, or intentionally, selected analyses that led to a small P-value (downward “P-hacking”), that we did not measure what we think we measured, or that a cable in our measuring device was loose…. And a large P-value may arise from mistakes and procedural errors, such as selecting analyses that led to a large P-value (upward P-hacking), or using a measurement so noisy that the relation of the measured construct to anything else is hopelessly obscured.
Also, how do you propose we plan a study without NP-influenced ideas such as power analysis or precision analysis? As a consulting statistician, "how many subjects?" is one of the most common reasons a scientist asks me for help.
I think sample-size calculations are overrated. We should probably abandon them for observational studies, and for experimental studies I do not really see how knowledge of the predicted long-run proportion of rejected tested hypotheses, given fixed type I and II errors, really helps anyone, even the researcher. Perhaps we should be pragmatic about such trials, e.g., what sample size is required for the range of tested hypotheses under which the data would not be considered unusual to be no wider than ±10%?
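One way to read that ±10% suggestion is as a precision target: pick n so that the 95% compatibility interval is no wider than ±10 percentage points. A back-of-the-envelope sketch for a single proportion, assuming a Wald-type interval and (worst case) p = 0.5:

```python
# Back-of-the-envelope precision sketch (one reading of the ±10% idea):
# choose n so that a 95% Wald-type interval for a proportion has
# half-width at most 0.10, i.e. 1.96 * sqrt(p * (1 - p) / n) <= 0.10.
import math

def n_for_half_width(half_width: float, p: float = 0.5, z: float = 1.96) -> int:
    """Smallest n giving interval half-width <= half_width at assumed proportion p."""
    return math.ceil((z / half_width) ** 2 * p * (1 - p))

print(n_for_half_width(0.10))        # worst case p = 0.5  -> 97 subjects
print(n_for_half_width(0.10, 0.2))   # if the rate is nearer 0.2 -> 62 subjects
```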
The most sensible way I’ve found to compute sample size is to solve for the smallest N such that decisions made from posterior probabilities do not meaningfully vary over a spectrum of prior distributions. More here.
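A toy Beta-Binomial sketch of the general idea (not the procedure in the linked post; the spectrum of priors, the assumed 60% success rate, and the 0.05 tolerance are all invented for illustration):

```python
# Toy sketch: increase n until P(theta > 0.5 | data) barely changes across a
# spectrum of Beta priors, assuming the data arrive at a fixed 60% success rate.
from scipy import stats

priors = [(1, 1), (0.5, 0.5), (2, 2), (5, 5), (1, 3)]   # assumed Beta(a, b) priors
assumed_rate, tolerance = 0.60, 0.05

def posterior_prob(successes: int, n: int, a: float, b: float) -> float:
    """P(theta > 0.5) under a Beta(a, b) prior and a binomial likelihood."""
    return 1 - stats.beta.cdf(0.5, a + successes, b + n - successes)

n = 10
while True:
    s = round(assumed_rate * n)
    probs = [posterior_prob(s, n, a, b) for a, b in priors]
    if max(probs) - min(probs) < tolerance:              # priors no longer matter much
        break
    n += 10

print(f"smallest n (in steps of 10) with prior-to-prior spread < {tolerance}: {n}")
```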
In response to my question, @s_doi and @f2harrell proposed alternate ways to estimate sample size. This is not the same as “abandoning” such calculations altogether, which would be ethically reckless, as explained at length by others elsewhere.
The ethical statistical practitioner makes informed recommendations for sample size and statistical practice methodology to avoid the use of excessive or inadequate numbers of subjects and excessive risk to subjects.
(Part 3 is also relevant to those of us involved in animal studies, and makes a related point about sample size.)
Sample size calculations cannot be done with any measure of certainty because they require knowledge of the true effect in the study, which is always unknown (not only before but also after the study is conducted) and which, if known, would make conducting the study unnecessary [1]. Further, post hoc assessments of power are deeply problematic (e.g., they are irrelevant, typically biased, and have large sampling variation) and thus should not be calculated [17, 18, 19, 20]. When a power analysis is required, there is a strong motivation to assume an effect size large enough to yield the power one desires [21]. Therefore it is best, instead, to simply include all participants who are available within the time frame of the study.
Having enough information to make meaningful decisions is an ethical requirement. Having a sample size calculation is not. The best science is done with sequential knowledge acquisition and no sample size calculation. We know as statisticians that the vast majority of sample size calculations are voodoo. Ethically what is needed is a commitment to ultimately have meaningful information or to abandon an uninformative study (or one with a harmful treatment) as early as possible.
Sample size calculations are not made to produce "the answer". They are tools for the collaborative process of sample size planning: to explore with the investigator various hypothetical but plausible scenarios under different assumptions, possible tradeoffs, and so on. Ultimately a choice is made, but it is never dictated by any one specific calculation, as usually a set of such calculations (or simulations) is needed for the discussions with the investigator. I have used precision analysis (an NP-influenced concept) in such settings. Precision analysis focuses on estimation and moves attention away from the straw-man null hypothesis, type I error, etc. For example, consider the initial FDA guidance for industry on EUA for COVID-19 vaccines from summer 2020 (i.e., a minimum VE of 0.50 with a minimum lower confidence limit of 0.30). The guidance didn't mention p-values anywhere. It seems to me that precision analysis is designed to address the type of reasoning expected in this scenario.
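As a rough sketch of that style of reasoning (my own construction, not the actual EUA calculations): under an assumed true VE of 0.60 and 1:1 randomization, simulate how many total cases are needed before the criterion "point estimate ≥ 0.50 and lower 95% confidence limit > 0.30" is met in most simulated trials:

```python
# Rough precision-analysis sketch with assumed inputs: with 1:1 randomization and
# true VE = 0.60, each case falls in the vaccine arm with probability
# pi = (1 - VE) / (2 - VE).  Scan total case counts to see when the criterion
# "VE point estimate >= 0.50 and lower 95% limit > 0.30" holds in most trials.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_ve = 0.60
pi = (1 - true_ve) / (2 - true_ve)                       # about 0.286

def ve_estimate_and_lower(x: int, n: int):
    """VE point estimate and lower 95% limit from x vaccine-arm cases out of n total."""
    pi_hat = x / n
    pi_upper = stats.beta.ppf(0.975, x + 1, n - x)       # Clopper-Pearson upper limit for pi
    return 1 - pi_hat / (1 - pi_hat), 1 - pi_upper / (1 - pi_upper)

for total_cases in range(50, 301, 50):
    draws = rng.binomial(total_cases, pi, size=5000)
    results = [ve_estimate_and_lower(x, total_cases) for x in draws]
    met = np.mean([(est >= 0.50) and (lower > 0.30) for est, lower in results])
    print(f"{total_cases:3d} total cases: criterion met in {met:.0%} of simulated trials")
```

Scanning the output across case counts is exactly the kind of scenario exploration described above, with no p-value anywhere in the reasoning.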