Proofs of Nulls: P > 0.05

Throughout the scientific literature, there are several examples of authors and editors interpreting nonsignificant results (p > 0.05) as evidence for the null hypothesis. This contradicts the logic of frequentist hypothesis testing, in which the test hypothesis (such as a null hypothesis) can only be rejected, never accepted.

“By increasing the size of the experiment, we can render it more sensitive, meaning by this that it will allow of the detection of a lower degree of sensory discrimination… Since in every case the experiment is capable of disproving, but never of proving this hypothesis, we may say that the value of the experiment is increased whenever it permits the null hypothesis to be more readily disproved.” - Fisher, The Design of Experiments

Failure to reject the null hypothesis is treated as proof of the null rather than as a true suspension of judgement, as can be seen in several examples such as the following:

Brown et al. 2018

Such misinterpretations also have a trickle-down effect, influencing people who do not understand concepts like statistical significance, including policy makers, influential bloggers, and others. Take this example from the highly popular blog Science-Based Medicine, interpreting the results of an observational study on EMFs and brain tumor risk. SBM writes:

"Overall, around 10% of study participants were exposed to RF while only 1% were exposed to IF-EMF. There was no clear evidence for a positive association between RF or IF-EMF and the brain tumors studied, with most results showing either no association or odds ratios (ORs) below 1.0. The largest adjusted ORs were obtained for cumulative exposure to RF magnetic fields (as A/m-years) in the highest exposed category (≥90th percentile) for the most recent exposure time window (1–4 years before the diagnosis or reference date) for both glioma, OR = 1.62 (95% confidence interval (CI): 0.86, 3.01) and meningioma (OR = 1.52, 95% CI: 0.65, 3.55).

Essentially, this is a negative study. The possible correlation between the highest and most recent exposure with brain tumors was not statistically significant. The authors conclude that this possible association deserves further research, which is what authors always say, but this needs to be put into context.

When looking for so many possible correlations, there is going to be a lot of statistical noise. In this study, some of the groups had risks less than one, which if taken literally means that the EMF had a protective effect. No one would conclude this, however. This is just the random noise in the data. Even statistical significance does not necessarily mean an effect is real, but it is a minimum threshold for the data being interesting. In these kinds of studies, and generally, a non-statistical trend should just be treated as negative."

While we lack any serious data, one could argue that such misinterpretations have the potential for catastrophic effects on health policy and public education.

I hope that a discussion of this topic here can foster recommendations on how to address such problems in the literature, how to word corrections, and how to bring about systematic change around these widespread misinterpretations.



As a medical educator, I have an interest in improving how we teach interpretation of study results to medical students, residents, and fellow physicians.

To that end, I have found widespread issues in many of our popular EBM textbooks when it comes to results with p > 0.05. I have discussed this in another post here: Errors in teaching appraisal of null results

Certainly, if an EBM textbook written by medical experts is making mistakes like this, it’s no wonder lay physicians like myself have similar issues.


I will also add that the bizarre fixation on statistical significance in observational studies, where there is no random mechanism, is ridiculous, as can be seen from the examples cited above. Clearly these folks have never read Fisher, Cox, or Greenland.

“Randomization provides the key link between inferential statistics and causal parameters. Inferential statistics, such as P values, confidence intervals, and likelihood ratios, have very limited meaning in causal analysis when the mechanism of exposure assignment is largely unknown or is known to be nonrandom.”

Or thinking that high power plus nonsignificance is support for the null hypothesis over an alternative.


It might be useful to teach people equivalence testing at the same time as they are taught null-hypothesis significance testing. Equivalence testing immediately makes it obvious that you can have non-significant and ‘equivalent’ results, and non-significant but ‘inconclusive’ results, and that p > .05 does not allow you to draw useful inferences without specifying a smallest effect size of interest.
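To make those outcomes concrete, here is a minimal sketch of a two one-sided tests (TOST) equivalence procedure for a one-sample mean; the equivalence bounds of ±0.5 and the simulated data are illustrative assumptions, not anyone's real study.

```python
# Minimal TOST sketch: declare equivalence only if the mean is
# significantly above the lower bound AND significantly below the upper.
import numpy as np
from scipy import stats

def tost_one_sample(x, low, high):
    """TOST p-value for the mean of x lying inside (low, high)."""
    n = len(x)
    se = np.std(x, ddof=1) / np.sqrt(n)
    # H0a: mean <= low  (reject if the mean is significantly above low)
    p_low = stats.t.sf((np.mean(x) - low) / se, df=n - 1)
    # H0b: mean >= high (reject if the mean is significantly below high)
    p_high = stats.t.cdf((np.mean(x) - high) / se, df=n - 1)
    # Both one-sided tests must reject, so report the larger p-value
    return max(p_low, p_high)

rng = np.random.default_rng(1)
x = rng.normal(0.1, 1.0, 200)          # true mean small but nonzero
p_tost = tost_one_sample(x, -0.5, 0.5)  # equivalence test vs. +/- 0.5
p_nhst = stats.ttest_1samp(x, 0).pvalue  # ordinary point-null test
```

Crossing the two p-values gives four possible outcomes (significant or not, equivalent or not); only "nonsignificant and not equivalent" is genuinely inconclusive, which is exactly the distinction a lone p > .05 hides.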


I find p-value functions useful to illustrate how other test hypotheses yield even higher p-values.

Even if the person adheres to the dichotomisation of p-values, the curve shows that there are many more test hypotheses with p>0.05.
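A minimal sketch of how such a p-value function can be computed, assuming a normal approximation; the point estimate and standard error here are made-up numbers, not from any of the studies above.

```python
# P-value function sketch: compute the two-sided p-value for a whole
# grid of candidate test hypotheses, not just the null of zero.
import numpy as np
from scipy import stats

mean, se = 1.6, 0.4                    # illustrative estimate and its SE
grid = np.linspace(0.0, 3.5, 701)      # candidate hypothesized values
z = (mean - grid) / se
pvals = 2 * stats.norm.sf(np.abs(z))   # two-sided p for each hypothesis

# Every hypothesis inside the 95% CI has p > 0.05, and the curve peaks
# at p = 1.0 where the test hypothesis equals the point estimate.
inside = grid[pvals > 0.05]
print(inside.min(), inside.max())      # approximately the 95% CI limits
```

Plotting `pvals` against `grid` draws the curve: even a dichotomizer can see the whole interval of hypotheses that would not be rejected, including many far from the null.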

Failure to reject the null hypothesis is treated as proof of the null rather than as a true suspension of judgement, as can be seen in several examples such as the following:

I thought Neyman had the idea of two (or even more, if I’m not wrong) competing hypotheses, where you reject one and thereby accept the other.

This sounds logical if you see statistics as a tool for decision making. Pressing button A because you did not reject H0 is the same decision as pressing button A because you accepted H0, I’d say.

FYI, a slight tangent:

This might not be the best example to use. There is a lot of concern about confounding in research on ASD and SSRIs in pregnancy, especially as most data sets can’t assess depression severity, etc. Most agree, based on the cumulative studies and methods to date, that there is insufficient evidence to suggest a causal link between SSRI use in pregnancy and ASD.

More here:

I think also that anyone cued to base rates would understand that. Familiarity with basic logic would also have alerted researchers to it. Nevertheless, I’m a little confused as to the relevance of ‘specifying a smallest effect size of interest’. Do you mean that it corresponds to taking the negative log of the P-value, −log2(p), which yields something known as the Shannon information value or surprisal (S) value?

I may be experimenting with concepts or just clueless. lol

“Smallest effect size of interest” refers to the value assigned to the boundary of an equivalence test.

In biomedical science it is analogous to the MCID, the minimal clinically important difference.
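For what it’s worth, the S-value asked about above is a separate idea from the equivalence bound: it is just a log transform of the p-value into bits of information against the test hypothesis. A minimal sketch:

```python
# Shannon information (surprisal, S-value) transform of a p-value:
# s = -log2(p) bits against the test hypothesis, interpretable as the
# surprise of seeing s consecutive heads from a coin assumed to be fair.
import math

def s_value(p):
    return -math.log2(p)

print(s_value(0.05))  # about 4.3 bits
print(s_value(0.50))  # exactly 1 bit: no more surprising than one toss
```

So p = 0.05 carries only about as much evidence against a hypothesis as four or five heads in a row from a fair coin, which is one way to deflate the aura of the 0.05 threshold.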


Thank you so much. Regards

I still think it’s a good example of misinterpretation of nonsignificance. Whether or not there is a causal relationship between antidepressants and autism seems like a difficult topic to fully discuss, but it is clear that one cannot take the numbers published by the authors and conclude that there was no association whatsoever just because the interval estimate included the null.


The discussion section of the article makes several “proof of the null” arguments based on (1) comparison of outcomes to prior studies, (2) analysis of outcomes with different methods attempting to control for confounding, (3) sensitivity analyses, and (4) analysis of outcomes with different patient selection (the sibling and recent-user analyses).

The authors do make several muddled comments about “statistical significance”, but a more favorable interpretation suggests their conclusion of “no association” was based largely on a priori beliefs, the inability to adjust for all confounding factors, and concerns about selection/measurement bias.

Below is an excerpt of the relevant section from the paper. Reading it in full, I am also hesitant to say there is a statistical association despite the estimates as presented, due to the unresolved confounding and measurement biases.

"To our knowledge, the current study is the first to use HDPS to try to balance exposure group differences in an evaluation of the association between in utero serotonergic antidepressant exposure and autism spectrum disorder. Although baseline characteristics were fairly balanced after inverse probability of treatment weighting using HDPS, imbalance remained with respect to specific psychiatric diagnoses and psychiatric emergency department visits, suggesting that the HDPS did not completely balance the groups on the potentially important confounders of psychiatric illness diagnosis and severity. This may explain the significant association between antidepressant exposure and autism spectrum disorder that was observed in the HDPS-matched sensitivity analysis and some of the subanalyses.

Even though HDPS approaches are a sophisticated method for attempting to account for confounders, no statistical technique can improve comparability of groups that do not sufficiently overlap in confounder distribution, nor can it recover information about variables not contained in the data set by themselves or through proxy representation. Furthermore, it is possible that the HDPS missed 1 or more key low-prevalence covariates because the semiautomated selection method first selects prevalent codes before assessing for associations between covariate and exposure or covariate and outcome.

The sibling and recent user analyses reported herein, by the nature of their designs, may have been better able to account for confounding by the underlying indication compared with the HDPS approach. Future studies can draw on this and similar studies to refine observational methods further to disentangle the roles of serotonergic antidepressants and maternal mental illness in their associations with child autism spectrum disorder and other important outcomes."


Well, it’s a good thing I created an R package just to do that.


Do not conflate the idea of proving the null with having evidence in favor of the null. If we take NHST as being firmly rooted in Popperian science, experiments can never confirm hypotheses, but they can support them, or fail to. This underscores a critical fallacy in interpreting p < 0.05 as disproving the null: the experiment didn’t disprove it; the data are merely surprising under the assumption that the null is true.

Indeed, threshold-based statistical testing is yet another example of dichotomania. Compare the p-value to the power of the study and just describe what the evidence says. That’s how Fisher wanted it.
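One way to make that comparison concrete is to compute the power of the test against a hypothesized effect; a minimal sketch for a two-sided z-test, where the effect size, standard error, and alpha are all illustrative assumptions:

```python
# Power of a two-sided z-test against a hypothesized true effect,
# used to put a nonsignificant p-value in context.
from scipy import stats

def power_two_sided_z(effect, se, alpha=0.05):
    z_crit = stats.norm.ppf(1 - alpha / 2)  # critical value, e.g. 1.96
    z = effect / se
    # Probability the test statistic lands in either rejection region
    return stats.norm.cdf(z - z_crit) + stats.norm.cdf(-z - z_crit)

# A study with SE = 0.25 has low power against an effect of 0.3:
pw = power_two_sided_z(0.3, 0.25)
print(pw)
```

With power this low, a nonsignificant result says almost nothing about whether an effect of that size exists, which is exactly why reporting p > 0.05 as “negative” misleads.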


Given all these issues, should we teach null hypothesis significance testing to medical graduate students in an introductory stats course?
I think the discussions around the misuse and misinterpretation of NHST are good, but at an advanced level.

I think teaching it does far more harm than good, and believe in the maxim “questions are better than hypotheses and estimates are better than questions”. The pushback will be that some exams will require medical students to know the prevailing ways of doing things, so they will have to spend a lot of time learning it.


I think many people who interpret a high P-value as evidence for H0 may think of the set of hypotheses under investigation as binary: either there is an effect, or there is not. I think this comes from the classic H0 vs. H1 presentation.

Hence, if you don’t have evidence for some effect, then it must be no effect, right? But if you make it clear that H0 is one specific hypothesis out of an infinite set, and that H1 comprises all the other specific hypotheses in that set, then you realise that when you argue for H1 you are really just eliminating H0 (one specific hypothesis) from that set. It becomes very clear that a high P-value is not evidence for H0.

I think this often gets lost when looking only at the binary representation of H0 vs. H1; at least, that’s how it was in my case.

But don’t take it for granted that hypotheses are good ideas in the first place. To your point, Tukey and others have argued that because the null represents a single point on a continuum, you’d do better to just assume the null hypothesis is always false, and save a lot of time and money. Better: ask questions and estimate things, including the probability of a non-trivial effect.
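That last suggestion can be sketched with a conjugate normal model; the prior, the observed estimate, and the “trivial effect” threshold below are all made-up illustrative numbers.

```python
# Posterior probability of a non-trivial effect under a normal-normal
# conjugate model: combine a skeptical prior with the observed estimate
# by precision weighting, then ask how much posterior mass exceeds the
# smallest effect considered worth caring about.
from scipy import stats

prior_mean, prior_sd = 0.0, 1.0   # skeptical prior on the effect
est, se = 0.4, 0.25               # observed estimate and its SE
threshold = 0.2                   # smallest non-trivial effect

w_prior, w_data = 1 / prior_sd**2, 1 / se**2
post_sd = (w_prior + w_data) ** -0.5
post_mean = (w_prior * prior_mean + w_data * est) / (w_prior + w_data)

# Posterior probability that the effect exceeds the threshold
p_nontrivial = stats.norm.sf(threshold, post_mean, post_sd)
print(p_nontrivial)
```

The output is a direct answer to the question clinicians actually ask (how likely is an effect big enough to matter?), rather than a verdict on a point null.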


MIT teaches it the other way around: start with the Bayesian approach, and save NHST for advanced students.

What a breath of fresh air it would be were that adopted widely. Just think how clear the forward information-flow thinking would be for the students, and how visible the prior’s effect on the final results. This would set the stage for a great discussion of the hidden role of priors in NHST, and of how background information needs to be brought into frequentist interpretations.