How to achieve a pragmatic but correct interpretation of p values in the null hypothesis significance testing (NHST) framework

I’m struggling with a theoretical question about null hypothesis significance testing (NHST) in my undergraduate class, and I imagine this forum could offer some advice and recommendations on it.

Assuming that:

-Inferential statistics aim to estimate a population parameter from a sample. If the sample has adequate properties (e.g. it is randomly drawn), we can use all the inferential machinery of statistics.

-When we test hypotheses of differences between means, using t tests for example, the p value is a widely used measure to reject or fail to reject the null hypothesis.

-Traditional frequentist statistics assume that the null hypothesis is true (for the purpose of the calculation).

-The formal definition of a p value is the probability of observing a test statistic (in this case, the t value) as extreme as, or more extreme than, the one observed, conditional on the null hypothesis being true.
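As a concrete illustration of that definition, here is a minimal sketch (with simulated, entirely hypothetical data) showing that the two-sided p value from a pooled two-sample t test is exactly P(|T| >= |t_obs|) under the null's t distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical sleep-hours data for two groups (illustration only)
males = rng.normal(8.0, 1.0, size=30)
females = rng.normal(7.0, 1.0, size=30)

# Pooled-variance t test (scipy's default, equal_var=True)
t_stat, p_two_sided = stats.ttest_ind(males, females)

# The same p value from the definition: P(|T| >= |t_obs| | H0),
# where T has a t distribution with n1 + n2 - 2 df under the null
df = len(males) + len(females) - 2
p_manual = 2 * stats.t.sf(abs(t_stat), df)

print(p_two_sided, p_manual)  # the two values agree
```

The point of the sketch is only to make the conditional probability in the definition tangible; the data and group sizes are made up.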

Now I would like to know whether you partially or fully agree with the passage below. If not, how would you change it to make it consistent?

"Let’s say you want to know how many hours male and female students sleep at night. There are two ways to conduct this research. The first is a census study, in which every student is measured. The second is to study a random sample of students and then generalize the result to the entire population. For several reasons, census research is not an option, so we take the second route. Suppose the study finds that, on average, males sleep for 8 hours and females for 7 hours, and the t test is significant, with a p value of 0.02 (alpha was set at 0.05). What can we conclude?

(1) Under the null hypothesis, we were expecting similar results;

(2) With this p value, there is strong evidence against the null hypothesis!

(3) Male students sleep more than female students in the population!

(4) I’m aware that the parameter is fixed and my interval varies. Therefore, I’d expect intervals constructed this way to capture the parameter in 95% of studies over the long run.

(5) Statistics provided a shortcut: it was not necessary to study the entire population to learn some of its properties.

(6) In this scenario, we could “reverse engineer” the result by surveying all students to check the parameter. There are many scenarios with effectively infinite populations, in which these procedures are even more valuable!"

Null hypothesis significance testing is widely used in applied statistics, and the correct interpretation of the results is the outcome we all want as professors. References are very welcome!

This is a theoretical question. If it is not suitable for this forum, please let me know.

Thank you.


The textbooks don’t really cover this well, but the “hypothesis test” is a procedure (reject when p < \alpha) for making an educated guess about a course of action with a fixed error probability. This can be useful in the context of the “inductive behavior” of an individual agent, but the value of these individual decisions as evidence for a community of scientists is questionable.

Another point: the textbook calculations condition on the null (act as if it is true), but we should not assume that this null is in any sense “true” (outside of certain special contexts). The null is a point of reference that can be made to (approximately) hold via randomization. A purist Bayesian might question this need for randomization, but will still comprehend the result.
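One way to make the randomization point concrete is a permutation test: by re-randomizing the group labels, the null of "labels don't matter" holds by construction rather than by assumption. A minimal sketch with simulated data (the group sizes and values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data for two groups of 12
a = rng.normal(8.0, 1.0, size=12)
b = rng.normal(7.0, 1.0, size=12)

observed = a.mean() - b.mean()
pooled = np.concatenate([a, b])

# Under the randomization null, every relabeling of the pooled data is
# equally likely, so we can build the null distribution directly.
n_perm = 10_000
count = 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)
    diff = perm[:12].mean() - perm[12:].mean()
    if abs(diff) >= abs(observed):
        count += 1

# Add-one correction keeps the estimate away from exactly zero
p_perm = (count + 1) / (n_perm + 1)
print(p_perm)
```

Here the reference distribution comes from the design itself (the randomization), not from a distributional assumption the analyst asserts to be "true".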

A two-sided p value can be interpreted as a measure of surprise, and as indirect evidence for an alternative (i.e., that the observed sign is the “true” sign in the population). This continuous interpretation of p values permits different studies to be combined, and is better suited to the needs of a community of scientists.
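The "combining studies" point can be illustrated with Fisher's method, which pools k independent p values via the fact that, under a shared null, -2 * sum(ln p_i) follows a chi-square distribution with 2k degrees of freedom (the study p values below are made up):

```python
import math
from scipy import stats

def fisher_combine(p_values):
    """Fisher's method: under H0, -2 * sum(ln p_i) ~ chi-square with 2k df."""
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return stats.chi2.sf(statistic, 2 * len(p_values))

# Three hypothetical studies, none individually significant at 0.05,
# yet jointly they carry evidence against the shared null
p_combined = fisher_combine([0.08, 0.06, 0.09])
print(p_combined)  # smaller than any individual p value
```

This only works because p values are continuous evidence measures; a list of binary reject/fail-to-reject decisions could not be pooled this way.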

Some key papers on p values I’ve found helpful:

The authors give a Bayesian perspective on p values that is critical for proper interpretation.

Senn describes Fisher’s motivation for calculating p values vs. the Bayesian use of prior distributions. For this simple problem, the amount of prior weight on the null makes a world of difference when testing a precise (point) null vs. a dividing (directional) one. For frequentist calculations, however, it doesn’t matter all that much.

P values, S-values, and relations to measures of information

The link to Sander Greenland and Zad Rafi’s paper is also critical; it connects statistical methods to information theory. More useful than a single \alpha level or p value for a specific parameter is the set of intervals as \alpha ranges from 0 to 1. This set of “confidence intervals” has the equally misleading name of “confidence distribution”.
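The s-value from the Greenland/Rafi paper is simply the Shannon information of the p value, s = -log2(p): the number of bits of "surprise" against the test hypothesis, comparable to seeing s heads in a row from a fair coin. A one-function sketch:

```python
import math

def s_value(p):
    """Shannon information (s-value) of a p value, in bits of surprise."""
    return -math.log2(p)

# p = 0.5 is 1 bit of surprise (one head in a row from a fair coin);
# p = 0.05 is only about 4.3 bits (roughly four heads in a row).
print(s_value(0.5), s_value(0.05))
```

Framed this way, p = 0.05 corresponds to a rather unremarkable amount of information against the null, which is part of the paper's argument.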

Shafer’s recent paper frames hypothesis testing as gambling, a useful way to think about statistical problems and methods.

FWIW, I like the perspective Shafer takes, but am not terribly enthusiastic about his proposed measure to do away with p values. A similar game-semantic framework can be developed using the Greenland/Rafi s-value, along with the Bayesian machinery of Berger et al.

After studying the first two papers, you will find that none of the commonly used “significance levels” constitute “strong evidence” when interpreted as Bayes factors or bits of information.
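One calibration behind that claim is the Sellke-Bayarri-Berger lower bound: for p < 1/e, the Bayes factor in favor of the null (against any alternative in a broad class) is at least -e * p * ln(p). A quick sketch:

```python
import math

def min_bayes_factor(p):
    """Sellke-Bayarri-Berger lower bound on the Bayes factor for H0.

    Valid for p < 1/e; smaller values mean stronger evidence against H0.
    """
    return -math.e * p * math.log(p)

bf = min_bayes_factor(0.05)
print(bf, 1 / bf)  # bound is about 0.41, i.e. odds against H0 of at most ~2.5:1
```

So even at the conventional 0.05 threshold, the data can shift the odds against the null by at most a factor of roughly 2.5, which hardly qualifies as "strong evidence".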