The Higgs Boson and the relationship between P values and the probability of replication

Deborah Mayo recently reminded us on her blog that 10 years ago a ‘5 sigma’ result was cited as evidence for the existence of the Higgs boson [10 years after the July 4 statistical discovery of the the Higgs & the value of negative results | Error Statistics Philosophy]. This is a summary of my comment on her blog.

I gather that this ‘5 sigma’ corresponds to a two-sided P value of about 0.00000059 or a one-sided P value of about 0.0000003. This suggests that, for the Higgs boson data, there is a probability of 1 - 0.0000003 = 0.9999997 that the study result would be ‘replicated’, by being greater than the null hypothesis, if an infinite number of observations were made to obtain the ‘true’ result. A one-sided P value of 0.0000003 also implies that the observed mean result was 4.995 SEMs away from the null hypothesis. As I understand it, if the experiment were repeated in exactly the same way, the probability of getting a one-sided P value of 0.025 or less the second time would be 0.94. The reasoning that led me to these conclusions is as follows.
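As a quick check on the conversions above, here is a minimal sketch using only Python's standard library. The variable names are mine, and it assumes a standard Gaussian test statistic measured in SEMs.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard Gaussian: mean 0, SD 1

# Tail areas for a result exactly 5 SEMs from the null hypothesis.
p_one_sided = std_normal.cdf(-5.0)   # area beyond 5 SEMs on one side
p_two_sided = 2.0 * p_one_sided      # both tails

# Going the other way: how many SEMs correspond to a one-sided P of 0.0000003?
z_for_p = -std_normal.inv_cdf(0.0000003)

print(f"one-sided P at 5 sigma: {p_one_sided:.3g}")
print(f"two-sided P at 5 sigma: {p_two_sided:.3g}")
print(f"SEMs for one-sided P = 0.0000003: {z_for_p:.3f}")
```

The third figure should come out close to the 4.995 SEMs quoted above; small differences are down to rounding of the P value.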

If the estimation is based on a Gaussian distribution of continuous variables, then the prior probabilities of the possible true values and of the possible observed values, conditional on the universal set of continuous numbers, are uniform and the same for both. We can therefore assume that the probability of a possible true value conditional on an observed value is equal to the likelihood of that observed value conditional on that true value. It follows that the probability of the true value being the same as or more extreme than the null hypothesis is equal to the one-sided P value, and that the probability of the true value being less extreme than the null hypothesis is equal to 1 - P.
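The equality claimed above can be checked numerically. The sketch below is only an illustration under stated assumptions: a Gaussian likelihood with known SEM, a flat prior over a wide grid of possible true values, and a hypothetical observed result 1.96 SEMs above a null of zero. The function name and grid settings are mine.

```python
from statistics import NormalDist

def posterior_prob_at_or_beyond_null(observed, null, sem,
                                     half_width=12.0, steps=24000):
    """P(true mean <= null | observed) under a flat prior, by grid integration."""
    lo = observed - half_width * sem
    step = 2.0 * half_width * sem / steps
    total = 0.0
    at_or_beyond_null = 0.0
    for i in range(steps):
        true_mean = lo + (i + 0.5) * step  # midpoint of each grid cell
        # Likelihood of the observed mean given this candidate true mean.
        likelihood = NormalDist(true_mean, sem).pdf(observed)
        total += likelihood                # flat prior: weight = likelihood
        if true_mean <= null:
            at_or_beyond_null += likelihood
    return at_or_beyond_null / total       # normalise

observed, null, sem = 1.96, 0.0, 1.0       # hypothetical values
p_value = 1.0 - NormalDist(null, sem).cdf(observed)  # one-sided P value
posterior = posterior_prob_at_or_beyond_null(observed, null, sem)

print(f"one-sided P value:          {p_value:.5f}")
print(f"flat-prior posterior at or beyond null: {posterior:.5f}")
```

The two numbers agree to within the accuracy of the grid, which is the sense in which the P value can be read as a probability about the true value when the prior is uniform.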

If, instead of repeating the study with an infinite number of observations, it were repeated with only the same number of observations, then the relevant variance would depend on two separate groups of observations and would be twice as great, the SEM being √2 = 1.4142 times as great. The observed result would now be 4.995/1.4142 = 3.532 SEMs away from the null hypothesis. The probability of replication greater than the null hypothesis would now be lower, at 0.9998. However, if we required a one-sided P value of 0.025 or less in the repeat study, then the second result would have to be 1.96 SEMs (i.e. an effect size of 1.96 SEMs) or more away from the null hypothesis, leaving 3.532 - 1.960 = 1.572. The latter corresponds to a probability of replication of 0.94. Note that if the original one-sided P value had been 0.025, the same calculation would give a probability of only 0.28 of replication with a P value of 0.025 or less. This is in the same ball park as the replication frequency of 36% found in replication studies [1].
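The arithmetic in the paragraph above can be written as a short function. This is a sketch of my own reasoning rather than a standard routine; the function name and its return values are illustrative, and it assumes a Gaussian test statistic and a repeat study of the same size.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

def replication_probabilities(p_one_sided, alpha_one_sided=0.025):
    """Return (z_rep, P(same side of null), P(repeat one-sided P <= alpha))."""
    z_obs = -std_normal.inv_cdf(p_one_sided)       # SEMs from null, first study
    z_rep = z_obs / sqrt(2.0)                      # variance doubles, SEM grows by sqrt(2)
    z_crit = -std_normal.inv_cdf(alpha_one_sided)  # about 1.96 for alpha = 0.025
    same_side = std_normal.cdf(z_rep)
    significant_again = std_normal.cdf(z_rep - z_crit)
    return z_rep, same_side, significant_again

higgs = replication_probabilities(0.0000003)
borderline = replication_probabilities(0.025)

print(f"Higgs-like result: z_rep={higgs[0]:.3f}, "
      f"same side={higgs[1]:.4f}, P<=0.025 again={higgs[2]:.2f}")
print(f"Original P=0.025:  z_rep={borderline[0]:.3f}, "
      f"same side={borderline[1]:.4f}, P<=0.025 again={borderline[2]:.2f}")
```

The last figure on each line reproduces the 0.94 and 0.28 quoted above.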

Perhaps the best thing to do is to take a ‘long view’ by expressing the probability of replication as the theoretical probability of a result falling within a specified range (e.g. less extreme than the null hypothesis) if the study were repeated with an infinite number of observations. This still leaves the question of what level of ‘long term replication’ should constitute a ‘statistically significant’ result. According to current custom it would be a probability of 0.975 for a one-sided P value of 0.025 and a prediction interval of 95% for two-sided P values.

A Bayesian prior probability is conditional not only on the universal set but also on personal informal evidence. In this sense a Bayesian prior is itself a posterior probability, based on a personally estimated likelihood distribution and a uniform prior distribution conditional on the universal set of all continuous numbers. This second prior is then combined with another likelihood distribution based on data to create a second posterior distribution. The frequentist parallel is to combine two data sets based on identical methods, either by calculating their weighted mean and variance or by calculating the product of their likelihoods at each baseline value and normalising [2]. The latter is also based on the assumption of uniform priors, which is also made when calculating 95% prediction intervals. However, if a test result is based on the mean of several measurements, then SEMs are used to calculate prediction intervals in the same way as they are used to calculate confidence intervals.
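To illustrate the two ways of combining data sets mentioned above, here is a sketch with made-up numbers. I am assuming that ‘weighted mean’ means inverse-variance weighting and that both likelihoods are Gaussian with known SEMs; the function names, grid limits and example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def combine_weighted(mean_a, sem_a, mean_b, sem_b):
    """Inverse-variance weighted mean and its SEM."""
    w_a, w_b = 1.0 / sem_a**2, 1.0 / sem_b**2
    mean = (w_a * mean_a + w_b * mean_b) / (w_a + w_b)
    return mean, sqrt(1.0 / (w_a + w_b))

def combine_likelihood_product(mean_a, sem_a, mean_b, sem_b,
                               lo, hi, steps=20000):
    """Multiply the two likelihoods at each candidate true value, then normalise."""
    # By symmetry of the Gaussian, the likelihood of an observed mean given a
    # true value m equals the density at m of a Gaussian centred on the observed mean.
    dist_a, dist_b = NormalDist(mean_a, sem_a), NormalDist(mean_b, sem_b)
    step = (hi - lo) / steps
    grid = [lo + (i + 0.5) * step for i in range(steps)]
    weights = [dist_a.pdf(m) * dist_b.pdf(m) for m in grid]
    total = sum(weights)
    mean = sum(w * m for w, m in zip(weights, grid)) / total
    variance = sum(w * (m - mean) ** 2 for w, m in zip(weights, grid)) / total
    return mean, sqrt(variance)

# Hypothetical data sets: (mean, SEM) = (2.0, 1.0) and (3.0, 0.5).
weighted = combine_weighted(2.0, 1.0, 3.0, 0.5)
product = combine_likelihood_product(2.0, 1.0, 3.0, 0.5, lo=-5.0, hi=10.0)

print(f"weighted mean approach:     mean={weighted[0]:.4f}, SEM={weighted[1]:.4f}")
print(f"likelihood product approach: mean={product[0]:.4f}, SEM={product[1]:.4f}")
```

Under these assumptions the two approaches give the same combined mean and SEM, which is why the likelihood-product route can be seen as resting on a uniform prior.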

I would be grateful for comments.

References

  1. Open Science Collaboration (2015) Estimating the reproducibility of psychological science. Science; 349 (6251):aac4716.

  2. Llewelyn H (2019) Replacing P-values with frequentist posterior probabilities of replication—When possible parameter values must have uniform marginal prior probabilities. PLoS ONE 14(2): e0212302. https://doi.org/10.1371/journal.pone.0212302


I have posted a thread that relates p-values, S-values (Shannon information), and the Bayes Factor Bound:

I know of 2 recent papers on study design and replication conditional on a “significant” p-value:

van Zwet, EW, Goodman, SN (2022). How large should the next study be? Predictive power and sample size requirements for replication studies. Statistics in Medicine. 41(16): 3090–3101.

van Zwet, E, Schwab, S, Senn, S (2021). The statistical properties of RCTs and a proposal for shrinkage. Statistics in Medicine. 40(27): 6107–6117.


Thank you @R_cubed for your comment and for pointing to these references. They appear to be consistent with the points that I made in my post. @Stephen Senn (one of the authors of your second reference) has discussed this issue in the past [1] after Steve Goodman had raised it [2]. Goodman suggested that if the estimated mean result from a first study with a two-sided P value of 0.05 is assumed to be ‘true’, then there is a probability of 0.5 that it would be ‘replicated’ by a second study of the same power again giving a two-sided P value of 0.05 or less. Goodman suggested adopting this approach as a replacement for P values but Stephen Senn disagreed. (I assume that a ‘true’ result is the average result that would be obtained if an infinite number of measurements were made, so that its variance is zero.)
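For comparison, Goodman's figure of 0.5 can be reproduced in a few lines. This is my own sketch of the calculation as I have described it above, not code from either paper; it assumes a Gaussian test statistic and treats the first study's observed effect as the fixed true effect. The function name is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()

def replication_prob_if_estimate_is_true(p_two_sided, alpha_two_sided=0.05):
    """P(second study two-sided P <= alpha) if the first estimate is the true effect."""
    z_true = -std_normal.inv_cdf(p_two_sided / 2.0)      # first result in SEMs
    z_crit = -std_normal.inv_cdf(alpha_two_sided / 2.0)  # about 1.96
    # The second study's z is Gaussian with mean z_true and SD 1.
    same_direction = 1.0 - std_normal.cdf(z_crit - z_true)
    opposite_direction = std_normal.cdf(-z_crit - z_true)  # negligible here
    return same_direction + opposite_direction

goodman = replication_prob_if_estimate_is_true(0.05)
print(f"P(replication | first estimate is true, P = 0.05): {goodman:.3f}")
```

The difference from the 0.28 in my original post is that here the first estimate is taken as exactly true, so only the sampling variation of the second study is counted rather than that of both studies.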

There are potentially two interpretations of the null hypothesis: (1) The traditional approach is to estimate the likelihood of the observed average result, or something more extreme, conditional on the null hypothesis being ‘true’, and then to try to infer the truth of the null hypothesis. (2) An alternative approach uses the null hypothesis as a threshold to identify a range of ‘true’ results above or below that threshold. This range of true results may have two thresholds (e.g. confidence intervals). This is the approach that I have taken, leading to probabilities of replication within a range of ‘true’ results.

This second approach can be applied to estimating the probability of satisfying the criterion for a diagnosis. For example, if the threshold for the diagnosis of diabetes mellitus is a ‘true’ HbA1c of over 47 mmol/mol, and if the SD of the measurement error is 1.0 mmol/mol, then an individual patient’s probability of a ‘true’ diagnosis of diabetes mellitus conditional on an HbA1c of 49 mmol/mol (i.e. 2 SD away from the threshold) is about 0.977. In practice the diagnosis is simplified by being confirmed by two results over 47 mmol/mol a few days apart.
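The HbA1c example can be sketched as follows, assuming Gaussian measurement error with a known SD and a uniform prior on the ‘true’ value. The function name and the optional averaging over several measurements are my own additions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def prob_true_value_above(threshold, observed_mean, sd_error, n_measurements=1):
    """P(true value > threshold | observed mean) with a flat prior."""
    sem = sd_error / sqrt(n_measurements)  # SEM of the mean of n measurements
    return 1.0 - NormalDist(observed_mean, sem).cdf(threshold)

# Single HbA1c of 49 mmol/mol against a threshold of 47 mmol/mol, SD 1.0.
p_single = prob_true_value_above(47.0, 49.0, 1.0)

# Hypothetical case: two readings that average 49 mmol/mol.
p_two = prob_true_value_above(47.0, 49.0, 1.0, n_measurements=2)

print(f"one measurement:  {p_single:.4f}")
print(f"mean of two:      {p_two:.4f}")
```

A single reading 2 SD above the threshold gives a probability just under 0.98, and averaging two readings at the same level raises it further because the SEM is smaller than the SD.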

Trained statisticians are familiar and comfortable with interpretation (1) of P values and are unlikely to change. However, I think it would help scientists and doctors if P values were translated into probabilities of replication. I think this would improve their understanding and lead to greater appreciation of statistical advice. It might also provide a partial explanation for John Ioannidis’s assertion that “Most published research findings are false” and for the ‘replication crisis’.

References

  1. Senn S. Letter to the editor regarding A comment on replication, P‐values and evidence by S Goodman. Statistics in Medicine, 21: 2437–2444.
  2. Goodman S (1992) A comment on replication, P‐values and evidence. Statistics in Medicine, 11(7): 875–879.