Deborah Mayo reminded us recently in her blog that 10 years ago a P value of ‘5 sigma’ was used in support of evidence for the existence of the Higgs Boson [10 years after the July 4 statistical discovery of the the Higgs & the value of negative results | Error Statistics Philosophy]. This is a summary of my comment on her blog.
I gather that the above ‘5 sigma’ corresponded to a two sided P value of about 0.00000059 or a one-sided P value of about 0.0000003. This suggests that in the case of the Higgs Boson data, the probability is 1- 0.0000003 = 0.9999997 of the study result being ‘replicated’ by being greater than the null hypothesis if an infinite number of observations were made to get the ‘true’ result. A one-sided P value of 0.0000003 also suggests that the observed mean result was 4.995 SEMs away from the null hypothesis. According to my understanding, if the experiment was repeated in exactly the same way, then the probability of getting a P value of 0.025 one sided or less the second time would be 0.94. The reasoning that led me to arrive at the above conclusions is as follows.
If the estimation was based on a Gaussian distribution of continuous variables then the prior probability of the possible true values and the possible observed values conditional on the universal set of continuous numbers would be uniform and the same for the possible observed and possible true values. We can therefore assume that the probability of a possible true value conditional on an observed value is equal to the likelihood of the same possible observed value conditional on the same true value. Therefore the probability of the true value being the same or more extreme than the null hypothesis is equal to the P value and that the probability of the true value being less extreme than the null hypothesis is equal to 1-P.
Instead of repeating the study with an infinite number of observations, if it were repeated with only the same number of observations then the variance of the observations would depend on two separate groups of observations and would be twice as great, the SEM being √2 = 1.1414 as great. The null hypothesis would now be 4.9950/1.1414 = 3.532 SEMs away from the null hypothesis. The probability of replication greater than the null hypothesis would now be lower at 0.998. However, if we expected to get a P value of 0.025 or less for the repeat study, then the second result would have to be 1.96 SEMs (i.e. an effect size of 1.96 SEMs) or more away from the null hypothesis at 3.532-1.960 = 1.572. The latter corresponds to a probability of replication of 0.94. Note that if the original P value had been 0.025, then the above calculation provides a probability of replication with a P value of 0.025 or less would only be 0.28. This corresponds to the ball park replication frequency of 36% found in replication studies [1].
Perhaps the best thing to do is take a ‘long view’ by expressing the probability of replication as the theoretical probability of a result falling within a specified range (e.g. less extreme than the null hypothesis) if the study was repeated with an infinite number of observations. This still leaves the question of what level of ‘long term replication’ should constitute a ‘statistically significant’ result. According to current custom it would be a probability of 0.975 for a one sided P value of 0.025 and a prediction interval of 95% for two sided P values.
A Bayesian prior probability is not conditional only on the universal set but also on personal informal evidence. In this sense a Bayesian prior is a posterior probability based on a personally estimated likelihood distribution and a uniform prior distribution conditional on the universal set of all continuous numbers. The second prior is then combined with another likelihood distribution based on data to create a second posterior distribution. The frequentist parallel is to combine two data sets based on identical methods by calculating their weighted mean and variance or calculating the product of their likelihoods at each baseline value and normalising [2]. The latter is also based on assumption of uniform priors, which is also made when calculating 95% prediction intervals. However, if a test result is based on the mean of several measurements then SEMs will be used to calculate prediction intervals, in the same way as they are used to calculate confidence intervals.
I would be grateful for comments.
References
-
Open Science Collaboration (2015) Estimating the reproducibility of psychological science. Science; 349 (6251):aac4716.0
-
Llewelyn H (2019) Replacing P-values with frequentist posterior probabilities of replication—When possible parameter values must have uniform marginal prior probabilities. PLoS ONE 14(2): e0212302. https://doi.org/10.1371/journal.pone.0212302s