I would like to suggest another corollary to post 1 (https://discourse.datamethods.org/t/some-thoughts-on-uniform-prior-probabilities-when-estimating-p-values-and-confidence-intervals/7508?u=huwllewelyn). The replication crisis may be due, at least in part, to studies having insufficient power to detect replication with a P value of ≤ 0.05 two-sided (or ≤ 0.025 one-sided) in the second (replicating) study.
In the Open Science Collaboration study, 97 different experiments in psychology were selected that had been planned and conducted well and that had all resulted in two-sided P values of ≤ 0.05. The average P value among their results was 0.028 two-sided. When all 97 were repeated, only 35/97 (36.1%) were replicated with a two-sided P value of ≤ 0.05 a second time.
Assume that another replication study had been conducted based on 97 trials, but of the same nature as the study in post 1 above. At the planning stage, the estimated ‘Bayesian’ distribution would have a standard deviation of 10 mmHg and a mean BP difference of 2.2 mmHg (the latter two estimates corresponding to a one-sided P value of 0.014). From this information, the number of paired observations required for an 80% power of getting a one-sided P value of ≤ 0.025 in the first real study was about 163. However, the frequency of replication with these parameters was only 36.7%.
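As a check, the sample size of about 163 can be reproduced with the standard power formula for a paired mean difference, using only Python’s standard library (the variable names below are mine, not from the post):

```python
import math
from statistics import NormalDist

# Planning parameters from the post: SD 10 mmHg, mean BP difference 2.2 mmHg
sd, delta = 10.0, 2.2
alpha, power = 0.025, 0.80          # one-sided alpha, target power

z = NormalDist()
# Standard formula: n = ((z_{1-alpha} + z_{power}) * sd / delta)^2
n_raw = ((z.inv_cdf(1 - alpha) + z.inv_cdf(power)) * sd / delta) ** 2
n = math.ceil(n_raw)
print(n)  # 163 paired observations
```

This agrees with the post’s figure of about 163 paired observations for 80% power at a one-sided P of 0.025.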
The calculations below are based on the assumption that the prior probability distribution of the possible true results of a study (i.e. if the study were continued until there were an infinite number of subjects), conditional on the universal set of all real numbers and prior to any knowledge at all of the nature of the proposed study, is uniform. This is in contrast to a Bayesian prior probability, which is also based on knowledge of the proposed design of the study; the latter allows parameters such as the standard deviation and mean difference to be estimated. A uniform prior probability conditional on the universal set of real numbers means that the probability and likelihood distributions regarding the true results are the same.
The calculation of this replication frequency of 36.7% is based on the summed effect of the variation represented by three distributions. Each distribution is based on an estimated mean difference of 2.2 mmHg and an SEM of 10/√163 = 0.783, so that the variance is 0.783² = 0.613. These are the parameters of an estimated ‘Bayesian’ probability distribution A of the expected true values conditional on the above parameters. Distribution B, of the first study result conditional on each possible true value, will be assumed to be the same as distribution A. The errors represented by these two distributions have to be summed, so the variance of their combined distribution will be double the variance of distribution A: 2 × 0.613 = 1.226. The probability distribution of the estimated mean result of the second (replicating) study, conditional on each possible result of the first study, will also be the same as distribution A. The errors have to be summed again, so that the resulting variance will be triple the variance of distribution A: 3 × 0.613 = 1.839.
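The variance summation above can be sketched in a few lines of Python (the small discrepancies in the third decimal place arise only from rounding 0.613 before multiplying):

```python
import math

sem = 10 / math.sqrt(163)   # SEM of distribution A, about 0.783
var_a = sem ** 2            # variance of A, about 0.613
var_two = 2 * var_a         # A plus B (first study result), about 1.227
var_three = 3 * var_a       # adding the replicating study, about 1.840
print(round(var_a, 3), round(var_two, 3), round(var_three, 3))
```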
The Excel calculation of the probability that the second (replicating) study again yields a P value of ≤ 0.025, based on a sample size of 163 and the above estimated Bayesian parameters of distribution A applied three times (see ‘*3’ in the Excel formula below), is therefore:
=NORMSDIST(2.2/(((10/163^0.5)^2)*3)^0.5+NORMSINV(0.025)) = 0.367 (Equation 1)
For the purpose of replication, a sample size of 163 therefore gives a severely underpowered study. However, if the sample size is tripled from 163 to 489 to deal with the summation of the three variances, then according to the above model we can expect 80% power to detect a one-sided P value of ≤ 0.025 during replication, i.e. a replication probability of 0.801, as follows:
=NORMSDIST(2.2/(((10/489^0.5)^2)*3)^0.5+NORMSINV(0.025)) = 0.801 (Equation 2)
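Equations 1 and 2 can be translated directly into Python using the standard library’s `NormalDist` in place of NORMSDIST/NORMSINV (the function name and defaults below are mine, for illustration):

```python
from statistics import NormalDist

z = NormalDist()

def replication_prob(n, sd=10.0, delta=2.2, alpha=0.025):
    # SEM inflated by the three summed variance components
    # (the '*3' in the Excel formulas)
    sem3 = ((sd / n ** 0.5) ** 2 * 3) ** 0.5
    return z.cdf(delta / sem3 + z.inv_cdf(alpha))

print(round(replication_prob(163), 2))  # 0.37 (Equation 1)
print(round(replication_prob(489), 2))  # 0.80 (Equation 2)
```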
Based on the above assumptions, the one-sided and two-sided probabilities of replication are the same. I would therefore like to suggest, on the basis of the above argument, that for 80% power to achieve replication with a two-sided P value of ≤ 0.05 again, the Open Science Collaboration studies would have required three times the number of observations.