Some thoughts on uniform prior probabilities when estimating P values and confidence intervals

HuwLlewelyn · February 8, 2024, 8:52pm

I would be grateful for views about these thoughts!

It is important when teaching a topic to begin with a concept with which the student is already familiar. This helps the student to integrate the new information into what is known and to ‘understand’ it. For example, a medical student is told that 100 patients had been in a double-blind cross-over randomised controlled trial and that 58 out of those 100 individuals had a BP higher on control than on treatment. Knowing only that a patient was one of those in the study, the probability of that patient having a BP difference greater than zero would be about 58/100 = 0.58. This shows a probability of 0.58 being estimated directly from the experience of 58/100 without involving Bayes rule (analogous to the approach of logistic regression). It is the familiar way in which anyone would form mental probabilities from the cumulative experience of being right or wrong about a series of predictions.

Distributions for explanation

Figure 1: The distributions of blood pressure differences in a cross-over RCT

If the average BP difference between a pair of observations on treatment and control was 2 mmHg and the standard deviation of the differences was 10 mmHg, then the area under the bell-shaped Gaussian distribution above 0 mmHg (i.e. above a point 0.2 standard deviations below the mean of 2 mmHg, shaded yellow in the figure above) would contain 58% of the total area. As the standard deviation of the distribution was 10 mmHg and the number of observations was 100, the standard error of all the possible means would be 10/√100 = 1 mmHg, and 95% of the area under the tall narrow curve with a mean of 2 mmHg and SEM of 1 mmHg would fall between 2 mmHg +/- 1.96 mmHg, i.e. from 0.04 mmHg to 3.96 mmHg. Zero would therefore be just outside these limits. If the study were repeated very many times and the mean found each time, the proportion and probability of a mean being higher than zero would be 0.9772, and the probability of it being zero or lower would be 0.0228. Again, this probability is derived from a direct probability of an outcome conditional on an observation without using Bayes rule. (This represents distribution 1 in post no 11).
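The figures in the paragraph above can be checked with a few lines of Python. This is only a sketch using the standard library's `statistics.NormalDist`; the variable names are illustrative and the SD is treated as known, as in the post.

```python
from math import sqrt
from statistics import NormalDist

mean_diff = 2.0   # observed mean BP difference (mmHg), from the post
sd = 10.0         # SD of the paired differences (mmHg), from the post
n = 100           # number of paired observations

std_normal = NormalDist()  # standard normal, mean 0 and SD 1

# Proportion of individual differences above zero: Phi(mean / SD) = Phi(0.2)
prop_above_zero = std_normal.cdf(mean_diff / sd)

# Standard error of the mean and 95% limits for the mean
sem = sd / sqrt(n)
z95 = std_normal.inv_cdf(0.975)  # about 1.96
ci_low = mean_diff - z95 * sem
ci_high = mean_diff + z95 * sem

# Proportion of the narrow (SEM) curve above zero, and at or below zero
p_mean_above_zero = std_normal.cdf(mean_diff / sem)
p_mean_at_or_below_zero = 1 - p_mean_above_zero

print(f"proportion of individuals above zero: {prop_above_zero:.3f}")
print(f"SEM: {sem:.2f} mmHg, 95% limits: {ci_low:.2f} to {ci_high:.2f} mmHg")
print(f"P(mean > 0): {p_mean_above_zero:.4f}, P(mean <= 0): {p_mean_at_or_below_zero:.4f}")
```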

We now assume a ‘null hypothesis’ that the true difference between the average BPs on treatment and control was zero (i.e. ‘true’ if we had made an infinite number of observations). We assume that 100 selections were made at random from this true population of zero difference. We also assume that the distribution of the differences is the same as in the above study (i.e. a bell-shaped Gaussian distribution), with a standard deviation of 10 mmHg and a standard error of the mean of 1 mmHg. (This is based on distribution 2 in post number 11.) Based on this information we can assume that the likelihood of getting the observed mean difference of 2 mmHg or something more extreme (i.e. over 2 mmHg) is also 0.0228. It also means that if we assume that the true difference was 2 mmHg, then the likelihood of observing a result of zero or lower is 0.0228, and of a result above zero is 0.9772. According to Bayes rule, this means that the prior probability of seeing any observed result above a value X is the same as the prior probability of any true result above a value X. This symmetry and uniformity apply to all the prior probabilities of all true results and all observed results. (This is based on distribution 3 in post number 11.)

The assumption that the directly estimated probability distribution arising from the study is the same as any likelihood distribution based on selecting patients at random from a population with an assumed true value (e.g. the null hypothesis of zero) guarantees that the scales used for the true and observed values are the same. This can be explained by the fact that the scale of values used for the study is a subset of the universal set of all numbers and that the prior probability conditional on that universal set is the same, or uniform, for each of these true and observed values. This uniformity will apply to all studies using numerical values and therefore before any study is even considered. (This is also based on distribution 3 in post number 11.)

We might now explain to the student that a one-sided P value of 0.0228 is the same as the probability of the true result (after repeating the study with an infinite number of hypothetical observations) being the same or more extreme than the null hypothesis. Conversely the probability of the true result being less extreme than the null hypothesis is 1 – P = 1-0.0228 = 0.9772. If the one-sided P value had been 0.025, then it follows from the above that there is a probability of 0.95 that the true result will fall within the 95% confidence interval.

The Bayesian prior probability is different to the above prior probability conditional on the universal set of all numbers. The Bayesian prior will be estimated after designing the study and doing a thought experiment or pilot study to estimate what the distribution of possible results will be in an actual study conditional on background knowledge. This prior distribution can be regarded as a posterior distribution formed by combining a uniform prior distribution conditional on the universal set of all numbers with an estimated likelihood distribution of the thought study result or pilot study result conditional on all possible true values. Each of those latter likelihoods is then multiplied by the likelihood of observing the actual study result conditional on all possible true results. These products are then normalised to give the Bayesian posterior probability of each possible true result conditional on the combined evidence of the result of the Bayesian thought experiment or pilot study and the actual study result. These thought experiments are done to estimate the power required to conduct the actual experiment (e.g. an RCT). However, it is a matter of opinion whether the result of the thought experiment or pilot study should be combined with the result of the actual study.

R_cubed · February 8, 2024, 11:44pm

Why can’t we stop referring to Fisher’s post data p-values from a particular data set as a “probability”, when they are also correctly referred to as percentiles (under an assumed model)?

The probability interpretation is relevant before the experiment, if one were doing an honest Neyman-Pearson design, where \alpha is traded off against \beta for the experiment at hand. In that case, it is preferable to think in terms of Z scores, rather than probabilities. This hypothetical “probability” is arguably meaningless if one doesn’t plan to repeat the experiment.

For the justification of thinking about p-values in terms of Z scores, see:

Kulinskaya, Staudte, Morgenthaler (2008) Meta Analysis: A Guide to Calibrating and Combining Statistical Evidence p. xiv

Raymond Hubbard & M. J Bayarri (2003) Confusion Over Measures of Evidence (p’s) Versus Errors (α’s) in Classical Statistical Testing, The American Statistician, 57:3, 171-178, DOI: 10.1198/0003130031856

This paper by @Sander is also worth reading.

Related thread:

https://discourse.datamethods.org/t/relating-s-values-to-other-measures-of-statistical-information/4293

HuwLlewelyn · February 9, 2024, 9:53am

I agree. The experiment in question is repeating the actual study (that had been completed) an infinite number of times, each repeat study being on an infinite number of subjects (not on the original number of 100 as in the example). The latter estimation is done before getting the result of what is clearly a hypothetical experiment or as philosophers would describe it, as a “thought experiment”. I am estimating the frequency of getting a result (i.e. greater than no difference) after doing this thought experiment. I would not disagree with any of the points that you make about the use of Z scores and interpretation of P values (I use them in my calculations). However, I am trying to add another perspective of interpreting the same data but framed in the concept of “replication”.

The concept of replication is central to medical and scientific thinking. A clinician will be anxious to estimate how often fellow clinicians would concur with her or his single observation (its interpretation would be a separate matter). A scientist will have the same concern about the average of a number of observations. In the latter case, the methods used in the many repeat clinical examinations or many repeat studies would have to be identical but the numbers used in the repeat studies to get averages need not be the same. In my example, each thought study involved an infinite number of observations. However, an estimate can be made of the frequency (and deduced probability) of repeating the study an infinite number of times with the same number (e.g. 100) observations each time and also doing this and expecting to get a one-sided P value again of 0.025.

My calculations are based on the expertise of statisticians and their traditional concepts. However, I am merely trying to interpret and explain the results of their calculations in the light of traditional scientific and medical concepts. My hope is that this will lead to clinicians and scientists having a better understanding and appreciation of statisticians’ expertise, leading to better collaboration to improve standards of medical practice and research.

R_cubed · February 9, 2024, 2:41pm

This is an asymptotic result of an imaginary experiment. It is not a probability that anyone is willing to bet on. Asymptotic results are not relevant to a data set at hand. A p-value for a collected data set is a proportion or percentile, perfectly descriptive of a divergence from a reference point. If you wish to refer to a future experiment of the exact same design, conditional on the observed test statistic, you should quote an interval, i.e. a two-sided p value of 0.05 is equivalent to a Z score of 1.96 +/- 1. Converting back to the p value scale (one-tailed), this is 0.17 (1 - 0.83) to 0.002 (1 - 0.998).
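As a rough illustration of the conversion described above, the Z score can be shifted by one standard error in each direction and mapped back to one-tailed p values. The helper name `one_tailed_p` is hypothetical; only the standard library is used.

```python
from statistics import NormalDist

std_normal = NormalDist()

# A two-sided p value of 0.05 corresponds to a Z score of about 1.96
z = std_normal.inv_cdf(1 - 0.05 / 2)

def one_tailed_p(z_score):
    """Upper-tail area of the standard normal beyond z_score."""
    return 1 - std_normal.cdf(z_score)

p_at_z_minus_1 = one_tailed_p(z - 1)  # Z of about 0.96
p_at_z_plus_1 = one_tailed_p(z + 1)   # Z of about 2.96

print(f"Z = {z:.2f}; one-tailed p from {p_at_z_minus_1:.2f} to {p_at_z_plus_1:.3f}")
```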

https://discourse.datamethods.org/t/bootstrap-on-regression-models/4081/4?u=r_cubed

HuwLlewelyn · February 10, 2024, 11:48am

Thank you for your comments. You doubted whether a theoretical probability based on my reasoning could be a good bet (and therefore tested empirically)! It actually can be tested using the Open Science Collaboration data as follows.

The posterior probability distribution of true results has a mean of 1.96 mmHg and a SEM of 1 mmHg. However, if we postulate a distribution of differences for the result of the second study with a SEM of 1 mmHg conditional on each point on the distribution of true values, then the overall distribution of the second study result conditional on the original observed result of 1.96 mmHg will have a variance of 1 + 1 = 2 and a SEM of √2 = 1.414 mmHg. Zero difference will now lie 1.96/1.414 = 1.386 SEMs away from the mean of 1.96 mmHg. This 1.386 is the Z score, so the probability of the second study result being greater than zero mmHg conditional on the first study result of 1.96 mmHg is 0.917 (which is what Killeen estimated it to be with his P-rep).
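The variance-summing step in this paragraph can be sketched in Python as follows. It assumes, as the post does, a SEM of 1 mmHg for both the distribution of true values and the second study; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

observed_mean = 1.96  # first-study mean difference (mmHg), from the post
sem = 1.0             # SEM assumed for each of the two distributions (mmHg)

# Variances add: true value given first result, then second result given true value
combined_sd = sqrt(sem**2 + sem**2)   # square root of 2

# Distance of zero from the mean, in units of the combined SD
z = observed_mean / combined_sd

# Probability that the second study mean is above zero
p_second_above_zero = NormalDist().cdf(z)

print(f"combined SD = {combined_sd:.3f}, Z = {z:.3f}, probability = {p_second_above_zero:.3f}")
```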

The range of the distribution of differences that provides a one-sided P value of 0.025 in a repeat study must be at least 1.96 SEMs away from zero, which is 1.96 x 1.414 = 2.772 mmHg away from zero. However, 2.772 mmHg is 2.772 mmHg - 1.96 mmHg = 0.812 mmHg away from the mean of 1.96 mmHg. This 0.812 mmHg is 0.812/1.414 = 0.574 SEMs away from the mean of 1.96 mmHg. This Z score of 0.574 suggests that 71.7% of the distribution is below 2.772 mmHg and 28.3% of the distribution is above 2.772 mmHg. In other words, when the original one-sided P value is 0.025, the proportion of the distribution 1.96 SEMs or more away from zero, giving a one-sided P value of 0.025 or less again, is 28.3%. If the original one-sided P value is osP, the proportion of repeat results giving a one-sided P value no greater than 0.025 in a repeat study is (using a shorter formula in an Excel spreadsheet): NORMSDIST(NORMSINV(1-osP)/2^0.5-1.96).

In the Open Science Collaboration the average P value from the original 97 articles was 0.028 two sided but only 36.1% (95% CI 26.6% to 46.2%) showed a two-sided P value of 0.05 or lower when repeated [1]. According to the above reasoning, for a one-sided P value of 0.028/2 = 0.014, the expected proportion showing a one-sided P value of 0.025 (or a two sided value of 0.05) in a repeat study is NORMSDIST(NORMSINV(1-0.014)/2^0.5-1.96) = 34.2%. This is consistent with the above observed result of the Open Science Collaboration study. This is therefore consistent with the theoretical probability of 1- one-sided P (e.g. 0.975) that a repeat study with an infinite number of observations will be replicated by having a mean greater than zero when the original P value was a one-sided P (e.g. 0.025). It also follows from this that the probability is 0.95 that the true result falls between the 95% confidence limits.
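For readers without Excel, the formula NORMSDIST(NORMSINV(1-osP)/2^0.5-1.96) can be translated directly into Python. This is a sketch of the post's own model rather than an independent derivation; the function name `replication_probability` and its `alpha_one_sided` parameter are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()

def replication_probability(one_sided_p, alpha_one_sided=0.025):
    """Proportion of repeat studies expected to reach one-sided P <= alpha,
    following the Excel formula quoted in the post."""
    z_original = _STD.inv_cdf(1 - one_sided_p)       # NORMSINV(1 - osP)
    z_threshold = _STD.inv_cdf(1 - alpha_one_sided)  # about 1.96
    return _STD.cdf(z_original / sqrt(2) - z_threshold)

# Original one-sided P of 0.025: about 28.3%
print(f"{replication_probability(0.025):.3f}")

# Open Science Collaboration average, 0.028 two-sided = 0.014 one-sided: about 34.2%
print(f"{replication_probability(0.028 / 2):.3f}")
```

The second figure lies inside the 95% CI of 26.6% to 46.2% quoted for the observed replication rate of 36.1%.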

The assumptions made during the above calculations are similar to those used to estimate the number of observations needed to get the power (e.g. 80%) of a study required to get a P value (e.g. of 0.025 one-sided or 0.05 two-sided). For example, let’s say we required a power of 0.8 for detecting a difference from zero with a one-sided P value of 0.025 or lower, and we assume the true mean to be exactly 1.96 mmHg and know that the standard deviation of the data is 10 mmHg. (Note that on this occasion, we do not estimate the probability distribution of the true means first, but assume the true mean to be 1.96 mmHg.) In keeping with previous reasoning, we expect the area under the curve of the distribution that gives a one-sided P value of 0.025 or lower to be 80% of the total, so the threshold for such a P value must lie NORMSINV(0.8) = 0.84 SEMs below the assumed true mean of 1.96 mmHg. As that threshold is itself 1.96 SEMs away from zero, the assumed true mean will have to be 0.84 + 1.96 = 2.80 SEMs above zero difference. The assumed true mean is 1.96 mmHg away from zero, so the required SEM for a power of 0.8 is 1.96 mmHg/2.80 = 0.7 mmHg. As the SD of the data is 10 mmHg, the number of observations required to provide a power of 0.8 is (SD/SEM)^2 = (10/0.7)^2 = 204. The power from 100 observations by the same calculation would be 50% of course.
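The sample size and power arithmetic above can be sketched in Python as follows. The function names `required_n` and `power_for_n` are hypothetical, and the SD is treated as known, as in the post. Because the code does not round the SEM to 0.7 mmHg, it returns a value slightly above 204.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()

def required_n(true_mean, sd, power=0.8, alpha_one_sided=0.025):
    """Number of observations for the given power, with a known SD."""
    z_alpha = _STD.inv_cdf(1 - alpha_one_sided)  # about 1.96
    z_power = _STD.inv_cdf(power)                # about 0.84 for 80% power
    required_sem = true_mean / (z_alpha + z_power)
    return (sd / required_sem) ** 2

def power_for_n(true_mean, sd, n, alpha_one_sided=0.025):
    """Power to reach one-sided P <= alpha with n observations."""
    sem = sd / sqrt(n)
    z_alpha = _STD.inv_cdf(1 - alpha_one_sided)
    return _STD.cdf(true_mean / sem - z_alpha)

print(f"n for 80% power: {required_n(1.96, 10):.1f}")
print(f"power with n = 100: {power_for_n(1.96, 10, 100):.2f}")
```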

[1] The Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015; 349(6251): aac4716.

f2harrell · February 10, 2024, 12:04pm

A very non-specific comment on Huw’s original post. I read it quickly but in at least two places I felt that the ideas of probabilities about true unknown values and the probabilities of observing sample means in a certain interval were being mixed. I may be wrong.

HuwLlewelyn · February 10, 2024, 12:18pm

It’s a tricky thing to write about and I may well have ‘miss-written’ in places! I have read it through but couldn’t find these possible errors. Please give the number of the suspect lines.

R_cubed · February 10, 2024, 2:34pm

I am adhering to Harry Crane’s distinction between “academic” probabilities (which have no real world consequences to the reporter when wrong) and real probabilities, that result in true economic gain or loss. This is independent of the idea of verifying a probabilistic claim.

Harry Crane (2018). The Fundamental Principle of Probability. Researchers.One . https://researchers.one/articles/18.08.00013v1

My complaint is that p values themselves are not the appropriate scale when comparing the information contained in multiple studies when the point of reference (i.e. the “null hypothesis”) is not true (which it never is exactly). This has led others to suggest different scales that prevent the confusion of model-based frequency considerations (sampling distribution) of the experimental design with the combined, scientific consideration of the information collected with prior information (Bayesian posterior).

As I noted above, Kulinskaya, Staudte, and Morgenthaler advise reporting variance stabilized t-stats, along with the standard error (+/- 1) to emphasize the random nature of evidence in this perspective. When the sample size is large enough, these stats closely approximate the standard normal distribution, N(0,1) when the null is true. Any deviation from the null reference model is simply a shift from that distribution a.k.a the non-centrality parameter.

The merit of this proposal is that it connects Fisher’s information-theoretic perspective (post-data reporting of a particular study) with Neyman-Pearson design considerations (i.e. large-sample results), without the cognitive pathologies noted in the literature for close to 100 years now. Frequentist statistics looks less like an ad hoc collection of algorithms in this case.

The other proposal by @Sander is the base 2 log transform of the p value, which provides an information measure in bits. This is merely a different scale compared with Fisher’s natural log transform of the p-value for meta-analytic purposes.

Note that both transformations permit the valid combination of information from multiple experiments.
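A minimal sketch of the base 2 log transform mentioned above (the S-value of the related thread linked earlier), assuming it is defined as the negative base 2 logarithm of the p value. The function name `s_value` is hypothetical.

```python
from math import log2

def s_value(p):
    """Information against the test model in bits: -log2(p)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -log2(p)

for p in (0.5, 0.05, 0.025, 0.005):
    print(f"p = {p}: {s_value(p):.2f} bits")
```

On this scale p = 0.05 carries a little over 4 bits of information, about as surprising as four heads in a row from a fair coin.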

Regarding uniform priors:

https://discourse.datamethods.org/t/what-are-credible-priors-and-what-are-skeptical-priors/580

Then too a lot of Bayesians (e.g., Gelman) object to the uniform prior because it assigns higher prior probability to β falling outside any finite interval (−b,b) than to falling inside, no matter how large b; e.g., it appears to say that we think it more probable that OR = exp(β) > 100 or OR < 0.01 than 100>OR>0.01, which is absurd in almost every real application I’ve seen.

On a Bayesian interpretation of P values:

f2harrell · February 10, 2024, 11:03pm

Sorry I couldn’t name specifics and am too far behind on other things to give it justice.

HuwLlewelyn · February 11, 2024, 9:24am

Thank you for your comments. I would point out that the probability of replication beyond some threshold (A) after an infinite number of observations, or with the same number of observations (B) without or (C) with a P value at least as low as the P value in the original study, is calculated using the same concepts as those used to calculate the number of observations to provide a power (e.g. of 80%). Both the latter and the probability of replication (and P value) are used to make practical decisions. The approach in my post also explains the mystery of why method C appears to give such a low frequency of replication (e.g. 36%), as shown by the Open Science Collaboration.

HuwLlewelyn · February 11, 2024, 9:32am

The problem may be that I did not make it clear that the distribution of SEMs can be formed in 3 ways (I have added comments against these 3 different distributions in the original post):

1. The distribution of probabilities of all possible true means conditional on the actually observed mean
2. The distribution of the likelihoods of possible observed means conditional on any single true mean (e.g. the null hypothesis)
3. The distribution of the likelihoods of observing the actual observed mean conditional on each of the possible true means

It is assumed (A) that distribution 1 is Gaussian (or some other symmetrical distribution) with the observed mean. It is also assumed (B) that distribution 2 is Gaussian (or the same shaped distribution as in A) and has the same SEM but any single mean (e.g. the null hypothesis). Distributions 1 and 3 are assumed to be the same, with the same mean and SEM, so that when X_i is any particular possible true mean and Y is the single actually observed mean, then p(X_i|Y) = p(Y|X_i) and so by Bayes rule, p(X_i) = p(Y) for any X_i and therefore p(X_i) = p(X_{i+1}) = p(Y). In other words, the latter are all the same so that the prior probability distributions of X_i are uniform and equal to p(Y). This guarantees that the prior probability of seeing any observed result above a value X is the same as the prior probability of any true result above a value X. It also guarantees that for any null hypothesis Xnull that p(≤Xnull|Y) = p(>Y|Xnull) and that p(>Xnull|Y) = p(≤Y|Xnull).
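The Bayes rule step in this argument can be written out explicitly. This is only a restatement in the same symbols (X_i a possible true mean, Y the observed mean), under the stated assumption that distributions 1 and 3 coincide and that p(Y|X_i) is non-zero:

$$
p(X_i \mid Y) = \frac{p(Y \mid X_i)\, p(X_i)}{p(Y)}, \qquad
p(X_i \mid Y) = p(Y \mid X_i) \;\Longrightarrow\; \frac{p(X_i)}{p(Y)} = 1 \;\Longrightarrow\; p(X_i) = p(Y) \ \text{for every } i .
$$

Since the right-hand side does not depend on i, all the p(X_i) are equal, which is the uniformity claimed.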

R_cubed · February 11, 2024, 5:11pm

I don’t disagree with your arithmetic, I am only pointing out problems with the interpretation. As @Sander has pointed out numerous times, frequentist model probabilities do not capture all uncertainties, and this presentation can be misleading in that it discourages not only thinking about the sampling model, but the possibility of being fed intentionally misleading information.

Critical thinking needs to be enhanced with quantitative skills. To evaluate evidence already collected is an inherently Bayesian activity:

In this commentary on an Efron paper from 2010, he states the following:

First, I disagree that frequentism has supplied a good set of working rules. Instead, I argue that frequentism has been a prime source of reckless overconfidence in many fields (especially but not only in the form of 0.05-level testing; … The most aggressive modeling is that which fixes unknown parameters at some known constant like zero (whence they disappear from the model and are forgotten), thus generating overconfident inferences and an illusion of simplicity; such practice is a hallmark of conventional frequentist applications in observational studies.

These probabilities (False Discovery Rates) often require the use of Bayes theorem in order to be computed, and that presents special problems. Once data are observed, it is the false discovery rates that are the relevant assessments of uncertainty. The original frequency properties of the study design - the error rates - are no longer relevant. Failure to distinguish between these evidential metrics leads to circular reasoning and irresolvable confusion about the interpretation of results as statistical evidence.

A proposal deserving serious consideration that reconciles frequentist estimates with Bayesian scientific concerns is the Reverse Bayes methodology, originally proposed by IJ Good and resurrected by Robert Matthews and Leonhard Held.

Matthews, Robert A. J. (2018) Beyond ‘significance’: principles and practice of the Analysis of Credibility. R. Soc. open sci. 5: 171047.

Held L, Matthews R, Ott M, Pawel S. Reverse-Bayes methods for evidence assessment and research synthesis. Res Syn Meth. 2022; 13(3): 295-314.

For an application, see this post:
https://discourse.datamethods.org/t/frequentist-vs-bayesian-debate-applied-to-real-world-data-analysis-any-philosophical-insights/1695/7?u=r_cubed

Pavlos_Msaouel · February 11, 2024, 5:38pm

@HuwLlewelyn, can you take a look for example at section 7.1 here? I suspect that your original, highly interesting, post makes the leap of faith described there (even before your post starts to discuss replications), hence @f2harrell’s discomfort and why the probability distribution does not use Bayes rule. As alluded to throughout this thread, this is indeed a topic that Fisher was very interested in. Curious to see how your approach addresses this leap of faith without priors?

HuwLlewelyn · February 11, 2024, 9:48pm

I agree that I focus on only one aspect of interpreting data when I consider P values, confidence intervals and probabilities of replication. I consider only the error / random / sampling / stochastic issues of the data in front of me. I do not address the other issues causing uncertainty such as past research, the issues of methodology, bias, dishonesty or inaccuracy of reporting, the plausibility of the scientific hypotheses being investigated and so on. The same issues arise in diagnostic thinking, investigation and decisions applied to a single individual. To my mind these have to be dealt with differently, using probabilistic models of reasoning different from that of Bayes rule with an assumption of statistical independence.

I have always regarded P values and confidence intervals and probabilities of replication in this ‘stand-alone’ way and assume that in any replication study, variability due to the above issues has been eliminated by repeating a study in exactly the same way and with the same number of observations and documenting the results accurately. This is what the Open Science Collaboration did and it was this I modelled in post number 5.

HuwLlewelyn · February 11, 2024, 10:06pm

As I understand section 7.1, it considers the issue of uniform prior probabilities. My ‘leap of faith’ was that I regard fitting a Gaussian or other distribution to the observed data by finding its mean and standard deviation (SD) as estimating directly the distribution of N possible probabilities of true values X_i conditional on the data mean Y and the SD when i = 0 to N (in a continuous Gaussian distribution, N approaches infinity). Assuming that X_i is some null hypothesis, and assuming that the SD and SEM of its distribution are equal to those of the observed data, implies immediately from Bayes rule that for any i, p(X_i) = p(X_{i+1}) = p(Y) when Y is the mean of the observed data. The latter implies a uniform distribution of all p(X_i). In other words, my ‘leap of faith’ is to make the same assumption as Fisher et al. made by implication!

Pavlos_Msaouel · February 11, 2024, 10:14pm

Yup. That’s my sense as well. There is a whole world of ongoing methodology research related to all this, with a nice review here, and additional generalizations that can hold Bayes as a special case.

HuwLlewelyn · August 2, 2024, 10:43am

I would like to add a corollary to my earlier post of the 8th of February.

If 100 patients had been in a double-blind cross-over randomized controlled trial and 58 out of those 100 individuals had a BP higher on control than on treatment, then knowing only that an individual was one of those in the study, the probability conditional on the entry criterion of that individual patient having a BP higher on control than on treatment would be 58/100 = 0.58. The 95% confidence limits for this probability of 0.58 from the binomial theorem would be 0.483 and 0.677.

If the average BP difference between a pair of observations on treatment and control was 2 mmHg and the standard deviation of the differences was 10 mmHg, then the area under the bell-shaped Gaussian distribution above 0 mmHg in Figure 1 in post 1 would be the area above a point 0.2 SD below the mean, corresponding to 58% of the total area of the Gaussian distribution. From this again we see that any randomly selected study individual has a probability of 0.58 of having a BP difference greater than zero. However, the SEM of this mean would be 10/√100 = 1 mmHg, so that the 95% confidence interval for the mean would be 2 mmHg +/- 1.96 mmHg or, in units of SD, 0.2 SD +/- 0.196 SD, i.e. from 0.004 SD to 0.396 SD. For a probability of 0.58, this corresponds to a 95% confidence interval of 0.502 to 0.654.

Note that the 95% confidence interval of 58/100 = 0.58 from the binomial theorem is wider at 0.483 to 0.677. This was based on dichotomising the results into those greater than 0 mmHg and those not. However, by using the actual results (and according to @Stephen, not succumbing to the awful habit of dichotomising continuous data), we get a tighter 95% confidence interval. So does this mean that if we do have to derive a proportion by dichotomising measurement results, we should estimate the 95% confidence interval for that proportion by first estimating the 95% confidence intervals for the measurement results?
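The two intervals being compared can be reproduced with the sketch below. It assumes that the binomial interval quoted is the normal-approximation (Wald) interval, which matches the figures 0.483 and 0.677, and it treats the SD of 10 mmHg as known, so the interval from the continuous data reflects only uncertainty in the mean. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()
z95 = std_normal.inv_cdf(0.975)  # about 1.96
n = 100

# (a) Dichotomised data: Wald interval for the proportion 58/100
p_hat = 58 / n
half_width = z95 * sqrt(p_hat * (1 - p_hat) / n)
binomial_ci = (p_hat - half_width, p_hat + half_width)

# (b) Continuous data: 95% limits for the mean, mapped to P(difference > 0)
mean_diff, sd = 2.0, 10.0   # mmHg, from the post; SD assumed known
sem = sd / sqrt(n)
mean_ci = (mean_diff - z95 * sem, mean_diff + z95 * sem)
gaussian_ci = (std_normal.cdf(mean_ci[0] / sd), std_normal.cdf(mean_ci[1] / sd))

print(f"binomial: {binomial_ci[0]:.3f} to {binomial_ci[1]:.3f}")
print(f"Gaussian: {gaussian_ci[0]:.3f} to {gaussian_ci[1]:.3f}")
```

If the SD were also estimated from the data, the second interval would be somewhat wider than shown here.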

f2harrell · August 2, 2024, 1:08pm

Great points Huw. The general way to say this is that when we have a statistical model we want to use maximum likelihood estimation, penalized MLE, or Bayesian modeling to get parameter estimates. When Y is BP, the MLE of a probability involving Y is a nonlinear function of the MLEs of the parameters of the continuous distribution. When Y is Gaussian the MLE of a probability of exceeding some cutoff is a function of the raw mean and SD, not of the proportion from dumbed-down data. There is a paper somewhere that computes the relative inefficiency of the proportion.

The general principle is that you don’t dumb-down data to make them look like the final estimate of interest. You analyze the raw data then restate the statistical parameters to get any clinical readout wanted.

HuwLlewelyn · September 23, 2024, 7:05pm

I would like to suggest another corollary to this post 1: (https://discourse.datamethods.org/t/some-thoughts-on-uniform-prior-probabilities-when-estimating-p-values-and-confidence-intervals/7508?u=huwllewelyn). I would like to suggest that the replication crisis may be due to studies having insufficient power to detect replication with P values of ≤ 0.05 two sided or ≤ 0.025 one sided in the second (replicating) study.

In the Open Science Collaboration study, 97 different experiments in psychology were selected that had been planned and conducted well and had all resulted in two-sided P values of ≤ 0.05. The average P value in their results was 0.028 two-sided. When all 97 were repeated, only 35/97 (36.1%) were replicated with P values of ≤ 0.05 two-sided a second time.

Assume that another replication study had been conducted based on 97 trials but of the same nature as the study in post 1 above. At the planning stage an estimated ‘Bayesian’ distribution would have a standard deviation of 10 mmHg and a mean BP difference of 2.2 mmHg (the latter two estimates corresponding to a one-sided P value of 0.014). From this information the number of paired observations required to get an 80% power of getting a P value of ≤ 0.025 one-sided in the first real study was about 163. However, the frequency of replication with these parameters was only 36.7%.

The calculations below are based on the assumption that the prior probability distribution of possible true results of a study (i.e. if the study was continued until there were an infinite number of subjects) conditional on the universal set of all numbers and prior to any knowledge at all of the nature of the proposed study is uniform. This is in contrast to a Bayesian prior probability that is also based on knowledge of the proposed design of the study. The latter allows the parameters such as standard deviation and mean differences to be estimated. The uniform prior probability conditional on the universal set of real numbers means that the probability and likelihood distributions regarding the true results are the same.

The calculation of this replication frequency of 36.7% is based on the added effect of variation represented by the above three distributions. Each distribution is based on an estimated mean difference of 2.2 mmHg and a SEM of 10/√163 = 0.783 mmHg, so that the variance is 0.783^2 = 0.613. These would be the parameters of an estimated ‘Bayesian’ probability distribution A of the expected true values conditional on the above parameters. Distribution B of the first study result conditional on each possible true value will be assumed to be the same as distribution A. The errors represented by these 2 distributions have to be summated, so the variance of their combined distribution will be the sum of their variances, which will be double the variance of distribution A: 2 x 0.613 = 1.226. The probability distribution of the estimated mean results of the second replicating study conditional on each possible result of the first study will also be the same as distribution A. The errors have to be summated again so that the resulting variance will be triple the variance of distribution A: 3 x 0.613 = 1.839.

The Excel calculation of the probability of replication of the second study having a P value of 0.025 or less again based on a sample size of 163 and the above estimated Bayesian parameters of distribution A applied 3 times (see ‘*3’ in the Excel formula below) is therefore:
=NORMSDIST(2.2/(((10 /163 ^0.5)^2)*3 )^0.5+NORMSINV(0.025 )) = 0.367 (Equation 1)

For the purpose of replication, a sample size of 163 therefore gives rise to a severely underpowered study. However, if the sample size is tripled from 163 to 489 to deal with the summation effect of three distributions, then according to the above model, we can expect a power of 80% to detect a one sided P value of ≤ 0.025 during replication and a replication probability of 0.801 as follows:
=NORMSDIST(2.2/(((10 /489 ^0.5)^2)*3 )^0.5+NORMSINV(0.025 )) = 0.801 (Equation 2)

Based on the above assumptions, the probability of replication one and two sided will be the same. Therefore, I would like to suggest on the basis of the above argument that for a power of 80% to get replication with a two sided P value of ≤ 0.05 again, the Open Science Collaboration study would have required 3 times the number of samples.
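Equations 1 and 2 can be reproduced outside Excel with the sketch below, a direct translation of the formulas under the post's own model of three summed distributions. The function name `replication_power` and the parameter `n_distributions` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()

def replication_power(mean_diff, sd, n, n_distributions=3, alpha_one_sided=0.025):
    """Probability of one-sided P <= alpha when the variance of one SEM
    distribution is summed n_distributions times (post's Equations 1 and 2)."""
    summed_variance = (sd / sqrt(n)) ** 2 * n_distributions
    z = mean_diff / sqrt(summed_variance)
    # NORMSINV(0.025) is negative, so it is added, as in the Excel formula
    return _STD.cdf(z + _STD.inv_cdf(alpha_one_sided))

print(f"n = 163: {replication_power(2.2, 10, 163):.3f}")   # Equation 1
print(f"n = 489: {replication_power(2.2, 10, 489):.3f}")   # Equation 2
```

Calling `replication_power(2.2, 10, 163, n_distributions=1)` gives the same value as Equation 2, because tripling the sample size exactly offsets tripling the variance. That is the sense in which three times the number of observations restores the conventional 80% power in this model.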

f2harrell · September 24, 2024, 11:28am

Nice thinking. Don’t forget two simple things: Most initial studies are undersized, and p < 0.05 is not a lot of evidence.