Some thoughts on uniform prior probabilities when estimating P values and confidence intervals

I call P( z_repl > 2.77 | z=1.96) = 0.283 the probability of replication with P≤0.025 one sided or P≤0.05 two sided when the null hypothesis is zero and the variance is 1+1 and the SEM is 1.414.

Why should the z-statistic of the replication exceed 2.77 instead of the usual 1.96?

By working with the SNR and the two z-statistics the SEM (which I called s) becomes irrelevant. All that matters is that if the SNR has the uniform distribution, then the conditional distribution of the SNR given z is normal with mean z and standard deviation 1. It then follows that the conditional distribution of z_repl given z is normal with mean z and standard deviation sqrt(2). The following are then true:

P( z_repl > 1.96 | z=1.96) = 0.5
P( z_repl > 0 | z=1.96) = 0.917
P( z_repl > 2.77 | z=1.96) = 0.283

The first is the probability that the replication study reaches statistical significance if the original study had p=0.025 (one-sided) or p=0.05 (two-sided). The second is the probability that the sign of the observed effect in the replication is the same as that in the original study. I fail to see your interest in the probability that z_repl exceeds 2.77.
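These three conditional probabilities are easy to check numerically. A minimal Python sketch; the helper `norm_sf` is just the upper-tail normal probability, playing the role of R's `1 - pnorm`:

```python
from math import erfc, sqrt

def norm_sf(x, mean=0.0, sd=1.0):
    """Upper-tail probability of N(mean, sd^2), i.e. 1 - pnorm(x, mean, sd)."""
    return 0.5 * erfc((x - mean) / (sd * sqrt(2)))

# Under the (improper) uniform prior for the SNR, z_repl | z ~ N(z, 2),
# i.e. mean z and standard deviation sqrt(2).
z, sd = 1.96, sqrt(2)

p_sig  = norm_sf(1.96, z, sd)  # replication reaches z_repl > 1.96
p_sign = norm_sf(0.0, z, sd)   # replication has the same sign
p_277  = norm_sf(2.77, z, sd)  # replication has z_repl > 2.77

print(round(p_sig, 3), round(p_sign, 3), round(p_277, 3))  # 0.5 0.917 0.283
```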


The z-statistic of the replication must be equal to or greater than 2.77 (1.96 × 1.414) because the variance is 1+1 and the SEM is √(1+1) = 1.414, to take account of variability in both the first and the second (replicating) study; only then does the null hypothesis of zero yield P ≤ 0.05 two sided or P ≤ 0.025 one sided.

PS.

This will only be true if you use an SEM of 1.414, not of 1.

With all due respect, I believe you are confused. If the null hypothesis is true (the effect is zero) then

P(z_repl > 2.77 | z=1.96, SNR=0) = P(z_repl > 2.77 | SNR=0) = 0.0028

This will only be true if you use an SEM of 1.414, not of 1.

No, it doesn’t depend on the SEM (which I called s). I assume only that the SNR has the uniform distribution and that the two z-statistics (z and z_repl) share the same SNR.


The BP value of 2.77 mm Hg is 2.77 − 1.96 = 0.81 mm Hg away from the mean of 1.96 mm Hg. Because the SEM is 1.414, this 0.81 mm Hg represents a z-score of 0.81/1.414 = 0.57. This leaves a tail of 0.283 to the right of the black arrow for the ‘black’ distribution in the Figure below from my earlier post. That tail is not 0.0028 of the total area of the ‘black’ distribution.

Which null hypothesis do you want to test? Do you want to test if SNR=0 (assuming both the original and the replication studies have the same SNR) or do you want to test if both the original and the replication studies have the same SNR?


When the observed result has a mean of 1.96 mm Hg, an SD of 10 mm Hg and 100 observations, then P = 0.025 one sided and 0.05 two sided based on a null hypothesis of 0 mm Hg. By assuming a uniform prior, we can also estimate that the possible true means have a Gaussian distribution with its most probable value at the observed 1.96 mm Hg and an SEM of 10/√100 = 1.

If we were to repeat the study in an identical way a large number of times, we would expect a PROBABILITY distribution of possible true means with the most frequent mean being 1.96 mm Hg and an SEM of 1. For EACH of the many possible true means from this probability distribution, there would also be a LIKELIHOOD distribution of possible study results with an SEM of 1. The overall distribution of the repeat study results can be estimated by a convolution of the probability distribution and the likelihood distribution.

The overall distribution from this huge number of repeat studies would have a variance of 1+1 and an SEM of 1.414. By using the same null hypothesis of zero but an overall distribution with an SEM of 1.414 (instead of 1), there is a probability of 0.283 of getting P ≤ 0.025 one sided or P ≤ 0.05 two sided. When the observed result in the second study is a difference of 2.77 mm Hg, then P = 0.025 one sided or 0.05 two sided exactly. When the mean difference in the second study is larger than 2.77 mm Hg, then P < 0.025 one sided or 0.05 two sided.
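The two-stage argument above (a Gaussian distribution of possible true means around 1.96 with SEM 1, then a Gaussian likelihood with SEM 1 around each possible true mean) can be illustrated with a small Monte Carlo sketch; the sample size of one million is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Stage 1: possible true means, Gaussian around the observed 1.96 with SEM 1
true_means = rng.normal(1.96, 1.0, n)
# Stage 2: result of an identical repeat study, Gaussian around each true mean
repeats = rng.normal(true_means, 1.0)

print(repeats.std())            # ~1.414: the convolution has SEM sqrt(1+1)
print((repeats > 2.77).mean())  # ~0.283: repeat reaches P <= 0.025 one sided
                                # against the null with SEM 1.414
```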

The p-value is calculated assuming that the null hypothesis is true, which in this case is H0: SNR=0. Assuming also that the two studies have the same SNR, we have

P(z_repl > 2.77 | z=1.96, SNR=0) = P(z_repl > 2.77 | SNR=0) = 0.0028.

If we assume that the SNR has the (improper) uniform distribution, we have

P( z_repl > 2.77 | z=1.96) = 0.283.

Again assuming only that the two studies have the same SNR, we have - and this may be the source of your confusion -

P( z - z_repl > 2.77) = 0.025.

So, checking if |z - z_repl| > 2.77 is a level 5% test of the hypothesis that the two SNRs are the same.
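The three distinct probabilities in this post can be reproduced with a short Python sketch (`norm_sf` is a stand-in for R's `1 - pnorm`):

```python
from math import erfc, sqrt

def norm_sf(x, mean=0.0, sd=1.0):
    """Upper-tail probability of N(mean, sd^2), i.e. 1 - pnorm(x, mean, sd)."""
    return 0.5 * erfc((x - mean) / (sd * sqrt(2)))

# Under H0: SNR = 0, z_repl is standard normal and independent of z
p_null = norm_sf(2.77)                 # ~0.0028

# Under the improper uniform prior on the SNR, z_repl | z ~ N(z, 2)
p_unif = norm_sf(2.77, 1.96, sqrt(2))  # ~0.283

# If the two studies merely share the same SNR, z - z_repl ~ N(0, 2)
p_diff = norm_sf(2.77, 0.0, sqrt(2))   # ~0.025

print(round(p_null, 4), round(p_unif, 3), round(p_diff, 3))  # 0.0028 0.283 0.025
```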

Since you keep reiterating what you wrote in section 6.1 of your arXiv paper, I don’t feel we’re making much progress, so I’ll leave this discussion. As a final remark, I would like to stress that the uniform distribution is almost always a very poor choice of prior.


OK. I have set out assumptions and deduced conclusions from them. You make different assumptions and not surprisingly, deduce different conclusions. My conclusions are consistent with observed frequencies of replication. Your conclusions and those of others based on your assumptions are not consistent with observed frequencies of replication and you and others are surprised about that and regard it as a replication crisis. I am not surprised and don’t regard it as a crisis because I can explain it.

I regard uniform priors as the prior probabilities of possible true results and possible observed results of any study conditional on the universal set of all numbers (i.e. before any information about the nature of the study is known). These priors are uniform. By contrast, a Bayesian prior or any other prior (e.g. based on a pilot study) is conditional on prior information about the study that is about to take place and as you say, cannot be assumed to be uniform.

Thank you for the discussion.

The probabilities I stated in my previous message are based on your assumptions: the (improper) uniform prior for the SNR, and two z-statistics (z and z_repl) that are each the sum of the same SNR and an independent standard normal error. The rest is basic probability theory. I’m just being more precise than you about what those probabilities refer to.

The conclusions in my paper with Goodman from 2022 are based on data from the Cochrane Database of Systematic Reviews, and are therefore in agreement with observation.

I have no idea why you’d think I’m surprised about the observed frequencies of replication. I’m not.

I think after all that we are in agreement.

I agree that you were probably not surprised by replication frequencies after performing the study described in your paper with Goodman from 2022. However, you obtained the distribution of Z (the sum of the SNR and standard normal noise) from a convolution of the distribution of the SNR with the standard normal density. The SNR distribution itself was calculated from the Cochrane database, as represented by the estimated density of the SNR in Figure 1, which was obtained by de-convolution, i.e. by subtracting 1 from the variance of each of the mixture components.

Table 1: Comparison of predictive powers from Goodman’s 1992 estimates, uniform priors, the post-Cochrane analysis in the 2022 paper, and from HL doubling the variance.

| Column 1: P value | Column 2: Z value | Column 3: Goodman’s 1992 paper | Column 4: Uniform prior | Column 5: Post Cochrane | Column 6: HL with variance × 2 |
|---|---|---|---|---|---|
| 0.5 | 0.67 | | 0.18 | 0.11 | 0.07 |
| 0.3 | 1.04 | | 0.26 | 0.15 | 0.11 |
| 0.1 | 1.64 | 0.37 | 0.41 | 0.23 | 0.21 |
| 0.05 | 1.96 | 0.50 | 0.50 | 0.29 | 0.28 |
| 0.03 | 2.17 | 0.58 | 0.56 | 0.34 | 0.34 |
| 0.01 | 2.58 | 0.73 | 0.67 | 0.44 | 0.44 |
| 0.005 | 2.81 | 0.80 | 0.73 | 0.50 | 0.51 |
| 0.001 | 3.29 | 0.91 | 0.83 | 0.64 | 0.64 |

I got a very similar result by a convolution of the distribution of the observed study with the expected distribution (of equal variance) of the second study, i.e. by doubling the variance of the observed study (compare Columns 5 and 6 in Table 1). Prior to the analysis of the Cochrane data, Goodman’s 1992 estimate of the probability of a statistically significant result in a duplicate experiment is shown in Column 3. This is similar to the result obtained by assuming a uniform prior, as shown in Column 4. Both overestimate the probabilities that I calculated and also those that you and Goodman calculated for your 2022 paper.
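Columns 4 and 6 of Table 1 can be regenerated from the two formulas discussed in this thread; a short Python sketch, with `norm_sf` standing in for R's `1 - pnorm`:

```python
from math import erfc, sqrt

def norm_sf(x, mean=0.0, sd=1.0):
    """Upper-tail probability of N(mean, sd^2), i.e. 1 - pnorm(x, mean, sd)."""
    return 0.5 * erfc((x - mean) / (sd * sqrt(2)))

z = [0.67, 1.04, 1.64, 1.96, 2.17, 2.58, 2.81, 3.29]

# Column 4: uniform prior, P(z_repl > 1.96 | z) with z_repl | z ~ N(z, 2)
col4 = [round(norm_sf(1.96, zi, sqrt(2)), 2) for zi in z]
# Column 6: doubled-variance rule, P(z_repl > 2.77 | z)
col6 = [round(norm_sf(2.77, zi, sqrt(2)), 2) for zi in z]

print(col4)  # [0.18, 0.26, 0.41, 0.5, 0.56, 0.67, 0.73, 0.83]
print(col6)  # [0.07, 0.11, 0.21, 0.28, 0.34, 0.45, 0.51, 0.64]
```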

These overestimates in Columns 3 and 4 are why people in general (and perhaps Goodman) were surprised by the comparatively low actual frequency of replication that precipitated the ‘replication crisis’. You and Goodman in your paper now show that this is what one should expect on the basis of the Cochrane data. However, I also show that this is what we should have expected, based on my assumptions and reasoning, without analysing the Cochrane data. So, basically, we are in agreement.

In order to compute a “replication probability” of an already conducted study, what are you conditioning on?

I think @EvZ was pretty clear in the various types of hypotheses one could ask, their statistical assumptions, and the computations.

Perhaps someone could correct me if I’m wrong, but this is how I understand his calculations for a replication study where “replication” means the same sign and an SNR \ge Z_1.

Addendum: The correction to my reasoning is found in the Steve Goodman paper mentioned above. A true Bayesian answer results in a reduction (shrinkage) of the replication probabilities relative to the assumption Z_{init} = SNR_{true}, even with a uniform prior; my reasoning assumed the initial estimate equals the true value, Z_{init} = SNR_{true}, so I have re-named it a “naive frequentist” analysis.

Goodman, S. N. (1992). A comment on replication, p‐values and evidence. Statistics in medicine, 11(7), 875-879. (link)

Naive Frequentist Analysis

Before study 1:

  1. Prior: None (improper uniform over \Re);
  2. Sampling distribution: N(0,1);
  3. Predictive distribution: Normal, with parameters estimated from data.

After Study 1, before replication :

  1. Prior: N(\theta, 1), \theta = 1.96 \pm 1,
  2. Sampling Distribution N(0,1);
  3. Predictive Distribution: N(\theta, 1), \theta = 1.96 \pm 1.

The N(\theta,1), \theta \ne 0 gives a (naive) 68% confidence interval for the SNR after the first study.

After the first study, the naive frequentist observation of the SNR (Z score) provides no shrinkage to the observed results, leading someone who was ignorant before seeing the data to conclude that the sign of the parameter is in the same direction as the sign of the estimate, and his/her “best guess” at the true SNR is the MLE of the collected data, which is equal to the 50th percentile (ie. p=0.5) under the assumed sampling model. This is only credible with a large amount of data.

Even with the discounting (shrinkage) of the probability of replication provided by the uniform prior, it is very unrealistic, as was noted here:

Then too a lot of Bayesians (e.g., Gelman) object to the uniform prior because it assigns higher prior probability to β falling outside any finite interval (−b,b) than to falling inside, no matter how large b; e.g., it appears to say that we think it more probable that OR = exp(β) > 100 or OR < 0.01 than 100>OR>0.01, which is absurd in almost every real application I’ve seen.

The empirically derived priors place most of the probability mass near the center of the distribution (closer to 0), and provide better discounting (shrinkage) of the replication probabilities, improving prediction of future studies.

If you also notice, this definition of a replication study can simultaneously have a Z_{rep}:

  1. attain a Z_{rep} \ge Z_1
  2. have a \frac{Z_{max} - Z_{min}}{\sqrt{2}} \lt 1 which would show the estimates are quite compatible.

This definition of “replication” ignores 1 side of a symmetrical distribution.


I’m surprised the OP’s method gives numerically similar results to the method from my 2022 paper with Steve Goodman without using our data.

Apologies for the following long post. I hope at least it’s educational. To make the post self-contained, I’ll re-iterate my notation. Let beta denote the (unobserved) true effect of the treatment and let b be a normally distributed estimator with mean beta (i.e. b is unbiased) and (known) standard error s. Define the z-statistic z=b/s and the signal-to-noise ratio SNR=beta/s. The z-statistic has the normal distribution with mean SNR and standard deviation 1. In other words, z is the sum of the SNR and an independent standard normal error. Let z_repl be the sum of the same SNR, but another independent standard normal error.

We’re interested in the probability of “successful replication” after having observed (or “conditional on”) the z-statistic of the original study. To be more specific, we want to evaluate

P(z_repl > 1.96 | z).

This probability depends on the distribution of the SNR. Let’s assume that the SNR has the normal distribution with mean zero and standard deviation sigma. Well known theory about bivariate normal distributions tells us that the conditional distribution of the SNR given z is again normal with mean z*sigma^2/(sigma^2 + 1) and variance sigma^2/(sigma^2 + 1).

It follows that the conditional distribution of z_repl given z is also normal with mean z*sigma^2/(sigma^2 + 1) and variance sigma^2/(sigma^2 + 1) + 1.

If I understand correctly, the OP assumed the uniform distribution for the SNR. That’s like a normal distribution with a very large sigma. In that case, the factor sigma^2/(sigma^2 + 1) is essentially 1. So, the conditional distribution of z_repl given z is normal with mean z and variance 1+1=2. For various values of z, the conditional probability that z_repl exceeds 1.96 is

z=c(0.67,1.04,1.64,1.96,2.17,2.58,2.81,3.29)
1- pnorm(1.96,z,sqrt(2))

0.18 0.26 0.41 0.50 0.56 0.67 0.73 0.83

In my paper with Steve Goodman, we used a distribution for the SNR that we estimated from a large collection of clinical trials. It’s not a normal distribution, but it can be reasonably approximated by a normal distribution with mean 0 and standard deviation sigma=1.5. The conditional probability that z_repl exceeds 1.96 becomes

m=z*1.5^2/(1.5^2+1)
v=1.5^2/(1.5^2+1) + 1
1- pnorm(1.96,m,sqrt(v))

0.13 0.17 0.26 0.32 0.36 0.45 0.50 0.60

Now let’s turn to the OP’s method. He uses the uniform distribution for the SNR, but gets similar results to using N(0,1.5). How is that possible? Well, he seems to define replication success differently, namely as

P(z_repl > 2.77 | z)

It is very unusual to define one-sided significance at level 0.025 in this way. Using this definition together with the uniform prior on the SNR, we get

1 - pnorm(2.77,z,sqrt(2))

0.07 0.11 0.21 0.28 0.34 0.45 0.51 0.64

However, this is the same as

1 - pnorm(1.96,z/sqrt(2),1)

0.07 0.11 0.21 0.28 0.34 0.45 0.51 0.64

At least we’re talking about the probability of exceeding 1.96 again! Now, 1/sqrt(2) = 0.71 happens to be very close to the shrinkage factor of 1.5^2/(1.5^2+1) = 0.70 which Goodman and I used. Moreover, the conditional standard deviation which we used is also not too different from 1. I believe this explains the coincidental numerical agreement between our results.
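The near-coincidence of the two expressions (2.77/sqrt(2) = 1.9587 versus 1.96) can be checked directly; a small Python sketch, with `norm_sf` standing in for R's `1 - pnorm`:

```python
from math import erfc, sqrt

def norm_sf(x, mean=0.0, sd=1.0):
    """Upper-tail probability of N(mean, sd^2), i.e. 1 - pnorm(x, mean, sd)."""
    return 0.5 * erfc((x - mean) / (sd * sqrt(2)))

z = [0.67, 1.04, 1.64, 1.96, 2.17, 2.58, 2.81, 3.29]

a = [norm_sf(2.77, zi, sqrt(2)) for zi in z]        # 1 - pnorm(2.77, z, sqrt(2))
b = [norm_sf(1.96, zi / sqrt(2), 1.0) for zi in z]  # 1 - pnorm(1.96, z/sqrt(2), 1)

# The two differ only because 2.77/sqrt(2) = 1.9587, not exactly 1.96
print(max(abs(ai - bi) for ai, bi in zip(a, b)))  # < 0.001
```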


Thank you @R_cubed again for your comments.

I condition my probabilities of replication on the data of the study already conducted that are summarised by the mean (e.g. 1.96 mm Hg difference), SD (e.g. 10 mm Hg) and number of observations (e.g. 100) as described in my preprint.

I have explained already that they do not correspond to my assumptions and computations, which confirm the findings of his and Goodman’s paper of 2022.

Replication can be expressed in terms of getting P≤0.05 two-sided or P≤0.025 one-sided, or some other P values, in terms of the same sign, a 95% confidence interval (which does not ignore 1 side of a symmetrical distribution), etc. as they are all directly connected. The probability of replication is the same for all of these.

To help me to understand your many points about priors, please explain how your discussion on priors would change my calculations and the resulting probabilities of replication, with numerical examples based on a mean (e.g. 1.96 mm Hg), SD (e.g. 10) and number of observations (e.g. 100) in the completed study, and explain why your result may be correct and mine wrong.

Thank you @EvZ again for your comments.

My assumptions and calculations were designed to predict what real data such as Cochrane’s would show in terms of a frequency of replication after an appropriate interpretation such as yours. My results were based on adding the variance of the distribution of the completed study to the expected variance of a planned repeat study.

A planned identical repeat study would involve the same variance as the first study. Therefore doubling the variance of the first study, to give the result of a convolution of the probability distribution of possible true means (conditional on the data of the first study) with the likelihood distribution of the second study (conditional on each possible true mean), accurately modelled your interpretation of the Cochrane data. I suspect that the result of this convolution was very similar to the result of the convolution based on the Cochrane data described in your paper.

I have an interactive demonstration in Excel of all the above calculations and various distributions for any chosen target P-value etc., mean differences, standard deviations and sample sizes of the first and second study to help provide an intuitive understanding.

Editorial Note: This is some of the best back-and-forth I’ve seen — a great way to build understanding and to work towards convergence of assumptions and interpretations. Thank you all for engaging!


It seems you have convinced yourself that you’ve discovered a universal identity,

P(z_repl > 1.96 | z) = 1 - pnorm(1.96,z/sqrt(2),1).

Unfortunately, you are now unable and/or unwilling to see the problem, which is that the left-hand side depends on the distribution of the SNR while the right-hand side does not.

You write in your bio:

i am a hospital physician with a career long interest since I was a student of understanding my work in terms of mathematics, mainly probability theory. For this reason my concepts often differ from those whose training has been in mathematics and statistics.

I believe that an outside view can be very useful. However, when you think you’ve made a discovery outside your own field that all the experts missed, you have to ask yourself: “What did I know or which insight did I have that they didn’t have?” The mere fact that the variances of z and z_repl add up is not something that has gone unnoticed.


Thank you for going to the trouble of commenting again.

What I did was postulate that a particular probability model might be able to predict the frequency with which observations made on one occasion could be replicated on another, not try to discover a universal identity. My work over 40 years entailed postulating diagnoses (established models of disease) and scientific hypotheses (novel models of disease). In addition, I have experienced the process of replication in practice many times a day over 40 years in a medical setting: it happens when another physician tries to replicate my findings and diagnoses, and when I try to replicate another physician’s findings and diagnoses. This gave me an intuitive feel for the process of replication.

As the title of this thread (which I initiated in February 2024) implies, my postulated model hinges on the concept of prior probability. During random sampling from an unknown population with a true mean, the scale of the possible true means (e.g. in mm Hg of blood pressure) may be the same as the scale of the possible observations (again in mm Hg of blood pressure). This is also true of continuous variables. Now any numerical scale is a subset of the universal set of all numerical scales. Before we know the nature of the study, the prior probabilities of any of these scales conditional on the universal set are uniform. This means, for example, that the Gaussian probability distribution and the Gaussian likelihood distribution based on the same data are identical.

NB. Bayesian and other prior distributions (e.g. based on pilot studies) are conditional on information about the study design before its results are known. These prior probabilities are not conditional on a universal set and will not be uniform as Gelman and others point out. These conditional ‘prior’ distributions are the first step in a chain of conditional probabilities, the next step in the chain being conditioned on the data of the completed study.

Based on the postulate in the first paragraph, I then postulated that the distribution of possible true means could be modelled by the mean and SEM (based on the SD and number of observations) of the completed study. By my original postulate, this distribution should model all the possible means that would be discovered in a replicating study. For each of these possible true means in a myriad of repeat studies, we would expect a distribution of possible observations in the second study with its mean equal to that possible true mean and a variance determined by the sample size of the second study. If the sample size on average is the same as in the completed first study, then the average variance will be the same, and their sum will be double the variance of the completed study. Therefore the distribution of possible observed results of all the repeat studies is modelled by a convolution of the two distributions, with twice the variance of the distribution conditional on the completed study.

My proposed postulates predicted the results of the 2022 study by you and Goodman and therefore they are supported by your data. However, my postulates also predict the result of the Open Science Collaboration study by summing 3 variances based on the variance used in the initial power calculation. This does not prove that my postulates were valid but the results are consistent with them, which is all one can expect from scientific studies.

I am very keen to understand why you think that there is a problem. Please express my postulates and reasoning in your notation and explain in words where you think the problem occurs, despite making reasonably correct estimates of real results.


P(z_repl > 1.96 | z) depends on the distribution of the SNR. Suppose we’re in a field of research where all the trials have very low signal-to-noise ratios. For example, suppose the SNRs have a normal distribution with mean zero and standard deviation sigma=0.01. Then

P(z_repl > 1.96 | z=1.96)=0.025.

But now, suppose we’re in a field of research where all the trials have very high signal-to-noise ratios. For example, suppose the SNRs have a normal distribution with mean zero and standard deviation sigma=100. Then

P(z_repl > 1.96 | z=1.96)=0.5.
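The dependence on sigma can be made explicit with a small Python sketch of the shrinkage formulas given earlier in the thread (`norm_sf` stands in for R's `1 - pnorm`):

```python
from math import erfc, sqrt

def norm_sf(x, mean=0.0, sd=1.0):
    """Upper-tail probability of N(mean, sd^2), i.e. 1 - pnorm(x, mean, sd)."""
    return 0.5 * erfc((x - mean) / (sd * sqrt(2)))

def p_repl(z, sigma):
    """P(z_repl > 1.96 | z) when SNR ~ N(0, sigma^2)."""
    shrink = sigma**2 / (sigma**2 + 1)
    return norm_sf(1.96, z * shrink, sqrt(shrink + 1))

print(round(p_repl(1.96, 0.01), 3))  # 0.025: field with very low SNRs
print(round(p_repl(1.96, 1.5), 2))   # 0.32: roughly Cochrane-like sigma
print(round(p_repl(1.96, 100), 3))   # 0.5: uniform-prior limit
```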

You’re claiming that when you observe z=1.96, the probability of replication is always 0.28. That’s obviously not true.

The fact that 0.28 happens to agree with what Goodman and I found in the context of the trials from the Cochrane database is a coincidence, because you didn’t take any specific information about clinical trials into account. Also, your 40 years of clinical experience did not make it into your rule.

Please express my postulates and reasoning in your notation

It’s really up to you to express your postulates. I’ll just say that I don’t see any justification for your rule. It doesn’t even follow from any assumed distribution for the SNR.


OK. Here we go:

When the replicating study has the same mean, standard deviation and sample size as the completed study, then:

P(z_repl >1.96 ∣ z=1.96, Var_repl = 2×Var_completed study)

When, as in my simple example in the pre-print, the mean = 1.96 mm Hg, SD = 10 and sample size = 100, then SEM_completed study = 10/√100 = 1, so Var_completed study = 1, and:

P(z_repl >1.96 ∣ z=1.96, Var_repl = 2×1) = 0.283

Sorry, but your conclusion does not follow from your assumptions.

To make sure P(z_repl > 1.96 | z) is defined according to the usual rules of probability, you must specify a complete model. That means:

  1. Specify the distribution of the SNR.
  2. Assume z is the sum of SNR and a standard normal error e1.
  3. Assume z_repl is the sum of SNR and a standard normal error e2.
  4. e1 and e2 are independent.

You’ve only specified items 2, 3 and 4. That is not enough! All you’ve done is simply to claim

P(z_repl > 1.96 | z) = 1 - pnorm(1.96,z/sqrt(2),1).

You have not derived it from your assumptions.

Now, if you assume that the SNR has some normal distribution, then you can explicitly calculate P(z_repl > 1.96 | z). I showed the formulas in a previous post. If you assume some other distribution, the calculations will be more tricky.
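The complete model (items 1 to 4) can also be simulated end to end. The sketch below assumes, purely for illustration, a normal SNR distribution with sigma = 1.5 (close to the normal approximation mentioned earlier for the Cochrane-based fit, but an assumption here, not the actual fitted mixture):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000_000

# 1. Distribution of the SNR (assumed N(0, 1.5^2) for this sketch)
snr = rng.normal(0.0, 1.5, n)
# 2.-4. z and z_repl share the same SNR but have independent N(0,1) errors
z = snr + rng.normal(size=n)
z_repl = snr + rng.normal(size=n)

# Condition on z near 1.96 and look at the replication z-statistic
sel = np.abs(z - 1.96) < 0.05
print((z_repl[sel] > 1.96).mean())  # ~0.32, matching the closed-form result
```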