Some thoughts on uniform prior probabilities when estimating P values and confidence intervals

I agree. The experiment in question is repeating the actual study (which had already been completed) an infinite number of times, each repeat study being on an infinite number of subjects (not on the original number of 100 as in the example). The latter estimation is done before getting the result of what is clearly a hypothetical experiment or, as philosophers would describe it, a “thought experiment”. I am estimating the frequency of getting a result (i.e. greater than no difference) after doing this thought experiment. I would not disagree with any of the points that you make about the use of Z scores and the interpretation of P values (I use them in my calculations). However, I am trying to add another perspective, interpreting the same data but framed in the concept of “replication”.

The concept of replication is central to medical and scientific thinking. A clinician will be anxious to estimate how often fellow clinicians would concur with her or his single observation (its interpretation would be a separate matter). A scientist will have the same concern about the average of a number of observations. In the latter case, the methods used in the many repeat clinical examinations or repeat studies would have to be identical, but the numbers of observations used in the repeat studies to get averages need not be the same. In my example, each thought study involved an infinite number of observations. However, an estimate can also be made of the frequency (and deduced probability) of getting such a result if the study were repeated an infinite number of times with the same number of observations (e.g. 100) each time, and likewise of getting a one-sided P value of 0.025 again.

My calculations are based on the expertise of statisticians and their traditional concepts. However, I am merely trying to interpret and explain the results of their calculations in the light of traditional scientific and medical concepts. My hope is that this will lead to clinicians and scientists having a better understanding and appreciation of statisticians’ expertise, leading to better collaboration to improve standards of medical practice and research.


This is an asymptotic result of an imaginary experiment. It is not a probability that anyone is willing to bet on. Asymptotic results are not relevant to a data set at hand. A p-value for a collected data set is a proportion or percentile, perfectly descriptive of a divergence from a reference point. If you wish to refer to a future experiment of the exact same design, conditional on the observed test statistic, one should quote an interval, i.e. a p-value of 0.05 is equivalent to a Z score of 1.96, +/- 1. Converting back to the one-tailed p-value scale gives 0.17 (1 - 0.83) to 0.002 (1 - 0.998).
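
To make the conversion concrete, here is a minimal Python sketch (assuming scipy is available) of the calculation just described:

```python
# Minimal sketch: two-sided p = 0.05 -> Z = 1.96, then report Z +/- 1 and
# convert each interval endpoint back to a one-tailed p-value.
from scipy.stats import norm

p_two_sided = 0.05
z = norm.ppf(1 - p_two_sided / 2)        # about 1.96

for z_end in (z - 1, z + 1):
    p_one_tailed = 1 - norm.cdf(z_end)   # upper-tail p at each endpoint
    print(f"Z = {z_end:.2f}  one-tailed p = {p_one_tailed:.3f}")
# prints roughly: Z = 0.96 -> p ~ 0.17, and Z = 2.96 -> p ~ 0.002
```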

https://discourse.datamethods.org/t/bootstrap-on-regression-models/4081/4?u=r_cubed


Thank you for your comments. You doubted whether a theoretical probability based on my reasoning could be a good bet (and therefore be tested empirically)! It actually can be tested using the Open Science Collaboration data as follows.

The posterior probability distribution of true results has a mean of 1.96 mmHg and an SEM of 1. However, if we postulate a distribution of differences for the result of the second study with an SEM of 1 conditional on each point on the distribution of true values, then the overall distribution of the second study conditional on the original observed result of 1.96 mmHg will have a variance of 1 + 1 = 2 and an SEM of √2 = 1.414. Zero difference will now lie 1.96/1.414 = 1.386 SEMs away from the mean of 1.96 mmHg. This 1.386 is the Z score, so the probability of the second study result being greater than zero mmHg conditional on the first study result of 1.96 mmHg is 0.917 (which is what Killeen estimated it to be with his P-rep).

The repeat-study result that provides a P value of 0.025 must be at least 1.96 of these SEMs away from zero, which is 1.96 x 1.414 = 2.772 mmHg away from zero. However, 2.772 mmHg is 2.772 mmHg - 1.96 mmHg = 0.812 mmHg away from the mean of 1.96 mmHg. This 0.812 mmHg is 0.812/1.414 = 0.574 SEMs away from the mean of 1.96 mmHg. This Z score of 0.574 means that 71.7% of the distribution is below 2.772 mmHg and 28.3% of the distribution is above 2.772 mmHg. In other words, when the original one-sided P value is 0.025, the proportion of the distribution 1.96 SEMs or more away from zero is 28.3%, giving a one-sided P value of 0.025 or less again. If the one-sided P value is osP, the proportion of repeat results giving a P value no greater than 0.025 one-sided in a repeat study is (using a shorter formula in an Excel spreadsheet): NORMSDIST(NORMSINV(1-osP)/2^0.5-1.96).

In the Open Science Collaboration, the average P value from the original 97 articles was 0.028 two-sided, but only 36.1% (95% CI 26.6% to 46.2%) showed a two-sided P value of 0.05 or lower when repeated [1]. According to the above reasoning, for a one-sided P value of 0.028/2 = 0.014, the expected proportion showing a one-sided P value of 0.025 (or a two-sided value of 0.05) in a repeat study is NORMSDIST(NORMSINV(1-0.014)/2^0.5-1.96) = 34.2%. This is consistent with the observed result of the Open Science Collaboration study. It is therefore also consistent with the theoretical probability of 1 minus the one-sided P value (e.g. 0.975) that a repeat study with an infinite number of observations would replicate the original result by having a mean greater than zero when the original one-sided P value was, for example, 0.025. It also follows from this that the probability is 0.95 that the true result falls between the 95% confidence limits.
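
For anyone who prefers code to a spreadsheet, the formula above can be sketched in Python as follows (assuming scipy is available; the function name p_replication is just illustrative shorthand):

```python
# Sketch of the formula NORMSDIST(NORMSINV(1-osP)/2^0.5 - 1.96) quoted above:
# probability that a repeat study of the same size gives a one-sided P <= 0.025
# again, given the original one-sided P value osP (uniform prior assumed).
from math import sqrt
from scipy.stats import norm

def p_replication(os_p, alpha_one_sided=0.025):
    return norm.cdf(norm.ppf(1 - os_p) / sqrt(2) - norm.ppf(1 - alpha_one_sided))

print(round(p_replication(0.025), 3))  # about 0.283, as in the worked example above
print(round(p_replication(0.014), 3))  # about 0.342, cf. the observed 36.1% in the OSC data
```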

The assumptions made in the above calculations are similar to those used to estimate the number of observations needed to give a study the power (e.g. 80%) required to obtain a P value (e.g. of 0.025 one-sided or 0.05 two-sided). For example, let’s say we require a power of 0.8 for detecting a difference from zero with a one-sided P value of 0.025 or lower, and we assume the true mean to be exactly 1.96 mmHg and know that the standard deviation of the data is 10 mmHg. (Note that on this occasion we do not estimate the probability distribution of the true means first, but assume the true mean to be 1.96 mmHg.) In keeping with the previous reasoning, we want 80% of the area under the sampling distribution around the assumed true mean to lie beyond the critical value that gives a one-sided P value of 0.025, so the true mean must lie NORMSINV(0.8) = 0.84 SEM above that critical value. As the critical value is itself 1.96 SEMs above zero, the true mean must be 0.84 + 1.96 = 2.80 SEMs above zero difference. The assumed true mean is 1.96 mmHg away from zero, so the required SEM for a power of 0.8 is 1.96 mmHg/2.80 = 0.7 mmHg. As the SD of the data is 10 mmHg, the number of observations required to provide a power of 0.8 is (SD/SEM)^2 = (10/0.7)^2 = 204. The power from 100 observations by the same calculation would of course be 50%.
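
The same arithmetic can be written as a short Python sketch (assuming scipy is available; this is just an illustration of the calculation above, not a general-purpose power routine):

```python
# Sketch of the power calculation above: sample size for 80% power at
# one-sided alpha = 0.025, assuming a true mean difference of 1.96 mmHg
# and a known SD of 10 mmHg, plus the power that 100 observations would give.
from math import sqrt
from scipy.stats import norm

true_mean, sd = 1.96, 10.0
alpha, target_power = 0.025, 0.80

z_alpha = norm.ppf(1 - alpha)                  # 1.96
z_beta = norm.ppf(target_power)                # 0.84
sem_needed = true_mean / (z_alpha + z_beta)    # 1.96 / 2.80 = 0.7 mmHg
n_needed = (sd / sem_needed) ** 2              # (10 / 0.7)^2, about 204

power_at_100 = 1 - norm.cdf(z_alpha - true_mean / (sd / sqrt(100)))  # about 0.5

print(round(n_needed), round(power_at_100, 2))
```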

  1. The Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349(6251):aac4716.

A very non-specific comment on Huw’s original post. I read it quickly but in at least two places I felt that the ideas of probabilities about true unknown values and the probabilities of observing sample means in a certain interval were being mixed. I may be wrong.


It’s a tricky thing to write about and I may well have ‘mis-written’ in places! I have read it through but couldn’t find these possible errors. Please give the numbers of the suspect lines.

I am adhering to Harry Crane’s distinction between “academic” probabilities (which have no real-world consequences for the reporter when wrong) and real probabilities, which result in true economic gain or loss. This is independent of the idea of verifying a probabilistic claim.

Harry Crane (2018). The Fundamental Principle of Probability. Researchers.One. https://researchers.one/articles/18.08.00013v1

My complaint is that p-values themselves are not the appropriate scale for comparing the information contained in multiple studies when the point of reference (i.e. the “null hypothesis”) is not true (which it never is exactly). This has led others to suggest different scales that prevent confusing the model-based frequency properties (the sampling distribution) of the experimental design with the combined scientific consideration of the collected information and prior information (the Bayesian posterior).

As I noted above, Kulinskaya, Staudte, and Morgenthaler advise reporting variance-stabilized t-statistics, along with the standard error (+/- 1), to emphasize the random nature of evidence in this perspective. When the sample size is large enough, these statistics closely approximate the standard normal distribution N(0,1) when the null is true. Any deviation from the null reference model is simply a shift of that distribution, a.k.a. the non-centrality parameter.

The merit of this proposal is that it connects Fisher’s information-theoretic perspective (post-data reporting of a particular study) with Neyman-Pearson design considerations (i.e. large-sample results), without the cognitive pathologies noted in the literature for close to 100 years now. Frequentist statistics looks less like an ad hoc collection of algorithms in this case.

The other proposal, by @Sander, is the base-2 log transform of the p-value, which provides an information measure in bits. This is merely a different scale compared with Fisher’s natural-log transform of the p-value used for meta-analytic purposes.

Note that both transformations permit the valid combination of information from multiple experiments.
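
To make the two scales concrete, here is a small illustrative Python sketch (assuming scipy; the three p-values are hypothetical, chosen only to show the arithmetic):

```python
# Illustrative sketch: S-values in bits (-log2 p) and Fisher's method for
# combining p-values from independent studies (-2 * sum(ln p) ~ chi-squared).
from math import log, log2
from scipy.stats import chi2

p_values = [0.028, 0.04, 0.20]                 # hypothetical p-values from 3 studies

s_bits = [-log2(p) for p in p_values]          # information in bits against the reference model
fisher_stat = -2 * sum(log(p) for p in p_values)
p_combined = chi2.sf(fisher_stat, df=2 * len(p_values))

print([round(s, 1) for s in s_bits])           # e.g. p = 0.028 carries about 5.2 bits
print(round(fisher_stat, 1), round(p_combined, 3))
```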

Regarding uniform priors:

https://discourse.datamethods.org/t/what-are-credible-priors-and-what-are-skeptical-priors/580

Then too a lot of Bayesians (e.g., Gelman) object to the uniform prior because it assigns higher prior probability to β falling outside any finite interval (−b,b) than to falling inside, no matter how large b; e.g., it appears to say that we think it more probable that OR = exp(β) > 100 or OR < 0.01 than 100>OR>0.01, which is absurd in almost every real application I’ve seen.

On a Bayesian interpretation of P values:


Sorry, I couldn’t name specifics and am too far behind on other things to do it justice.

Thank you for your comments. I would point out that the probabilities of replication beyond some threshold, whether (A) with an infinite number of observations, or with the same number of observations (B) without, or (C) with, the requirement of a P value at least as low as that of the original study, are calculated using the same concepts as those used to calculate the number of observations needed to provide a given power (e.g. 80%). Both the latter and the probabilities of replication (and P values) are used to make practical decisions. The approach in my post also explains the mystery of why method C appears to give such a low frequency of replication (e.g. 36%), as shown by the Open Science Collaboration.


The problem may be that I did not make it clear that the probability distributions based on the SEM can be formed in 3 ways (I have added comments against these 3 different distributions in the original post):

  1. The distribution of probabilities of all possible true means conditional on the actually observed mean
  2. The distribution of the likelihoods of possible observed means conditional on any single true mean (e.g. the null hypothesis)
  3. The distribution of the likelihoods of observing the actual observed mean conditional on each of the possible true means

It is assumed (A) that distribution 1 is Gaussian (or some other symmetrical distribution) centred on the observed mean. It is also assumed (B) that distribution 2 is Gaussian (or the same shaped distribution as in A) with the same SEM but centred on any single mean (e.g. the null hypothesis). Distributions 1 and 3 are assumed to be the same, with the same mean and SEM, so that when X_i is any particular possible true mean and Y is the single actually observed mean, then p(X_i|Y) = p(Y|X_i), and so by Bayes rule p(X_i) = p(Y) for any X_i, and therefore p(X_i) = p(X_i+1) = p(Y). In other words, the latter are all the same, so the prior probability distribution of the X_i is uniform and equal to p(Y). This guarantees that the prior probability of seeing any observed result above a value X is the same as the prior probability of any true result above a value X. It also guarantees that for any null hypothesis Xnull, p(≤Xnull|Y) = p(>Y|Xnull) and p(>Xnull|Y) = p(≤Y|Xnull).
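
A quick numerical sketch of this symmetry (in Python, assuming numpy and scipy are available; the grid and the example values Y = 1.96 and SEM = 1 are just for illustration): a uniform prior over candidate true means makes the posterior given the observed mean Y the same Gaussian shape as the sampling distribution, so p(X ≤ 0 | Y) comes out equal to p(observed mean > Y | X = 0).

```python
# Numerical check: with a uniform prior over candidate true means X, the
# posterior p(X | Y) given observed mean Y = 1.96 (SEM 1) is Gaussian around Y,
# so the posterior tail p(X <= 0 | Y) equals the one-sided P value
# p(observed mean > Y | X = 0).
import numpy as np
from scipy.stats import norm

Y, sem = 1.96, 1.0
x = np.linspace(-10, 10, 200001)                 # grid of candidate true means
prior = np.ones_like(x)                          # uniform prior over the grid
likelihood = norm.pdf(Y, loc=x, scale=sem)       # p(Y | X = x)
posterior = prior * likelihood
posterior /= posterior.sum()                     # normalise over the grid

post_tail = posterior[x <= 0].sum()              # p(true mean <= 0 | Y)
one_sided_p = 1 - norm.cdf(Y, loc=0, scale=sem)  # p(observed > Y | null)
print(round(post_tail, 4), round(one_sided_p, 4))  # both about 0.025
```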


I don’t disagree with your arithmetic; I am only pointing out problems with the interpretation. As @Sander has pointed out numerous times, frequentist model probabilities do not capture all uncertainties, and this presentation can be misleading in that it discourages thinking not only about the sampling model, but also about the possibility of being fed intentionally misleading information.

Critical thinking needs to be enhanced with quantitative skills. To evaluate evidence already collected is an inherently Bayesian activity:

In this commentary on an Efron paper from 2010, he states the following:

First, I disagree that frequentism has supplied a good set of working rules. Instead, I argue that frequentism has been a prime source of reckless overconfidence in many fields (especially but not only in the form of 0.05-level testing). The most aggressive modeling is that which fixes unknown parameters at some known constant like zero (whence they disappear from the model and are forgotten), thus generating overconfident inferences and an illusion of simplicity; such practice is a hallmark of conventional frequentist applications in observational studies.

These probabilities (False Discovery Rates) often require the use of Bayes theorem in order to be computed, and that presents special problems. Once data are observed, it is the false discovery rates that are the relevant assessments of uncertainty. The original frequency properties of the study design - the error rates - are no longer relevant. Failure to distinguish between these evidential metrics leads to circular reasoning and irresolvable confusion about the interpretation of results as statistical evidence.

A proposal deserving serious consideration that reconciles frequentist estimates with Bayesian scientific concerns is the Reverse-Bayes methodology, originally proposed by I.J. Good and resurrected by Robert Matthews and Leonhard Held.

Matthews R.A.J. (2018). Beyond ‘significance’: principles and practice of the Analysis of Credibility. R. Soc. Open Sci. 5: 171047. link

Held L, Matthews R, Ott M, Pawel S. Reverse-Bayes methods for evidence assessment and research synthesis. Res Syn Meth. 2022; 13(3): 295-314. link

For an application, see this post:
https://discourse.datamethods.org/t/frequentist-vs-bayesian-debate-applied-to-real-world-data-analysis-any-philosophical-insights/1695/7?u=r_cubed


@HuwLlewelyn, can you take a look for example at section 7.1 here? I suspect that your original, highly interesting, post makes the leap of faith described there (even before your post starts to discuss replications), hence @f2harrell’s discomfort and why the probability distribution does not use Bayes rule. As alluded throughout this thread, this is indeed a topic that Fisher was very interested in. Curious to see how your approach addresses this leap of faith without priors?

I agree that I focus on only one aspect of interpreting data when I consider P values, confidence intervals and probabilities of replication. I consider only the error / random / sampling / stochastic issues of the data in front of me. I do not address the other issues causing uncertainty, such as past research, methodology, bias, dishonesty or inaccuracy of reporting, the plausibility of the scientific hypotheses being investigated, and so on. The same issues arise in diagnostic thinking, investigation and decisions applied to a single individual. To my mind these have to be dealt with differently, using different probabilistic models of reasoning to that of Bayes rule with an assumption of statistical independence.

I have always regarded P values, confidence intervals and probabilities of replication in this ‘stand-alone’ way and assume that in any replication study, variability due to the above issues has been eliminated by repeating the study in exactly the same way, with the same number of observations, and documenting the results accurately. This is what the Open Science Collaboration did and it is what I modelled in post number 5.

As I understand it, section 7.1 considers the issue of uniform prior probabilities. My ‘leap of faith’ was to regard fitting a Gaussian or other distribution to the observed data, by finding its mean and standard deviation (SD), as directly estimating the distribution of the probabilities of the N possible true values X_i conditional on the data mean Y and the SD, for i = 0 to N (in a continuous Gaussian distribution, N approaches infinity). Assuming that X_i is some null hypothesis and that its distribution, SD and SEM are equal to those of the observed data implies immediately from Bayes rule that for any i, p(X_i) = p(X_i+1) = p(Y), where Y is the mean of the observed data. The latter implies a uniform distribution of all the p(X_i). In other words, my ‘leap of faith’ is to make the same assumption as Fisher et al. made by implication!


Yup. That’s my sense as well. There is a whole world of ongoing methodology research related to all this, with a nice review here, and additional generalizations that can hold Bayes as a special case.


I would like to add a corollary to my earlier post of the 8th of February.

If 100 patients had been in a double-blind cross-over randomized controlled trial and 58 of those 100 individuals had a BP higher on control than on treatment, then, knowing only that an individual was one of those in the study, the probability conditional on the entry criterion of that individual patient having a BP higher on control than on treatment would be 58/100 = 0.58. The 95% confidence limits for this probability of 0.58 from the binomial theorem would be 0.483 and 0.677.

If the average BP difference between a pair of observations on treatment and control was 2 mmHg and the standard deviation of the differences was 10 mmHg, then zero difference would lie 0.2 SD below the mean of the bell-shaped Gaussian distribution in Figure 1 of post 1, so the area above 0 mmHg would correspond to 58% of the total area. From this again we see a probability of 0.58 that any randomly selected study individual has a BP difference greater than zero. However, the SEM of the mean would be 10/√100 = 1 mmHg, so the 95% confidence interval for the standardised mean of 0.2 SD would be 0.2 +/- 0.196, i.e. from 0.004 SD to 0.396 SD. For a probability of 0.58, this corresponds to a 95% confidence interval of 0.502 to 0.654.

Note that the 95% confidence interval for 58/100 = 0.58 from the binomial theorem is wider, at 0.483 to 0.677. This was based on dichotomising the results into those greater or less than zero mmHg. However, by using the actual measurement results (and, according to @Stephen, not succumbing to the awful habit of dichotomising continuous data), we get a tighter 95% confidence interval. So does this mean that if we do have to derive a proportion by dichotomising measurement results, we should estimate the 95% confidence interval for that proportion by first estimating the 95% confidence interval for the measurement results?
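
Here is a minimal Python sketch of the comparison (assuming scipy; the binomial interval uses the simple normal approximation, which is what reproduces the 0.483 to 0.677 limits quoted above):

```python
# Comparison sketch: (a) binomial (normal-approximation) 95% CI for the raw
# proportion 58/100, versus (b) a 95% CI for P(BP difference > 0) obtained from
# the continuous data (mean 2 mmHg, SD 10 mmHg, n = 100) by back-transforming
# the CI of the standardised mean difference through the normal CDF.
from math import sqrt
from scipy.stats import norm

n = 100

# (a) interval for the dichotomised data
p_hat = 58 / n
se_prop = sqrt(p_hat * (1 - p_hat) / n)
ci_prop = (p_hat - 1.96 * se_prop, p_hat + 1.96 * se_prop)

# (b) interval via the continuous measurements
mean_diff, sd = 2.0, 10.0
z = mean_diff / sd                             # 0.2 SD units
se_z = (sd / sqrt(n)) / sd                     # 0.1 SD units
ci_z = (z - 1.96 * se_z, z + 1.96 * se_z)      # 0.004 to 0.396 SD units
ci_prob = (norm.cdf(ci_z[0]), norm.cdf(ci_z[1]))

print([round(v, 3) for v in ci_prop])          # about (0.483, 0.677)
print([round(v, 3) for v in ci_prob])          # about (0.502, 0.654): narrower
```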


Great points Huw. The general way to say this is that when we have a statistical model we want to use maximum likelihood estimation, penalized MLE, or Bayesian modeling to get parameter estimates. When Y is BP, the MLE of a probability involving Y is a nonlinear function of the MLEs of the parameters of the continuous distribution. When Y is Gaussian the MLE of a probability of exceeding some cutoff is a function of the raw mean and SD, not of the proportion from dumbed-down data. There is a paper somewhere that computes the relative inefficiency of the proportion.

The general principle is that you don’t dumb-down data to make them look like the final estimate of interest. You analyze the raw data then restate the statistical parameters to get any clinical readout wanted.


I would like to suggest another corollary to post 1 (https://discourse.datamethods.org/t/some-thoughts-on-uniform-prior-probabilities-when-estimating-p-values-and-confidence-intervals/7508?u=huwllewelyn): that the replication crisis may be due to studies having insufficient power to detect replication with P values of ≤ 0.05 two-sided or ≤ 0.025 one-sided in the second (replicating) study.

In the Open Science Collaboration study, 97 experiments in psychology were selected that had been planned and conducted well and had all resulted in P values of ≤ 0.05 two-sided. The average P value of their results was 0.028 two-sided. When all 97 were repeated, only 35/97 (36.1%) were replicated with P values of ≤ 0.05 two-sided a second time.

Assume that another replication exercise had been conducted on 97 trials, each of the same nature as the study in post 1 above. At the planning stage, the estimated ‘Bayesian’ distribution would have a standard deviation of 10 mmHg and a mean BP difference of 2.2 mmHg (these two estimates corresponding to a one-sided P value of 0.014 with 100 paired observations). From this information, the number of paired observations required for an 80% power of getting a P value of ≤ 0.025 one-sided in the first real study was about 163. However, the frequency of replication with these parameters would be only 36.7%.

The calculations below are based on the assumption that the prior probability distribution of the possible true results of a study (i.e. if the study were continued until there were an infinite number of subjects), conditional on the universal set of all numbers and prior to any knowledge at all of the nature of the proposed study, is uniform. This is in contrast to a Bayesian prior probability that is also based on knowledge of the proposed design of the study, which allows parameters such as the standard deviation and mean difference to be estimated. The uniform prior probability conditional on the universal set of real numbers means that the probability and likelihood distributions regarding the true results are the same.

The calculation of this replication frequency of 36.7% is based on the summed effect of the variation represented by three distributions. Each is based on an estimated mean difference of 2.2 mmHg and an SEM of 10/√163 = 0.783, so that the variance is 0.783^2 = 0.613. These are the parameters of the estimated ‘Bayesian’ probability distribution A of the expected true values conditional on the above parameters. Distribution B, of the first study result conditional on each possible true value, is assumed to be the same as distribution A. The errors represented by these two distributions have to be summed, so the variance of their combined distribution is double the variance of distribution A: 2 x 0.613 = 1.226. The probability distribution of the estimated mean result of the second (replicating) study conditional on each possible result of the first study is also assumed to be the same as distribution A. Summing the errors again gives a variance that is triple the variance of distribution A: 3 x 0.613 = 1.839.

The Excel calculation of the probability of replication of the second study having a P value of 0.025 or less again based on a sample size of 163 and the above estimated Bayesian parameters of distribution A applied 3 times (see ‘*3’ in the Excel formula below) is therefore:
=NORMSDIST(2.2/(((10/163^0.5)^2)*3)^0.5+NORMSINV(0.025)) = 0.367 (Equation 1)

For the purpose of replication, a sample size of 163 therefore gives rise to a severely underpowered study. However, if the sample size is tripled from 163 to 489 to deal with the summation effect of three distributions, then according to the above model, we can expect a power of 80% to detect a one sided P value of ≤ 0.025 during replication and a replication probability of 0.801 as follows:
=NORMSDIST(2.2/(((10/489^0.5)^2)*3)^0.5+NORMSINV(0.025)) = 0.801 (Equation 2)
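
For convenience, Equations 1 and 2 can also be written as a short Python sketch (assuming scipy; the function name and the argument k, the number of summed variances, are just illustrative shorthand):

```python
# Sketch of Equations 1 and 2: probability that the replicating study reaches a
# one-sided P <= 0.025, with k = 3 summed variances (Bayesian planning
# distribution, first study, second study), mean difference 2.2 mmHg, SD 10 mmHg.
from math import sqrt
from scipy.stats import norm

def p_replication_planning(mean_diff, sd, n, k=3, alpha_one_sided=0.025):
    sem = sd / sqrt(n)
    return norm.cdf(mean_diff / sqrt(k * sem**2) + norm.ppf(alpha_one_sided))

print(p_replication_planning(2.2, 10, 163))   # about 0.37 (Equation 1)
print(p_replication_planning(2.2, 10, 489))   # about 0.80 (Equation 2)
```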

Based on the above assumptions, the probabilities of replication one-sided and two-sided will be the same. Therefore, I would like to suggest, on the basis of the above argument, that for a power of 80% to get replication with a two-sided P value of ≤ 0.05 again, the studies in the Open Science Collaboration would have required three times the number of observations.


Nice thinking. Don’t forget two simple things: Most initial studies are undersized, and p < 0.05 is not a lot of evidence.


I agree of course. If, from the above reasoning, I get P = 0.05 in a study with a difference of 1.96 mmHg and SD = 10 mmHg, but it was originally underpowered with only 100 observations, then the probability of replication is only 0.283:
NORMSDIST(1.96/(((10/100^0.5)^2)*2)^0.5+NORMSINV(0.05/2)) = 0.283 (Equation 3)
In this case P=0.05 was a fluke and unlikely to be replicated. The calculation is only based on 2 variances because any power calculation performed in the planning stage that estimated the result of the first study has been overtaken by the actual result of the first study. The probability of replication in the second study is therefore based only on the observed outcome of the first study and their (two) respective variances.

However, if I got P=0.05 from a study that had been well powered with 409 observations then the probability of replication would be 0.8:
NORMSDIST(1.96/(((10/409^0.5)^2)*2)^0.5+NORMSINV(0.05/2)) = 0.800 (Equation 4)
This is a more solid result, less likely to be due to chance and more likely to be replicated. Note that the P value of 0.05, the difference from zero and the SD are the same on both occasions, so without estimating the probability of replication from the number of observations on which the result was based, the P value provides limited evidence. Statisticians are already aware of this nuance intuitively, but perhaps the above reasoning will help get it across to non-statisticians.
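
For completeness, here are Equations 3 and 4 as a minimal Python sketch (assuming scipy; only two variances are summed because the first study’s result has already been observed):

```python
# Sketch of Equations 3 and 4: probability of a repeat study reaching a
# one-sided P <= 0.025, summing two variances (first study and replicate),
# for the same observed difference (1.96 mmHg) and SD (10 mmHg) but different n.
from math import sqrt
from scipy.stats import norm

def p_replication_post_hoc(mean_diff, sd, n, alpha_one_sided=0.025):
    sem = sd / sqrt(n)
    return norm.cdf(mean_diff / sqrt(2 * sem**2) + norm.ppf(alpha_one_sided))

print(round(p_replication_post_hoc(1.96, 10, 100), 3))  # about 0.283 (Equation 3)
print(round(p_replication_post_hoc(1.96, 10, 409), 3))  # about 0.800 (Equation 4)
```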

A Bayesian approach may give a different impression.