It’s a tricky thing to write about and I may well have ‘mis-written’ in places! I have read it through but couldn’t find these possible errors. Please give the numbers of the suspect lines.
I am adhering to Harry Crane’s distinction between “academic” probabilities (which have no real-world consequences for the reporter when wrong) and real probabilities, which result in true economic gain or loss. This is independent of the idea of verifying a probabilistic claim.
Harry Crane (2018). The Fundamental Principle of Probability. Researchers.One. https://researchers.one/articles/18.08.00013v1
My complaint is that p values themselves are not the appropriate scale for comparing the information contained in multiple studies when the point of reference (i.e. the “null hypothesis”) is not true (which it never is exactly). This has led others to suggest different scales that prevent confusing the model-based frequency considerations (the sampling distribution) of the experimental design with the combined, scientific consideration of the information collected together with prior information (the Bayesian posterior).
As I noted above, Kulinskaya, Staudte, and Morgenthaler advise reporting variance-stabilized t-stats, along with the standard error (+/- 1), to emphasize the random nature of evidence in this perspective. When the sample size is large enough, these stats closely approximate the standard normal distribution N(0,1) when the null is true. Any deviation from the null reference model is simply a shift of that distribution, a.k.a. the non-centrality parameter.
The merit of this proposal is that it connects Fisher’s information-theoretic perspective (post-data reporting of a particular study) with Neyman-Pearson design considerations (i.e. large-sample results), without the cognitive pathologies noted in the literature for close to 100 years now. Frequentist statistics looks less like an ad hoc collection of algorithms in this case.
The other proposal by @Sander is the base-2 log transform of the p value, which provides an information measure in bits. This is merely a different scale compared with Fisher’s natural-log transform of the p-value for meta-analytic purposes.
Note that both transformations permit the valid combination of information from multiple experiments.
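For concreteness, here is a minimal sketch in Python of both transforms; the three p-values are invented purely for illustration:

```python
from math import log, log2
from scipy import stats

p_values = [0.03, 0.20, 0.001]  # hypothetical p-values from three studies

# Base-2 log transform: S-values, i.e. bits of information against the test hypothesis
s_values = [-log2(p) for p in p_values]  # 0.03 -> about 5.1 bits

# Fisher's natural-log combination: -2 * sum(ln p) ~ chi-squared with 2k df under the nulls
chi2_stat = -2 * sum(log(p) for p in p_values)
combined_p = stats.chi2.sf(chi2_stat, df=2 * len(p_values))

print(s_values, combined_p)
```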
Regarding uniform priors:
https://discourse.datamethods.org/t/what-are-credible-priors-and-what-are-skeptical-priors/580
Then too a lot of Bayesians (e.g., Gelman) object to the uniform prior because it assigns higher prior probability to β falling outside any finite interval (−b,b) than to falling inside, no matter how large b; e.g., it appears to say that we think it more probable that OR = exp(β) > 100 or OR < 0.01 than 100>OR>0.01, which is absurd in almost every real application I’ve seen.
On a Bayesian interpretation of P values:
Sorry, I couldn’t name specifics and am too far behind on other things to do it justice.
Thank you for your comments. I would point out that the probability of replication beyond some threshold after (A) an infinite number of observations, or with the same number of observations (B) without or (C) with a P value at least as low as the P value in the original study, is calculated using the same concepts as those used to calculate the number of observations needed to provide a given power (e.g. 80%). Both the latter and the probability of replication (and the P value) are used to make practical decisions. The approach in my post also explains the mystery of why method C appears to give such a low frequency of replication (e.g. 36%), as shown by the Open Science Collaboration.
The problem may be that I did not make it clear that a distribution based on the SEM can be formed in 3 ways (I have added comments against these 3 different distributions in the original post):
1. The distribution of probabilities of all possible true means conditional on the actually observed mean
2. The distribution of the likelihoods of possible observed means conditional on any single true mean (e.g. the null hypothesis)
3. The distribution of the likelihoods of observing the actual observed mean conditional on each of the possible true means
It is assumed (A) that distribution 1 is Gaussian (or some other symmetrical distribution) centred on the observed mean. It is also assumed (B) that distribution 2 is Gaussian (or the same shaped distribution as in A) with the same SEM but centred on any single mean (e.g. the null hypothesis). Distributions 1 and 3 are assumed to be the same, with the same mean and SEM, so that when X_i is any particular possible true mean and Y is the single actually observed mean, then p(X_i|Y) = p(Y|X_i), and so by Bayes’ rule p(X_i) = p(Y) for any X_i; therefore p(X_i) = p(X_i+1) = p(Y). In other words, the latter are all the same, so that the prior probability distribution of the X_i is uniform and equal to p(Y). This guarantees that the prior probability of seeing any observed result above a value X is the same as the prior probability of any true result above a value X. It also guarantees that for any null hypothesis X_null, p(≤X_null|Y) = p(>Y|X_null) and p(>X_null|Y) = p(≤Y|X_null).
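To spell out the Bayes-rule step implicit above (just the algebra, nothing new):

$$p(X_i \mid Y) = \frac{p(Y \mid X_i)\,p(X_i)}{p(Y)}$$

so if $p(X_i \mid Y) = p(Y \mid X_i)$ for every $X_i$ (distributions 1 and 3 identical), dividing both sides by $p(Y \mid X_i)$ gives $p(X_i) = p(Y)$ for all $i$, i.e. a uniform prior over the possible true means.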
I don’t disagree with your arithmetic, I am only pointing out problems with the interpretation. As @Sander has pointed out numerous times, frequentist model probabilities do not capture all uncertainties, and this presentation can be misleading in that it discourages thinking not only about the sampling model, but also about the possibility of being fed intentionally misleading information.
Critical thinking needs to be enhanced with quantitative skills. To evaluate evidence already collected is an inherently Bayesian activity:
In this commentary on an Efron paper from 2010, he states the following:
First, I disagree that frequentism has supplied a good set of working rules. Instead, I argue that frequentism has been a prime source of reckless overconfidence in many fields (especially but not only in the form of 0.05-level testing; … The most aggressive modeling is that which fixes unknown parameters at some known constant like zero (whence they disappear from the model and are forgotten), thus generating overconfident inferences and an illusion of simplicity; such practice is a hallmark of conventional frequentist applications in observational studies.
These probabilities (false discovery rates) often require the use of Bayes’ theorem in order to be computed, and that presents special problems. Once data are observed, it is the false discovery rates that are the relevant assessments of uncertainty. The original frequency properties of the study design - the error rates - are no longer relevant. Failure to distinguish between these evidential metrics leads to circular reasoning and irresolvable confusion about the interpretation of results as statistical evidence.
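As an illustration of the Bayes-theorem computation being referred to (the prior probability, alpha and power below are made-up numbers, not anything from this thread), a small Python sketch:

```python
def false_discovery_rate(prior_real: float, alpha: float, power: float) -> float:
    """P(no real effect | 'significant' result), via Bayes' theorem.
    prior_real: prior probability that a real effect exists.
    alpha:      type I error rate of the test.
    power:      probability of a 'significant' result when the effect is real."""
    p_sig_null = alpha * (1 - prior_real)   # significant and no real effect
    p_sig_real = power * prior_real         # significant and real effect
    return p_sig_null / (p_sig_null + p_sig_real)

# e.g. if only 10% of tested hypotheses are real, alpha = 0.05, power = 0.8:
print(false_discovery_rate(prior_real=0.1, alpha=0.05, power=0.8))  # about 0.36
```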
A proposal deserving serious consideration that reconciles frequentist estimates with Bayesian scientific concerns is the Reverse-Bayes methodology, originally proposed by IJ Good and resurrected by Robert Matthews and Leonhard Held.
Matthews, Robert A. J. (2018). Beyond ‘significance’: principles and practice of the Analysis of Credibility. R. Soc. Open Sci. 5: 171047. link
Held L, Matthews R, Ott M, Pawel S. Reverse-Bayes methods for evidence assessment and research synthesis. Res Syn Meth. 2022; 13(3): 295-314. link
For an application, see this post:
https://discourse.datamethods.org/t/frequentist-vs-bayesian-debate-applied-to-real-world-data-analysis-any-philosophical-insights/1695/7?u=r_cubed
@HuwLlewelyn, can you take a look, for example, at section 7.1 here? I suspect that your original, highly interesting post makes the leap of faith described there (even before your post starts to discuss replications), hence @f2harrell’s discomfort and why the probability distribution does not use Bayes rule. As alluded to throughout this thread, this is indeed a topic that Fisher was very interested in. Curious to see how your approach addresses this leap of faith without priors?
I agree that I focus on only one aspect of interpreting data when I consider P values, confidence intervals and probabilities of replication. I consider only the error / random / sampling / stochastic issues of the data in front of me. I do not address the other issues causing uncertainty, such as past research, issues of methodology, bias, dishonesty or inaccuracy of reporting, the plausibility of the scientific hypotheses being investigated, and so on. The same issues arise in diagnostic thinking, investigation and decisions applied to a single individual. To my mind these have to be dealt with differently, using different probabilistic models of reasoning to that of Bayes rule with an assumption of statistical independence.
I have always regarded P values and confidence intervals and probabilities of replication in this ‘stand-alone’ way and assume that in any replication study, variability due to the above issues has been eliminated by repeating a study in exactly the same way, with the same number of observations, and documenting the results accurately. This is what the Open Science Collaboration did and it was this that I modelled in post number 5.
As I understand section 7.1, it considers the issue of uniform prior probabilities. My ‘leap of faith’ was that I regard fitting a Gaussian or other distribution to the observed data, by finding its mean and standard deviation (SD), as directly estimating the distribution of the probabilities of the N possible true values X_i conditional on the data mean Y and the SD, where i = 0 to N (in a continuous Gaussian distribution, N approaches infinity). The assumption that X_i equals some null hypothesis, and that its distribution, SD and SEM are equal to those of the observed data, implies immediately from Bayes rule that for any i, p(X_i) = p(X_i+1) = p(Y) when Y is the mean of the observed data. The latter implies a uniform distribution of all the p(X_i). In other words, my ‘leap of faith’ is to make the same assumption as Fisher et al. made by implication!
Yup. That’s my sense as well. There is a whole world of ongoing methodology research related to all this, with a nice review here, and additional generalizations that can hold Bayes as a special case.
I would like to add a corollary to my earlier post of the 8th of February.
If 100 patients had been in a double-blind cross-over randomized controlled trial and 58 out of those 100 individuals had a BP higher on control than on treatment, then knowing only that an individual was one of those in the study, the probability, conditional on the entry criterion, of that individual patient having a BP higher on control than on treatment would be 58/100 = 0.58. The 95% confidence limits for this probability of 0.58 from the binomial theorem would be 0.483 and 0.677.
If the average BP difference between a pair of observations on treatment and control was 2mmHg and the standard deviation of the differences was 10 mmHg, then the mean difference corresponds to 0.2 SD, and the area under the bell-shaped Gaussian distribution above 0 mmHg in Figure 1 in post 1 corresponds to 58% of the total area of the Gaussian distribution. From this again we see a probability of 0.58 that any randomly selected study individual has a BP difference greater than zero. However, the SEM of this mean would be 10/√100 = 1 mmHg (0.1 SD), so the 95% confidence interval for the mean of 0.2 SD would be +/- 0.196 SD, i.e. from 0.004 SD to 0.396 SD (0.04 mmHg to 3.96 mmHg). For the probability of 0.58, this corresponds to a 95% confidence interval of 0.502 to 0.654.
Note that the 95% confidence interval of 58/100 = 0.58 from the binomial theorem is wider, at 0.483 to 0.677. This was based on dichotomising the results into those greater or less than 0mmHg. However, by using the actual results (and, according to @Stephen, not succumbing to the awful habit of dichotomising continuous data), we get a tighter 95% confidence interval. So does this mean that if we do have to derive a proportion by dichotomising measurement results, we should estimate the 95% confidence interval for that proportion by first estimating the 95% confidence intervals for the measurement results?
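As a numerical check of the two intervals above, a sketch in Python (scipy assumed; the binomial interval uses the simple normal approximation):

```python
import math
from scipy import stats

n = 100
mean_diff, sd = 2.0, 10.0   # mmHg, paired BP differences
prop = 58 / n               # proportion with a positive difference

# (1) 95% CI for the proportion from the dichotomised data
se_prop = math.sqrt(prop * (1 - prop) / n)
ci_binomial = (prop - 1.96 * se_prop, prop + 1.96 * se_prop)       # ~ (0.483, 0.677)

# (2) 95% CI for the same proportion via the continuous data:
#     CI for the standardised mean (in SD units), mapped through the normal CDF
z = mean_diff / sd                       # 0.2 SD
sem_sd_units = (sd / math.sqrt(n)) / sd  # 0.1 SD
ci_continuous = (stats.norm.cdf(z - 1.96 * sem_sd_units),
                 stats.norm.cdf(z + 1.96 * sem_sd_units))          # ~ (0.502, 0.654)

print(ci_binomial, ci_continuous)
```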
Great points Huw. The general way to say this is that when we have a statistical model we want to use maximum likelihood estimation, penalized MLE, or Bayesian modeling to get parameter estimates. When Y is BP, the MLE of a probability involving Y is a nonlinear function of the MLEs of the parameters of the continuous distribution. When Y is Gaussian the MLE of a probability of exceeding some cutoff is a function of the raw mean and SD, not of the proportion from dumbed-down data. There is a paper somewhere that computes the relative inefficiency of the proportion.
The general principle is that you don’t dumb-down data to make them look like the final estimate of interest. You analyze the raw data then restate the statistical parameters to get any clinical readout wanted.
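A rough simulation sketch of that relative inefficiency, using the numbers from the example above (Gaussian differences with mean 2, SD 10, n = 100); this is my own illustration, not the paper mentioned:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean, sd, n, n_sims = 2.0, 10.0, 100, 20_000

samples = rng.normal(true_mean, sd, size=(n_sims, n))

# Estimator 1: raw proportion of positive differences (dichotomised data)
prop_est = (samples > 0).mean(axis=1)

# Estimator 2: normal-theory plug-in estimate Phi(mean / SD) from the raw data
norm_est = stats.norm.cdf(samples.mean(axis=1) / samples.std(axis=1, ddof=1))

# Both target P(difference > 0) ~ 0.579; the proportion shows the larger variance
print(prop_est.var(ddof=1), norm_est.var(ddof=1))
```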
I would like to suggest another corollary to post 1 (https://discourse.datamethods.org/t/some-thoughts-on-uniform-prior-probabilities-when-estimating-p-values-and-confidence-intervals/7508?u=huwllewelyn): the replication crisis may be due to studies having insufficient power to achieve replication with P values of ≤ 0.05 two-sided or ≤ 0.025 one-sided in the second (replicating) study.
In the Open Science Collaboration study, 97 different experiments in psychology were selected that had been planned and conducted well and had all resulted in two-sided P values of ≤ 0.05. The average P value in their results was 0.028 two-sided. When all 97 were repeated, only 35/97 (36.1%) were replicated with two-sided P values of ≤ 0.05 a second time.
Assume that another replication study had been conducted based on 97 trials of the same nature as the study in post 1 above. At the planning stage an estimated ‘Bayesian’ distribution would have a standard deviation of 10mmHg and a mean BP difference of 2.2mmHg (the latter two estimates corresponding to a one-sided P value of 0.014). From this information the number of paired observations required for an 80% power of getting a P value of ≤ 0.025 one-sided in the first real study was about 163. However, the frequency of replication with these parameters was only 36.7%.
The calculations below are based on the assumption that the prior probability distribution of possible true results of a study (i.e. if the study was continued until there were an infinite number of subjects) conditional on the universal set of all numbers and prior to any knowledge at all of the nature of the proposed study is uniform. This is in contrast to a Bayesian prior probability that is also based on knowledge of the proposed design of the study. The latter allows the parameters such as standard deviation and mean differences to be estimated. The uniform prior probability conditional on the universal set of real numbers means that the probability and likelihood distributions regarding the true results are the same.
The calculation of this replication frequency of 36.7% is based on the added effect of variation represented by the above three distributions. Each distribution is based on an estimated mean difference of 2.2mmHg and an SEM of 10/√163 = 0.783, so that the variance is 0.783^2 = 0.613. These would be the parameters of an estimated ‘Bayesian’ probability distribution A of the expected true values conditional on the above parameters. Distribution B of the first study result conditional on each possible true value will be assumed to be the same as distribution A. The errors represented by these two distributions have to be summated, so the variance of their combined distribution will be the sum of their variances, which is double the variance of distribution A: 2 x 0.613 = 1.226. The probability distribution of the estimated mean results of the second replicating study conditional on each possible result of the first study will also be the same as distribution A. The errors have to be summated again, so the resulting variance will be triple the variance of distribution A: 3 x 0.613 = 1.839.
The Excel calculation of the probability of replication of the second study having a P value of 0.025 or less again based on a sample size of 163 and the above estimated Bayesian parameters of distribution A applied 3 times (see ‘*3’ in the Excel formula below) is therefore:
=NORMSDIST(2.2/(((10/163^0.5)^2)*3)^0.5+NORMSINV(0.025)) = 0.367 (Equation 1)
For the purpose of replication, a sample size of 163 therefore gives rise to a severely underpowered study. However, if the sample size is tripled from 163 to 489 to deal with the summation effect of three distributions, then according to the above model, we can expect a power of 80% to detect a one sided P value of ≤ 0.025 during replication and a replication probability of 0.801 as follows:
=NORMSDIST(2.2/(((10/489^0.5)^2)*3)^0.5+NORMSINV(0.025)) = 0.801 (Equation 2)
Based on the above assumptions, the probability of replication one and two sided will be the same. Therefore, I would like to suggest on the basis of the above argument that for a power of 80% to get replication with a two sided P value of ≤ 0.05 again, the Open Science Collaboration study would have required 3 times the number of samples.
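For anyone who prefers to check these in Python rather than Excel, here is a direct translation as a sketch (scipy assumed; the function name is mine; k is the number of summed variances, 3 in Equations 1 and 2):

```python
from scipy.stats import norm

def replication_prob(d, s, n, p_one_sided, k):
    """Translation of NORMSDIST(d / sqrt(k * (s/sqrt(n))**2) + NORMSINV(p_one_sided)).
    d: estimated or observed mean difference, s: SD, n: sample size,
    k: number of summed variances (3 at the planning stage, as above)."""
    sem = s / n ** 0.5
    return norm.cdf(d / (k * sem ** 2) ** 0.5 + norm.ppf(p_one_sided))

print(replication_prob(2.2, 10, 163, 0.025, k=3))  # ~0.367, Equation 1
print(replication_prob(2.2, 10, 489, 0.025, k=3))  # ~0.801, Equation 2
```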
Nice thinking. Don’t forget two simple things: Most initial studies are undersized, and p < 0.05 is not a lot of evidence.
I agree of course. If, from the above reasoning, I get P=0.05 in a study with a difference of about 2mmHg (1.96mmHg) and SD = 10, but it was originally underpowered with only 100 observations, then the probability of replication is only 0.283:
NORMSDIST(1.96/(((10/100^0.5)^2)*2)^0.5+NORMSINV(0.05/2)) = 0.283 (Equation 3)
In this case P=0.05 was a fluke and unlikely to be replicated. The calculation is only based on 2 variances because any power calculation performed in the planning stage that estimated the result of the first study has been overtaken by the actual result of the first study. The probability of replication in the second study is therefore based only on the observed outcome of the first study and their (two) respective variances.
However, if I got P=0.05 from a study that had been well powered with 409 observations then the probability of replication would be 0.8:
NORMSDIST(1.96/(((10/409^0.5)^2)*2)^0.5+NORMSINV(0.05/2)) = 0.800 (Equation 4)
This is a more solid result, less due to chance and more likely to be replicated. Note that the P = 0.05, the difference from zero and the SD are the same on both occasions, so without estimating the probability of replication based on the number of observations on which the P value was based, the P value provides limited evidence. Statisticians are already aware intuitively of this nuance but perhaps the above reasoning will help get the nuance across to non-statisticians.
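The same replication_prob sketch from earlier reproduces these two figures, now with k = 2 because the first study’s result is already known:

```python
print(replication_prob(1.96, 10, 100, 0.025, k=2))  # ~0.283, Equation 3
print(replication_prob(1.96, 10, 409, 0.025, k=2))  # ~0.800, Equation 4
```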
A Bayesian approach may give a different impression.
The estimated prior distribution of the expected result of the study has a major influence on these approaches too, of course.
If I had done the power calculation to estimate the number of required observations using a subjective distribution with an estimated BP difference from zero of 1.96mmHg and an SD of 10, then the number of observations needed for an 80% power of getting P≤0.05 in the replicating study would have been 613:
(10/(((1.96/(NORMSINV(0.8)-NORMSINV(0.05/2)))^2)/3)^0.5)^2 = 612.9 (Equation 5)
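Equation 5 rearranged as a Python sketch (again scipy assumed and the function name is mine); algebraically the Excel formula simplifies to n = k·(s·z/d)² with z = NORMSINV(power) − NORMSINV(p):

```python
from scipy.stats import norm

def n_required(d, s, power, p_one_sided, k):
    """Sample size giving probability 'power' of achieving the one-sided P value,
    with k summed variances (k = 3 here, for replication in a second study)."""
    z = norm.ppf(power) - norm.ppf(p_one_sided)
    return k * (s * z / d) ** 2

print(n_required(1.96, 10, 0.8, 0.025, k=3))  # ~613, Equation 5
```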
If the resulting first study came up with a BP difference of 1.96mmHg, then by using this to estimate the probability of replication based on two variances with 613 observations, we get a probability of replication of 0.929:
NORMSDIST(1.96/(((10/613^0.5)^2)*2)^0.5+NORMSINV(0.025)) = 0.929. (Equation 6)
If the result of the first study happened to show a BP difference of 1.6mmHg based on 613 observations, then the probability of replication based on 2 variances is 0.801, indicating that a P value above 0.05 could be a useful result:
NORMSDIST(1.6/(((10/613^0.5)^2)*2)^0.5+NORMSINV(0.025)) = 0.801 (Equation 7)
Does this bear a resemblance to the usual Bayesian approach?
If I had done the conventional power calculation to estimate the number of required observations based on an estimated BP difference from zero of 2.197mmHg and a SD of 10, then the number of observations required for a power of 80% to get P≤0.05 for the first study would be 163:
(10/(((2.197/(NORMSINV(0.8)-NORMSINV(0.05/2)))^2)/1)^0.5)^2 = 162.6 (Equation 8)
If we assume the Bayesian prior distribution had the same or similar SD of 10, a difference from zero of 2.197 and P=0.028, then by combining the Bayesian prior with the actual result we end up with a posterior distribution with a similar mean difference but based on twice the number of observations, 163+163 = 326. If we now use this Bayesian ‘posterior’ distribution from the first study to estimate the probability of replication with P≤0.05 in the second study using the 2-variance formula, we get:
NORMSDIST(2.197/(((10/326^0.5)^2)*2)^0.5+NORMSINV(0.025)) = 0.801 (Equation 9)
On this occasion, the ‘Bayesian’ reasoning gives the same result as Equation 2 in Post 19. A different prior Bayesian distribution would give a different result of course. Does this make sense?
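Checking Equations 8 and 9 with the two sketch functions from earlier in the thread (k = 1 for the conventional single-study power calculation, k = 2 once the ‘posterior’ is treated as the first-study result):

```python
print(n_required(2.197, 10, 0.8, 0.025, k=1))        # ~163, Equation 8
print(replication_prob(2.197, 10, 326, 0.025, k=2))  # ~0.801, Equation 9
```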
If I understand, there are 2 remaining issues:
- you are still getting evidence for non-zero effects instead of the likely more relevant evidence for non-trivial effects when computing p-values
- an SD of 10 in the raw data is not relevant for the SD of the prior for an effect
I take your point. I tried to combine the estimated distribution, as if it were a Bayesian prior distribution based on knowledge of the proposed study, with the actual result of the first study to get a posterior estimation. I agree that the reasoning doesn’t work in this setting. I will therefore use the term ‘Bayesian-like’ because I don’t intend to use the estimated distribution in a Bayesian manner by combining it with a likelihood distribution to get a posterior distribution. It is the following five approaches (A) to (E) that I am proposing as being legitimate (a short code sketch mapping them onto the earlier formulas follows the list):
(A) To estimate the probability of replication ‘x’ of getting P≤y (1-sided) in a second study when the first study’s observations are known already (i.e. (1) the number ‘n’ of observations made, (2) the observed difference ‘d’ of the mean from zero and (3) the observed standard deviation ‘s’). This approach is based on doubling (*2 in the formula below) the variance calculated from the above 3 observations.
x = NORMSDIST(d/(((s/n^0.5)^2)*2)^0.5+NORMSINV(y)) Equation 10
(B) To estimate the number of observations ‘n’ needed for a power of ‘x’ to get P≤y (1 sided) in the first study from a ‘Bayesian-like’ prior distribution based on prior knowledge of the planned study that allows an estimate to be made of the difference of the mean from zero ‘d’ and an estimated standard deviation ‘s’. This calculation involves doubling the variance of the Bayesian-like prior distribution (applied by ‘/2’ in the formula below).
n = (s/(((d/(NORMSINV(x)-NORMSINV(y)))^2)/2)^0.5)^2 Equation 11
(C) To estimate the number of observations needed for a power of x to get P≤y two-sided (i.e. y/2 one-sided) in the second study from a ‘Bayesian-like’ prior distribution based on an estimated difference of the mean from zero ‘d’ and an estimated standard deviation ‘s’. This calculation involves tripling the variance of the Bayesian-like distribution (applied by ‘/3’ in the formula below).
n = (s/(((d/(NORMSINV(x)-NORMSINV(y/2)))^2)/3)^0.5)^2 Equation 12
(D) A ‘what if’ calculation based on the above Bayesian-like distribution, calculating the probability of replication ‘x’ for the first study based on doubling the variance of the Bayesian-like distribution (applied by ‘*2’ in the formula below) and inserting various observation numbers ‘n’, various values of the difference of the mean from zero (d), the standard deviation (s) and the desired 1-sided P value ‘y’ into the expression. This can be used for sensitivity analyses of the parameters of the Bayesian-like distribution.
x = NORMSDIST(d/(((s/n^0.5)^2)*2)^0.5+NORMSINV(y)) Equation 13
(E) A ‘what if’ calculation based on the above Bayesian-like distribution, calculating the probability of replication ‘x’ of the first study by a second study based on tripling the variance of the Bayesian-like distribution (applied by ‘*3’ in the formula below) and inserting various observation numbers ‘n’, various values of the difference of the mean from zero (d), the standard deviation (s) and the desired 1-sided P value ‘y’ into the expression. This can be used for sensitivity analyses of the parameters of the Bayesian-like distribution.
x = NORMSDIST(d/(((s/n^0.5)^2)*3)^0.5+NORMSINV(y)) Equation 14
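As a hedged summary of how (A) to (E) map onto the two Python sketches defined earlier in the thread; the numbers plugged in below are arbitrary illustrations only:

```python
# (A) probability of replication in a second study from an observed first study (k = 2)
x_a = replication_prob(d=2.0, s=10, n=150, p_one_sided=0.025, k=2)      # Equation 10
# (B) sample size for the first study from a 'Bayesian-like' prior (k = 2)
n_b = n_required(d=2.0, s=10, power=0.8, p_one_sided=0.025, k=2)        # Equation 11
# (C) sample size for the second (replicating) study from the prior (k = 3; note y/2)
n_c = n_required(d=2.0, s=10, power=0.8, p_one_sided=0.05 / 2, k=3)     # Equation 12
# (D) 'what if' probability for the first study from the prior (k = 2)
x_d = replication_prob(d=2.0, s=10, n=150, p_one_sided=0.025, k=2)      # Equation 13
# (E) 'what if' probability for the second study from the prior (k = 3)
x_e = replication_prob(d=2.0, s=10, n=150, p_one_sided=0.025, k=3)      # Equation 14
print(x_a, n_b, n_c, x_d, x_e)
```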
I will address your interesting point about non-trivial differences. The best diagram I have is for the example of a mean difference of 1.96mmHg BP and an SD of 10. Figure 1 can therefore represent scenario (A) when the number of observations was 100. In this case there was a probability of replication of 0.283 for a one-sided P value of 0.025:
=NORMSDIST(1.96/(((10/100^0.5)^2)*2)^0.5+NORMSINV(0.025)) = 0.283 Equation 15
This corresponds to the area to the right of arrow C of the distribution (under the black unbroken line) where P≤0.025 in the replicating study (i.e. all BP differences ≥2.77mmHg).
If we assume that the prior probability conditional on the set of all numbers is uniform (i.e. prior to knowing the nature of the study), then when P=0.025 the probability of the true value being a BP difference of ≥0mmHg is 0.975 (see the area to the right of arrow A in Figure 1 under the green dotted distribution). For a non-trivial difference of 1mmHg BP we look at the area to the right of arrow B, where the probability of the true value being a BP difference of ≥1mmHg is 0.831 and P=0.169 (1-0.831). The probability of getting the same result again is also 0.283.
If we move the arrow C to D (from a BP difference of 2.77mmHg to 3.77mmHg) then this BP difference of ≥3.77mmHg accounts for 10% of the results. They correspond to P≤0.003824 for the green dotted distribution at the broken black arrow D. The probability of the true value being a BP difference of ≥0mm Hg conditional on a result mean of 3.77mmHg is 0.996176 (1-0.003824). This is represented in Figure 1 by moving the red dotted distribution from a mean of 2.77mmHg (the big black arrow) so that the mean is 3.77mmHg (the small broken arrow D). However, the probability of the true value being a BP difference of ≥1mm Hg conditional on an observed mean of 3.77mmHg is 0.975. There is a probability of 0.100 that this will also be the case if the study is repeated (corresponding to the area under the black unbroken distribution to the right of arrow D).
NORMSDIST(1.96/(((10/100^0.5)^2)*2)^0.5+NORMSINV(0.003824))=0.100 Equation 16
Figure 1:
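The two ‘what if’ probabilities in Equations 15 and 16 can be checked with the earlier replication_prob sketch:

```python
print(replication_prob(1.96, 10, 100, 0.025, k=2))     # ~0.283, Equation 15
print(replication_prob(1.96, 10, 100, 0.003824, k=2))  # ~0.100, Equation 16
```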
In conclusion, these arguments depend on a number of principles:
- The prior probability of each possible result of a study is uniform conditional on the universal set of real numbers (and before the nature of a proposed study, its design etc. is known). This means, for example, that the probability of a result being greater than zero after continuing a study until there is an infinite number of observations equals 1-P when the null hypothesis is zero.
- The Bayesian-like prior probability distribution of all possible true results is a personal estimation conditional on knowledge of a proposed study and estimation of the distribution’s various parameters.
- During calculations of the probability of achieving a specified P value, of replication of a first study by a second study, of statistical power, or of the number of observations required, this Bayesian-like distribution is not combined with real data to arrive at a posterior distribution; instead its estimated variance is doubled or tripled.