Relating S-values to other measures of statistical information

After reading this paper again:

I wanted to share some R code I wrote to improve my intuition about how information measures relate to other common statistical quantities (e.g., Z-scores).

# Converting among various measures of information and evidence:
# Shannons (bits), p-values, Z-scores, Bayes factor bounds.

# Bayes factor bound: 1 / (-e * p * ln(p)) for p < 1/e, otherwise 1
bayes_factor <- function(p) {
  if (p >= exp(-1)) {
    1
  } else {
    1 / (-exp(1) * p * log(p))
  }
}

shannon_table <- function(lower, upper) {
  bits        <- lower:upper
  states      <- 2^bits                       # distinct states for n bits
  p_val       <- 1 / states                   # p-value with the same surprisal
  one_side_z  <- qnorm(1 - p_val)             # one-tailed (positive) Z-score
  two_side_z  <- qnorm(1 - p_val / 2)         # two-tailed Z-score
  bayes_bound <- sapply(p_val, bayes_factor)  # Bayes factor bound
  info_table  <- data.frame(bits, states, p_val, one_side_z, two_side_z, bayes_bound)
  print(info_table)
}

If you call the shannon_table function with two positive integer arguments (i.e., bits of information), it will calculate:

  1. The number of distinct states (unsigned values) representable by n bits: 2^n
  2. A decimal that can be interpreted as a p-value: \frac{1}{2^n}
  3. A one-tailed (positive) Z-score: \Phi^{-1}(1-\frac{1}{2^n})
  4. A two-tailed Z-score: \Phi^{-1}(1-\frac{1}{2^{n+1}})
  5. The Bayes Factor Bound for precise null hypotheses: \frac{1}{-e \, p \, \ln(p)}

Here is the table for the range of 3 to 25 bits of information. To get a sense of scale, the Higgs boson discovery had a Z-score a bit higher than 5; that is about 22 bits of information.
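For reference, the table below was generated by calling the function above as:

shannon_table(3, 25)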

bits     states      p_val  one_side_z  two_side_z  bayes_bound
   3          8  1.250e-01       1.150       1.534        1.415
   4         16  6.250e-02       1.534       1.863        2.123
   5         32  3.125e-02       1.863       2.154        3.397
   6         64  1.563e-02       2.154       2.418        5.661
   7        128  7.813e-03       2.418       2.660        9.705
   8        256  3.906e-03       2.660       2.886       16.98
   9        512  1.953e-03       2.886       3.097       30.19
  10       1024  9.766e-04       3.097       3.297       54.35
  11       2048  4.883e-04       3.297       3.487       98.81
  12       4096  2.441e-04       3.487       3.668       181.2
  13       8192  1.221e-04       3.668       3.842       334.4
  14      16384  6.104e-05       3.842       4.009       621.1
  15      32768  3.052e-05       4.009       4.170        1159
  16      65536  1.526e-05       4.170       4.325        2174
  17     131072  7.629e-06       4.325       4.475        4092
  18     262144  3.815e-06       4.475       4.621        7729
  19     524288  1.907e-06       4.621       4.763       14645
  20    1048576  9.537e-07       4.763       4.901       27826
  21    2097152  4.768e-07       4.901       5.035       53002
  22    4194304  2.384e-07       5.035       5.167      101185
  23    8388608  1.192e-07       5.167       5.295      193572
  24   16777216  5.960e-08       5.295       5.420      371013
  25   33554432  2.980e-08       5.420       5.543      712345

If you look closely at the Z-scores, you can see that a two-tailed test loses 1 bit of information.
The -e \, p \, \ln(p) bound is explicitly related to the odds in favor of H_1 relative to H_0.

I would think a similar best-case Bayes factor for dividing hypotheses could be obtained by using the one-sided p-value (i.e., H_1: \theta > 0 vs. H_0: \theta \leq 0), giving the best-case Bayes factor that the observed sign is the correct one, for any effect size (assuming a uniform prior).
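A quick numerical check of the one-bit claim: for any fixed Z, the two-tailed p-value is double the one-tailed one, which costs exactly one bit of surprisal:

z     <- 3
p_one <- 1 - pnorm(z)            # one-tailed p-value for Z = 3
p_two <- 2 * (1 - pnorm(z))      # two-tailed p-value for the same Z
-log2(p_one) - (-log2(p_two))    # exactly 1 bit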


Very interesting, and to me (from your table) it seems that we are just debating the change in the level of evidence against the null from clearly unacceptable to clearly acceptable. So let us say we fix "clearly unacceptable" as a prior probability against the null, in terms of odds, at say 5:100; then the question is: at what likelihood ratio (based on the data) will the posterior odds be considered clearly acceptable evidence against the null (perhaps odds of 2:1)? The likelihood ratio can easily be computed as (2-p)/p, and thus the threshold is reached when:
(0.05/0.95) * ((2-p)/p) >= 2
Interestingly, p works out to be 0.05 when this threshold is met, so perhaps thinking about p-values in this fashion may mitigate the interpretation-related concerns we have been discussing?
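A quick R check of that threshold, using the numbers above:

posterior_odds <- function(p) (0.05 / 0.95) * ((2 - p) / p)    # prior odds times LR
posterior_odds(0.05)                                           # ~2.05, just above 2:1
uniroot(function(p) posterior_odds(p) - 2, c(1e-6, 0.5))$root  # p ~= 0.0513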


> Very interesting and to me (from your table) it seems that we are just debating the change in level of evidence against the null from clearly unacceptable to clearly acceptable. So let us say we fix clearly unacceptable as a prior probability against the null in terms of odds

I actually used this type of reasoning to compare the data reported in a meta-analysis on ACL injury to the narrative section in this thread.

It would be accurate to say that this table has compressed a number of independent papers that touch upon the issue of information and evidence.

Aside from Sander's many writings on the topic that call for embedding statistical reasoning in information-theoretic frameworks, I wanted to link that more directly to Bayesian and Fisher's ideas, as described by Kulinskaya, Morgenthaler, and Staudte in this book on meta-analysis:

R-cubed, thanks for the reference to that post - I didn't realize someone had already thought about this, as I was just making an off-the-cuff observation. I guess when Bayarri et al. say "strength of the evidence in favor of H1 relative to H0" they also probably mean the change in level of evidence against the null from clearly unacceptable to clearly acceptable. I just fixed the prior odds at 5:100, and only if the alternative hypothesis were so unlikely would we require the strength of evidence needed to accept it to be huge. The question is: how do we arrive at acceptable prior odds? Also, is a likelihood ratio that depends on the data (or more precisely the p-value) acceptable? I have suggested (2-p)/p, but Bayarri suggests 1/(-ep(log p)); though I think (2-p)/p is intuitive, I am sure someone on this blog will tell us why it is wrong :grinning:.

The Bayes factor bound is the most optimistic assessment of the information from the experiment that distinguishes between the two hypotheses.

But before calculating the posterior odds via that relationship to the p-value, we need to rank experiments in terms of their error rates. Berger, Bayarri, and Sellke make the case that Bayes factors can be interpreted in the frequentist sense in terms of relative error rates, giving \frac{1 - \bar \beta}{\alpha}: the ratio of expected power to Type I error.

We can see that this is correct by considering the "experiment" of coin flips. If we were to answer an empirical question simply by flipping coins, our expected power would be 0.5 and our \alpha would also be 0.5, making our expected Bayes factor 1. That experiment is clearly useless if you want to learn anything.

Experiments have value only when 1 - \bar\beta \gt \alpha.
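A minimal sketch of that ratio in R (the function name is mine):

# Pre-experimental rejection ratio: expected power over Type I error
rejection_ratio <- function(power, alpha) power / alpha
rejection_ratio(0.5, 0.5)   # coin-flip "experiment": 1, no information
rejection_ratio(0.8, 0.05)  # a conventional design: 16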

I have no idea why this POV is not taught in intro stats.

They had two likelihood ratios, which they called Rpre and Rpost (R being "rejection ratio" - not sure why they used such confusing language). If I call these LRpre and LRpost, that makes it clearer: "pre" stands for pre-experimental and "post" for post-experimental. Agree that:
LRpre = (1-β)/α
Therefore, assuming 1-β is fixed at 0.8,
α = Oprior × 0.05 (assuming the posterior odds for "nominal" evidence against the null are fixed at 16:1)
Thus, if Oprior = 5:100, then the threshold for significance in order to have "nominal" evidence against the null should be:
0.05 × 0.05 = 0.0025
The question is: where do we get a reliable estimate of Oprior from?

> Thus if Oprior = 5:100 then the threshold for significance in order to have the "nominal" evidence against the null should be: 0.05 × 0.05 = 0.0025

If we are using the prior-odds format, let's reduce 5:100 to 1:20 - i.e., your skeptical prior discounts "significant" reports, believing that only 1 in 21 "significant" reports is valid (converting prior odds to a probability).
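A quick conversion of those prior odds to a probability:

odds <- 1 / 20
odds / (1 + odds)  # prior probability ~0.048, i.e. 1 valid report in 21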

In that sense, I find the term "rejection ratio" very clear. We need to think of the \alpha level that seems credible before looking at the data, and stop calling P > 0.05 "no evidence." Enough "no evidence" studies can be aggregated (using the much-maligned p-value) to detect a signal when no individual study could.

You also need to set a target posterior odds. In my post, I used posterior odds of 9:1: after seeing the data, I'd have at least 9:1 posterior odds that the reported sign is correct in a particular study.

In a meta-analysis scenario where I'm aggregating the signal from a number of studies, I'd have posterior odds of at least 9:1 that there was a signal in at least one study.

Using elementary algebra, we can solve for our \alpha given these assumptions:

Odds_{posterior} \approx Odds_{prior} \times LR
9 \approx 1/20 \times \frac{0.8}{\alpha}
\alpha \approx \frac{0.8}{180}
\alpha \approx 0.004 (one-sided)

In terms of S-values, that is about 8 bits of information.

If we use your 16:1 as the posterior odds:

16 \approx 1/20 \times \frac{0.8}{\alpha}
\alpha \approx \frac{0.8}{16 \times 20}
\alpha = 0.0025, or about 9 bits of information.
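The same arithmetic in R (a sketch; the helper name is mine):

# alpha = prior odds * power / posterior odds; bits = -log2(alpha)
alpha_for <- function(posterior, prior, power) prior * power / posterior
a9  <- alpha_for(posterior = 9,  prior = 1/20, power = 0.8)  # ~0.0044
a16 <- alpha_for(posterior = 16, prior = 1/20, power = 0.8)  # 0.0025
-log2(c(a9, a16))  # ~7.8 and ~8.6, i.e. about 8 and 9 bits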

> Where do we get Oprior from?

I think Fisher's insight was that we don't need an entire prior distribution to make progress. By considering the sample size we are willing to spend (an indirect reference to Neyman-Pearson power), we can derive the prior odds needed to change our "rejection" of H_1 to tentative "acceptance." We don't even need to be terribly exact or formal about it.

Addendum: here is another paper worth reading:

Thanks for the good references. Thinking about this: if we consider the p-value from a hypothesis test as the threshold for positivity in a continuous test scenario, then it works out that the posterior odds in favor of H1 are simply (1-p)/p, and thus the posterior probability in favor of H1 is 1-p, as the β cancels out. I am assuming that both prior and posterior odds in favor of H1 are conditional - i.e., prior odds are conditional on the threshold p not being met, and posterior odds are conditional on the threshold p being met, with a new LR (based on α and β) connecting the two. Thoughts?

> if we consider the p value from a hypothesis test as the threshold for positivity in a continuous test scenario

I wanted to derive what Fisher might have been thinking about how a rational community of scientists should use the P-value as a measure of indirect evidence. He was explicit (although inconsistent) that p-values should be kept as continuous measures, not dichotomized like N-P \alpha levels.

> then it works out that the posterior odds in favor of H1 is simply (1-p)/p

I think we need to clarify what H1 is. If H_1 is \theta \ne 0, the point null it is tested against is a priori false, as it has a prior probability of 0 under any continuous prior.

If H_1 is \theta > 0 vs. H_0: \theta \leq 0 (or vice versa), I don't think I'd use the term posterior odds unless you specify some prior. It would be valid if your prior odds were 1:1.

What I've concluded from the Bayarri, Berger, and Sellke Rejection Ratio paper is that we need to make a few assumptions regarding prior and posterior odds in order to obtain value from surprisals.

> I am assuming that both prior and posterior odds in favor of H1 are conditional - i.e. prior odds is conditional on threshold p not met and posterior odds is conditional on threshold p met with a new LR (based on α and β) connecting the two. Thoughts?

I'm going to need to think about this more, but my initial intuition is that we need to be explicit about either:

  1. Prior odds, posterior odds, and power, in order to derive \alpha.
  2. Posterior odds, power, and the p-value, in order to derive the prior odds that render the study persuasive or not.

Both are one-line rearrangements of the same relation; see the sketch below.
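In R (the function names are mine, assuming LR = power/\alpha as above):

# 1: alpha from prior odds, posterior odds, and power
alpha_from <- function(prior, posterior, power) prior * power / posterior
# 2: prior odds needed, given posterior odds, power, and the observed p
prior_from <- function(posterior, power, p) posterior * p / power
prior_from(posterior = 9, power = 0.8, p = 0.004)  # ~0.045, i.e. prior odds of about 1:22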

What I meant is: consider a p-value akin to a continuous diagnostic test variable (e.g., 25OHVitD with a threshold for deficiency/sufficiency); then:
prior odds in favor of H1 = β ∕ (1-β) when the threshold is not met
LR = [(1-β) / α] ÷ [β / (1-α)]
posterior odds in favor of H1 if the threshold is met are then
= β ∕ (1-β) × {[(1-β) / α] ÷ [β / (1-α)]}
= (1-α) / α
= (1-p) / p if we take the observed p to be the threshold α
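A quick R check of that cancellation (my own sketch, with arbitrary values of β):

posterior_odds_H1 <- function(alpha, beta) {
  prior_odds <- beta / (1 - beta)                    # prior odds assumed above
  lr <- ((1 - beta) / alpha) / (beta / (1 - alpha))  # LR for crossing the threshold
  prior_odds * lr
}
posterior_odds_H1(alpha = 0.05, beta = 0.2)  # 19 = (1 - 0.05) / 0.05
posterior_odds_H1(alpha = 0.05, beta = 0.5)  # also 19: beta has cancelled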

So clearly β is irrelevant - is it not?

I think that only works out in your special case; I would not generalize too much from it.

> So clearly β is irrelevant - is it not?

Addendum: after thinking about this, I believe your argument fails, as you didn't constrain 1 - \beta > \alpha.

If I set your test's 1 - \beta = \alpha, then LR = 1 regardless of the outcome, and the test is useless.

Agree - I made the following mistake:
prior odds in favor of H1 ≠ β ∕ (1-β) when the threshold is not met
It is probably (1-NPV) ∕ NPV (in DTA parlance).
Hence β is relevant, and that makes sense.


Adding a paper by @sander for more mathematical background on this topic.

Another closely related paper, this one on replication. Interestingly, plugging in the one-sided p-value underestimates the probability that the sign is correct, if their calculations are correct (which I have little reason to doubt).

van Zwet EW, Goodman SN. How large should the next study be? Predictive power and sample size requirements for replication studies. Statistics in Medicine. 2022;1-12. doi:10.1002/sim.9406
