After reading this paper again, I wanted to share some R code I wrote to improve my intuition about how information measures relate to other common statistical quantities (e.g., Z-scores).
```r
# Converting among various measures of information and evidence:
# Shannons (bits), p-values, Z-scores, Bayes Factor Bounds.

# Bayes Factor Bound: 1 / (-e * p * ln(p)), defined for p < 1/e.
# For p >= 1/e the bound is simply 1.
bayes_factor <- function(p) {
  if (p >= 1 / exp(1)) {
    1.00
  } else {
    1 / (-exp(1) * p * log(p))
  }
}

shannon_table <- function(lower, upper) {
  bits <- lower:upper
  states <- 2^bits                            # unsigned values representable in n bits
  p_val <- 1 / states                         # probability of a single state
  one_side_z <- qnorm(1 - p_val)              # one-tailed Z-score
  two_side_z <- qnorm(1 - p_val / 2)          # two-tailed Z-score
  bayes_bound <- sapply(p_val, bayes_factor)  # bayes_factor() is not vectorized
  info_table <- data.frame(bits, states, p_val, one_side_z, two_side_z, bayes_bound)
  print(info_table)
}
```
If you call the shannon_table function with two positive integer arguments (i.e., a lower and upper number of bits of information), then for each number of bits n it will calculate:
- The number of unsigned values representable in n bits, 2^n
- A decimal that can be interpreted as a p-value, \frac{1}{2^n}
- A one-tailed Z-score (positive), \Phi^{-1}\left(1-\frac{1}{2^n}\right)
- A two-tailed Z-score, \Phi^{-1}\left(1-\frac{1}{2^{n+1}}\right)
- The Bayes Factor Bound \frac{1}{-e \, p \, \ln(p)} for precise null hypotheses
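For example, the call below reproduces the table shown after it:

```r
shannon_table(3, 25)
```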
Here is the table for the range of 3 bits to 25 bits of information. To get a sense of scale, the Higgs Boson discovery had a Z-score a bit higher than 5, which is about 22 bits of information (see the sketch after the table).
bits | states | p_val | one_side_z | two_side_z | bayes_bound |
---|---|---|---|---|---|
3 | 8 | 0.125 | 1.15034938037601 | 1.53412054435255 | 1.41530187313445 |
4 | 16 | 0.0625 | 1.53412054435255 | 1.86273186742165 | 2.12295280970167 |
5 | 32 | 0.03125 | 1.86273186742165 | 2.15387469406146 | 3.39672449552267 |
6 | 64 | 0.015625 | 2.15387469406146 | 2.4175590162365 | 5.66120749253779 |
7 | 128 | 0.0078125 | 2.4175590162365 | 2.66006746861746 | 9.70492713006477 |
8 | 256 | 0.00390625 | 2.66006746861746 | 2.88563491242676 | 16.9836224776134 |
9 | 512 | 0.001953125 | 2.88563491242676 | 3.09726907819878 | 30.1931066268682 |
10 | 1024 | 0.0009765625 | 3.09726907819878 | 3.29719334569196 | 54.3475919283627 |
11 | 2048 | 0.00048828125 | 3.29719334569196 | 3.48710410411443 | 98.8138035061141 |
12 | 4096 | 0.000244140625 | 3.48710410411443 | 3.66832928512132 | 181.158639761209 |
13 | 8192 | 0.0001220703125 | 3.66832928512132 | 3.84193068550191 | 334.446719559155 |
14 | 16384 | 6.103515625e-05 | 3.84193068550191 | 4.00877259416859 | 621.115336324146 |
15 | 32768 | 3.0517578125e-05 | 4.00877259416859 | 4.1695693233491 | 1159.41529447174 |
16 | 65536 | 1.52587890625e-05 | 4.1695693233491 | 4.32491904082604 | 2173.90367713451 |
17 | 131072 | 7.62939453125e-06 | 4.32491904082604 | 4.4753284246542 | 4092.05398048849 |
18 | 262144 | 3.814697265625e-06 | 4.4753284246542 | 4.62123100149925 | 7729.43529647826 |
19 | 524288 | 1.9073486328125e-06 | 4.62123100149925 | 4.76300103426781 | 14645.2458249062 |
20 | 1048576 | 9.5367431640625e-07 | 4.76300103426781 | 4.90096420796319 | 27825.9670673217 |
21 | 2097152 | 4.76837158203125e-07 | 4.90096420796319 | 5.03540596946393 | 53001.8420329938 |
22 | 4194304 | 2.38418579101562e-07 | 5.03540596946393 | 5.16657811972875 | 101185.334790261 |
23 | 8388608 | 1.19209289550781e-07 | 5.16657811972875 | 5.2947040848546 | 193571.944816151 |
24 | 16777216 | 5.96046447753906e-08 | 5.2947040848546 | 5.41998317491687 | 371012.894230956 |
25 | 33554432 | 2.98023223876953e-08 | 5.41998317491687 | 5.54259405780294 | 712344.756923436 |
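Going the other direction, a Z-score can be converted back into bits. This is a minimal sketch, not part of the code above; z_to_bits is just an illustrative helper name:

```r
# Bits of information in a one-tailed Z-score:
# bits = -log2(p), where p = P(Z > z) is the one-tailed p-value.
z_to_bits <- function(z) {
  -log2(pnorm(z, lower.tail = FALSE))
}

z_to_bits(5)  # ~21.7 bits, consistent with "a bit higher than 5 is about 22 bits"
```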
If you look closely at the Z-scores, you can see that a two-tailed test loses one bit of information.
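You can check this against the table (e.g., two_side_z at 3 bits equals one_side_z at 4 bits), and it follows directly from the formulas above: halving a p-value adds exactly one bit of surprise, since -\log_2\left(\frac{p}{2}\right) = -\log_2(p) + 1, so the two-tailed Z-score at n bits, \Phi^{-1}\left(1-\frac{1}{2^{n+1}}\right), is the one-tailed Z-score at n + 1 bits.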
The -e \, p \ln(p) bound is explicitly related to the odds in favor of H_1 \neq H_0.
I would think a similar best-case Bayes Factor for dividing hypotheses (i.e., H_1 > H_0 vs. H_1 \leq H_0) could be obtained by using the one-sided p-value, giving the best-case odds that the observed sign is the correct one, for any effect size (assuming a uniform prior).
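A minimal sketch of that idea, assuming the same bound formula is simply applied to the one-sided p-value (that assumption is exactly the open question above, not an established result):

```r
# ASSUMPTION: reuse the -e * p * log(p) bound on a one-sided p-value.
p <- 2^-22           # p-value worth 22 bits of information
bayes_factor(p)      # bound using p as a two-sided (precise-null) p-value
bayes_factor(p / 2)  # same bound applied to the corresponding one-sided p-value
```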