The ANDROMEDA-SHOCK trial discussion on this platform considered how a Bayesian approach might allow a more nuanced interpretation of smaller RCTs whose results don’t reach “statistical significance,” but where the early data seem to be “leaning” in one direction.
While I agree with our host that Bayesian methods are generally preferable, there are simple transforms of p values that can reduce this kind of misinterpretation. I took a recommendation of Sander Greenland seriously and started studying p value transformations, also known as “omnibus significance tests” or “p value combination methods.”
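As a minimal illustration of what these combination methods do, here is a toy sketch with made-up p values (the numbers and variable names are mine, not from any study): none of the individual p values reaches 0.05, yet both Fisher’s and Stouffer’s combined tests do.

p_toy <- c(0.08, 0.20, 0.12, 0.35, 0.06)   # hypothetical one-sided p values, all "non-significant"

# Fisher's method: -2 * sum(log(p)) is chi-square with 2k df under the null
fisher_stat <- -2 * sum(log(p_toy))
fisher_p <- pchisq(fisher_stat, df = 2 * length(p_toy), lower.tail = FALSE)

# Stouffer's method: sum of z scores divided by sqrt(k) is N(0, 1) under the null
stouffer_z <- sum(qnorm(p_toy, lower.tail = FALSE)) / sqrt(length(p_toy))
stouffer_p <- pnorm(stouffer_z, lower.tail = FALSE)

c(fisher = fisher_p, stouffer = stouffer_p)   # both combined p values fall below 0.05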
Following the logic of testing methods to its conclusion, I ultimately have to agree with Professor Harrell’s recommendation that the precision of the estimate is more informative than an emphasis on the power of a test.
> Basing sample size calculations on the margin of error can lead to a study that gives scientifically relevant results, even if the results are not statistically significant. [emphasis in original – BBR Section 6.7]
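As a rough sketch of what such a precision-based calculation might look like (my own illustration, not code from BBR; the function name n_for_moe and the example numbers are made up), one can solve for the per-group n that keeps the expected margin of error of a 95% CI for a difference in means below a chosen value:

n_for_moe <- function(moe, sd, conf = 0.95) {
  # per-group n so the half-width of the CI for a two-group difference in means
  # is at most `moe`, assuming a common SD and a normal approximation
  z <- qnorm(1 - (1 - conf) / 2)
  ceiling(2 * (z * sd / moe)^2)
}

n_for_moe(moe = 5, sd = 15)   # e.g., a margin of error of 5 points when SD = 15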
The more I compare and contrast the theory of p values with their actual use in practice, the more difficult I find it to accept research summaries of “positive” and “negative” trials.
I had a colleague ask me “Why is it that I can find hundreds of studies demonstrating positive effects (often with sample sizes in the hundreds), while a meta-analysis will reject the effect?” This was in the exercise science and physical rehabilitation literature.
This is a complicated issue. It appears two errors are being made:
- the informal use of improper vote-counting procedures based on significance tests (see the short sketch after this list), and
- an improper understanding of hypothesis tests and confidence intervals (at the meta-analysis level).
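Here is a quick, hypothetical illustration of the first error, using the same design as the simulation functions below (a real difference of 8 points, SD 15, n = 15 per group): tallying studies as “positive” or “negative” by the p < 0.05 criterion makes the evidence look far weaker than it is.

set.seed(1)
k <- 40   # number of hypothetical studies
p <- replicate(k, t.test(rnorm(15, 100, 15), rnorm(15, 108, 15))$p.value)
table(significant = p < 0.05)   # most studies are "negative" despite a true effect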
I wrote the following four R functions. To keep things simple, the simulations assume a fixed effect for all studies; it would be interesting to extend them to a random-effects distribution of p values.
The first two simulate the behavior of p values under the null and under an alternative with a “moderate” effect (a standardized effect size of about 0.53 in the code below). When the null is true, the median p value will typically fall between 0.25 and 0.75.
1. P values under a true (point) null hypothesis
In this scenario, p values take on a uniform distribution. Calling this function with any integer N is equivalent to simulating N studies with a true difference of 0.
%%% Code for null P value distribution %%%
null_true_p <- function(n) {
  # n simulated studies with a true difference of 0:
  # under the null, the p values are simply n draws from Uniform(0, 1)
  null <- runif(n)
  result <- quantile(null, probs = c(0, 0.025, 0.05, 0.25, 0.50, 0.75, 0.95, 0.975, 1))
  hist(null, main = "P values under a true null", xlab = "p value")
  result
}
%%% End for Null P value distribution %%%
2. P values under a true difference (standardized effect size of 0.53)
The median p value varies with how many studies are done, but with small numbers of studies (a common scenario in meta-analysis) it frequently falls between 0.75 and 0.95. Note the skew of the distribution: the one-sided test in the code is oriented toward the first group being larger, while the true effect runs the other way, so the p values pile up toward 1. Going by “significance” alone, skeptics will therefore almost always be able to find a large number of “contradictory” studies.
%%% Code for moderate true effect %%%
null_false_p <- function(sample, sim_size) {
  # sim_size studies, each comparing two groups of size `sample`
  # with true means 100 and 108, SD 15 (standardized effect of about 0.53).
  # Note: alternative = "greater" tests mean(x) > mean(y); since the true
  # effect runs the other way, the one-sided p values are skewed toward 1.
  p_vals <- numeric(sim_size)
  for (i in 1:sim_size) {
    x <- rnorm(sample, 100, 15)
    y <- rnorm(sample, 108, 15)
    p_vals[i] <- t.test(x, y, alternative = "greater")$p.value
  }
  result <- quantile(p_vals, probs = c(0, 0.025, 0.05, 0.25, 0.50, 0.75, 0.95, 0.975, 1))
  hist(p_vals, main = "P values under a moderate true effect", xlab = "p value")
  result
}
%%% End Code for Moderate Effect %%%
3. Meta-analysis under null
The following code simulates a meta-analysis based on p value combination using Stouffer’s method (i.e., the inverse-normal transform). Under the null, the combined statistic is standard normal, so 95% of the (one-tailed) values fall below 1.645, 99% below 2.326, and 99.9% below 3.09.
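These one-tailed reference points can be checked directly:

qnorm(c(0.95, 0.99, 0.999))   # approximately 1.645, 2.326, 3.090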
%%% Code for Meta under null %%%
true_null_meta <- function(sample, sim_size) {
  # sim_size simulated meta-analyses, each combining `sample` studies whose
  # null-true p values are Uniform(0, 1). Stouffer's method converts each p
  # to a z score and combines them as sum(z) / sqrt(k), which is N(0, 1) under the null.
  stouffer <- numeric(sim_size)
  for (i in 1:sim_size) {
    null_p <- runif(sample)
    z_score <- qnorm(null_p)
    stouffer[i] <- sum(z_score) / sqrt(sample)
  }
  result <- quantile(stouffer, probs = c(0, 0.025, 0.05, 0.25, 0.50, 0.75, 0.95, 0.975, 1))
  hist(stouffer, main = "Stouffer statistic under the null", xlab = "Combined Z")
  result
}
%%% End of code for Null Meta %%%
4. Meta-analysis with true effect
Finally, here is Stouffer’s method when the alternative is true and there is a real difference. The magnitude of Stouffer’s statistic will depend on the number of studies combined (as well as on the magnitude of the effect).
%%% Code for Meta of True Moderate Effect %%%
false_null_meta <- function(sample, sim_size) {
  # A single meta-analysis of sim_size studies, each comparing two groups of
  # size `sample` (true means 100 and 108, SD 15). stouffer[i] holds the
  # running Stouffer statistic, sum(z_1..z_i) / sqrt(i), after i studies.
  z_scores <- numeric(sim_size)
  p_vals <- numeric(sim_size)
  stouffer <- numeric(sim_size)
  for (i in 1:sim_size) {
    x <- rnorm(sample, 100, 15)
    y <- rnorm(sample, 108, 15)
    p_vals[i] <- t.test(x, y, alternative = "greater")$p.value
    z_scores[i] <- qnorm(p_vals[i])
    stouffer[i] <- sum(z_scores[1:i]) / sqrt(i)
  }
  result_z <- quantile(z_scores, probs = c(0, 0.025, 0.05, 0.25, 0.50, 0.75, 0.95, 0.975, 1))
  result_s <- quantile(stouffer, probs = c(0, 0.025, 0.05, 0.25, 0.50, 0.75, 0.95, 0.975, 1))
  hist(stouffer, main = "Running Stouffer statistic with a true effect", xlab = "Combined Z")
  print(result_z)
  result_s
}
%%% End of Code for Meta of True Effect %%%
Those who want to run the simulations can do so at https://rdrr.io/snippets/. The plain-text code can be found on Pastebin here.
You can call the first two functions by supplying the appropriate integers, for example:
null_true_p(30)
null_false_p(15,30)
This will simulate 30 random p values under the null and 30 skewed p values under the moderate true effect, with each individual p in the latter based on a per-group sample size of 15. A histogram will be drawn for both results.
For the meta-analysis simulations, each function requires two values. In true_null_meta, sample is the number of p values (studies) combined in each meta-analysis and sim_size is how many meta-analyses to simulate. In false_null_meta, sample is the per-group sample size of each individual study and sim_size is the number of studies combined in a single simulated meta-analysis.
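For example (the argument values here are arbitrary choices of mine):

true_null_meta(20, 1000)   # 1,000 meta-analyses, each combining 20 null p values
false_null_meta(15, 30)    # one meta-analysis of 30 small studies, n = 15 per group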
The code is very rough at this point. I didn’t account for power in these simulations; that is handled informally by increasing the sample size and/or the number of studies. Making power explicit would be a very useful addition for building intuition when examining a collection of studies, but I think the simulations already illustrate the common problems with p value and “significance” test (mis)interpretations.
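For instance, the per-study power for the design simulated above can be checked directly with power.t.test (a two-sided test is assumed here purely for illustration):

power.t.test(n = 15, delta = 8, sd = 15, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")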
Reference
Median of the p Value Under the Alternative Hypothesis