Stability of results from small RCTs

> The ANDROMEDA-SHOCK trial discussion on this platform considered how a Bayesian approach might allow a more nuanced interpretation of smaller RCTs whose results don’t reach “statistical significance,” but where the early data seem to be “leaning” in one direction.

While I agree with our host that Bayesian methods are generally preferable, there are simple transforms of p values that can minimize this error. I took a recommendation of Sander Greenland seriously, and started studying p value transformations, also known as “omnibus significance tests” or “p value combination methods.”
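
As a concrete illustration of a p value combination method, here is a minimal sketch of Fisher’s method, one of the oldest omnibus tests (the function name `fisher_combine` is my own, not from any package):

```r
# Fisher's method: combine k independent p values into one omnibus test.
# Under the global null, -2 * sum(log(p)) follows a chi-squared
# distribution with 2k degrees of freedom.
fisher_combine <- function(p) {
  stat <- -2 * sum(log(p))
  pchisq(stat, df = 2 * length(p), lower.tail = FALSE)
}

# Three individually "non-significant" p values combine to about 0.02:
fisher_combine(c(0.08, 0.11, 0.06))
```

Note how three studies that a vote counter would call “negative” jointly give fairly strong evidence against the null.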

By following the logic of testing methods to this point, I ultimately have to agree with Professor Harrell’s recommendation that the precision of the estimate is more informative than emphasis on power of a test.

Blockquote
Basing sample size calculations on the margin of error can lead to a study that gives scientifically relevant results, even if the results are not statistically significant. [emphasis in original – BBR Section 6.7]

The more I compare the theory of p values with their actual use in practice, the more difficult I find it to accept research summaries that sort trials into “positive” and “negative.”

I had a colleague ask me “Why is it that I can find hundreds of studies demonstrating positive effects (often with sample sizes in the hundreds), while a meta-analysis will reject the effect?” This was in the exercise science and physical rehabilitation literature.

This is a complicated issue. It appears 2 errors are being made:

  1. the informal use of improper vote count procedures based on significance tests, and
  2. improper understanding of hypothesis tests and confidence intervals (at the meta-analysis level).
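
To see why vote counting misleads, here is a quick sketch (my own, separate from the four functions below): simulate many studies of a real standardized effect of about 0.53 with 15 subjects per group, and count how many reach p < 0.05. The test here points in the direction of the true effect (`alternative = "less"`, since the second group has the higher mean); power is only about 0.4, so most studies are “negative” even though every one samples a true effect:

```r
# Vote counting under a true effect with low power (illustrative sketch).
set.seed(42)
n_studies <- 200
p <- replicate(n_studies,
               t.test(rnorm(15, 100, 15), rnorm(15, 108, 15),
                      alternative = "less")$p.value)
mean(p < 0.05)  # proportion of "positive" votes; roughly 0.4
```

A vote counter would conclude the effect is doubtful; a proper combination of the same 200 p values would find it overwhelming.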

I wrote the following four R functions. To keep things simple, the simulations assume a fixed effect for all studies. It would be interesting to extend them to a random-effects distribution of p values.

The first two simulate the behavior of p values under the null and under an alternative of “moderate” effect (a standardized effect size of 8/15 ≈ 0.53 in the code below). When the null is true, the median p value will fall between 0.25 and 0.75.

1. P values under a true (point) null hypothesis.
In this scenario, p values take on a uniform distribution. Calling this
function with any integer N is equivalent to simulating N studies with a true difference of 0.

%%% Code for null P value distribution  %%% 

null_true_p <- function(n) {
  # Under a true point null, p values are uniform on [0, 1]
  null <- runif(n)
  result <- quantile(null, probs = c(0, 0.025, 0.05, 0.25, 0.50, 0.75, 0.95, 0.975, 1))
  hist(null)
  result
}

%%% End for Null P value distribution %%% 

2. P values under a true difference (standardized effect size of 0.53)
The median p value varies with the number of studies, but with small numbers of studies (a common scenario in meta-analysis) it frequently falls between 0.75 and 0.95 here, because the one-sided test in the code points opposite to the true (y > x) effect. Note the skew of the distribution. Skeptics will almost always be able to find a large number of “contradictory” studies by going on “significance” alone.

%%%  Code for moderate true effect   %%%

null_false_p <- function(sample, sim_size) {
  p_vals <- numeric(sim_size)
  for (i in 1:sim_size) {
    x <- rnorm(sample, 100, 15)
    y <- rnorm(sample, 108, 15)
    # one-sided test pointing away from the true (y > x) effect,
    # so the p values pile up near 1
    p_vals[i] <- t.test(x, y, alternative = "greater")$p.value
  }
  result <- quantile(p_vals, probs = c(0, 0.025, 0.05, 0.25, 0.50, 0.75, 0.95, 0.975, 1))
  hist(p_vals)
  result
}

%%%  End Code for Moderate Effect %%% 
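
As a rough analytic check (a normal-approximation sketch of my own; see the reference at the end for exact results), the median of these p values can be computed directly:

```r
# Approximate median of the one-sided p value when, as above, the test
# points opposite to the true effect (n = 15 per group, difference 8, sd 15).
n <- 15; delta <- 8; s <- 15
ncp <- (delta / s) * sqrt(n / 2)  # noncentrality, about 1.46
pnorm(ncp)                        # about 0.93, inside the 0.75-0.95 range
```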

3. Meta-analysis under the null
The following code simulates a meta-analysis based on p value combination using Stouffer’s method (i.e., the inverse normal transform). Under the null the combined statistic is standard normal: 90% of the Z values fall within \pm 1.645, 98% within \pm 2.326, and 99.8% within \pm 3.09 (one-tailed: 95%, 99%, and 99.9% fall below 1.645, 2.326, and 3.09, respectively).
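
For reference, the statistic as implemented below is

Z_k = \frac{1}{\sqrt{k}} \sum_{i=1}^{k} \Phi^{-1}(p_i),

where k is the number of studies combined and \Phi^{-1} is qnorm. Each \Phi^{-1}(p_i) is standard normal under the null, so Z_k is standard normal as well. (The textbook convention uses \Phi^{-1}(1 - p_i); the code’s qnorm(p_i) simply flips the sign.)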

%%% Code for Meta under null %%% 

true_null_meta <- function(sample, sim_size) {
  stouffer <- numeric(sim_size)
  for (i in 1:sim_size) {
    # one simulated meta-analysis: combine `sample` null p values
    null <- runif(sample)
    z_score <- qnorm(null)
    stouffer[i] <- sum(z_score) / sqrt(sample)
  }
  result <- quantile(stouffer, probs = c(0, 0.025, 0.05, 0.25, 0.50, 0.75, 0.95, 0.975, 1))
  hist(stouffer)
  result
}
%%% End of code for Null Meta %%% 

4. Meta-analysis with a true effect
Finally, here is Stouffer’s method when the alternative is true and there is a real difference. The magnitude of Stouffer’s statistic depends on the number of studies combined (as well as on the magnitude of the effect).

%%% Code for Meta of True Moderate Effect %%%

false_null_meta <- function(sample, sim_size) {
  # sim_size studies, each with `sample` subjects per group
  p_vals   <- numeric(sim_size)
  z_scores <- numeric(sim_size)
  stouffer <- numeric(sim_size)
  for (i in 1:sim_size) {
    x <- rnorm(sample, 100, 15)
    y <- rnorm(sample, 108, 15)
    p_vals[i]   <- t.test(x, y, alternative = "greater")$p.value
    z_scores[i] <- qnorm(p_vals[i])
    # running Stouffer statistic after combining the first i studies
    stouffer[i] <- sum(z_scores[1:i]) / sqrt(i)
  }
  result_z <- quantile(z_scores, probs = c(0, 0.025, 0.05, 0.25, 0.50, 0.75, 0.95, 0.975, 1))
  result_s <- quantile(stouffer, probs = c(0, 0.025, 0.05, 0.25, 0.50, 0.75, 0.95, 0.975, 1))
  hist(stouffer)
  print(result_z)
  result_s
}

%%% End of Code for Meta of True Effect %%% 

Those who want to run the sims can do so at Run R code online. The plain-text code can be found on Pastebin here.

You can call them, supplying the appropriate integer arguments, e.g.:

null_true_p(30)
null_false_p(15,30)

This will simulate 30 random p values and 30 skewed p values, with each p in the latter based on a sample size of 15 per group. A histogram will be drawn for both results.

Each meta-analysis simulation likewise requires two values. In true_null_meta, sample is the number of p values combined per meta-analysis and sim_size is the number of meta-analyses simulated; in false_null_meta, sample is the per-group sample size of each study and sim_size is the number of studies combined.

The code is very rough at this point. I didn’t account for power in these simulations; that is handled informally by increasing the sample size and/or the number of studies. Building power in explicitly would be a useful addition for developing intuition about a collection of studies. But I think the code helps illustrate the common problems with p value and “significance” test (mis)interpretations.
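
For a quick check, base R’s stats::power.t.test gives the power of the simulated design directly (n = 15 per group, mean difference 8, SD 15, one-sided α = 0.05):

```r
# Power of a two-sample t test matching the simulation settings above.
power.t.test(n = 15, delta = 8, sd = 15, sig.level = 0.05,
             type = "two.sample", alternative = "one.sided")$power
# roughly 0.41
```

So even with a genuine standardized effect of about 0.53, fewer than half of such studies would reach “significance.”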

Reference
Bhattacharya, B., & Habtzghi, D. (2002). Median of the p value under the alternative hypothesis. The American Statistician, 56(3), 202–206.
