Using p-values, S-values, and the Bayesian Rejection Ratio to re-examine a meta-analysis on knee abduction kinematics and ACL injury risk

A colleague brought to my attention the following peer-reviewed meta-analysis of 12 prognostic studies that attempted to predict future ACL injury risk from knee kinematics.

Title: Do knee abduction kinematics and kinetics predict future anterior cruciate ligament injury risk? A systematic review and meta-analysis of prospective studies

The authors found 9 studies, using 3 different assessments, that attempt to detect excess angular motion as a predictor of future ACL injury. They reported 12 confidence intervals, which were grouped according to the assessment used. Figure 2 of their report shows 3 separate estimates of aggregate effect size, in either degrees or centimeters of displacement.

An examination of their aggregate compatibility intervals showed a shift in favor of the prognostic value of knee kinematics, but all of their computed intervals included 0, leading the authors to conclude:

Contrary to clinical opinion, our findings indicate that knee abduction kinematics and kinetics during weight-bearing activities may not be risk factors for future ACL injury.

Using only the data reported in Figure 2, a re-analysis that combines the p-values implied by the intervals, with \alpha selected using Bayesian reasoning, shows the opposite of the authors' conclusion: disbelieving the predictive value of knee kinematics requires prior odds in favor of the null so high as to be unreasonable.

Bayarri, Berger, and Sellke (2017) provide the rationale for a Bayesian interpretation of frequentist procedures.

Expressing Bayes’ Theorem in odds:

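O_{pre} = \frac{\pi_1}{\pi_0} \times \frac{1 - \bar\beta}{\alpha}

Here \frac{\pi_1}{\pi_0} is the prior odds of the alternative to the null, 1 - \bar\beta is the power, and \alpha is the type I error probability.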

The evidential value of an experiment designed to detect the existence of an effect is the ratio \frac{1 - \bar\beta}{\alpha}, a quantity they call the rejection ratio. By considering the context, the prior odds, power, and \alpha can be chosen to design an experiment that shifts our beliefs by the amount necessary for the decision at hand.

At conventional type I (0.05) and type II (0.2) error probabilities, a single experiment can only be expected to produce evidence of 16:1 (0.8 / 0.05) in favor of an alternative, conditional on actually obtaining a p-value below the specified \alpha. For hypotheses with lower prior odds, multiple conceptually similar studies must be performed and combined before the evidence is enough to overcome a skeptical prior.
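The 16:1 figure can be checked with a few lines of Python. This is only a sketch of the arithmetic. The function names are mine, and the 1:19 prior is the skeptical value assumed for this re-analysis, used here purely for illustration.

```python
def rejection_ratio(power: float, alpha: float) -> float:
    """Pre-experimental rejection ratio: (1 - beta) / alpha."""
    return power / alpha


def posterior_odds(prior_odds: float, power: float, alpha: float) -> float:
    """Odds of a true effect after a rejection: prior odds times rejection ratio."""
    return prior_odds * rejection_ratio(power, alpha)


# Conventional design: alpha = 0.05, type II error = 0.2 (power = 0.8).
conventional_ratio = rejection_ratio(power=0.8, alpha=0.05)

# Illustrative skeptical prior: odds of 1:19 in favor of an effect.
skeptical_posterior = posterior_odds(prior_odds=1 / 19, power=0.8, alpha=0.05)

print(f"rejection ratio: {conventional_ratio:.1f}:1")
print(f"posterior odds from a 1:19 prior: {skeptical_posterior:.2f}:1")
```

With a 1:19 prior, one conventional "significant" study leaves the posterior odds below even, which is why combining studies matters.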

For the purpose of this re-analysis, the following assumptions are made:

  1. Prior odds on detecting an effect = 1:19; Prior probability 0.05 in favor of effect.
  2. Type II error of a single study = 0.6; Power = 0.4. \alpha = 0.05. Since \frac{0.4}{0.05} > 1, a single study has information value, but is not likely to reject the null reference hypothesis of no effect on its own.
  3. Target Posterior odds 9:1 in favor of kinematics predicting injury risk.
  4. Combined power = 1 - 0.6^{12} \approx 0.998
  5. Solving for \alpha = \frac{0.998}{9 \times 19} \approx 0.0058, rounded down to 0.005.

Despite discounting the power of any individual study, combining the studies and reducing \alpha by a factor of 10 (from 0.05 to 0.005) provides posterior odds greater than 9:1 in favor of a true effect in at least 1 study, if our observed p-value is lower than this level (specified before looking at the data).
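The assumptions above reduce to a short calculation. The sketch below reproduces it in Python. The variable names are mine, and treating the 12 intervals as independent is the same simplification used in the list.

```python
# Assumptions taken from the list above.
prior_odds = 1 / 19            # 1:19 in favor of an effect
single_study_power = 0.4       # type II error of 0.6 per study
n_intervals = 12               # intervals reported in Figure 2
target_posterior_odds = 9      # 9:1 in favor of an effect

# Probability that at least one of the 12 tests rejects if an effect exists,
# treating the tests as independent.
combined_power = 1 - (1 - single_study_power) ** n_intervals

# Solve prior_odds * combined_power / alpha = target_posterior_odds for alpha.
alpha_needed = prior_odds * combined_power / target_posterior_odds

# Posterior odds actually obtained with the rounded-down threshold.
alpha_used = 0.005
posterior_at_alpha_used = prior_odds * combined_power / alpha_used

print(f"combined power:            {combined_power:.4f}")
print(f"alpha for 9:1 posterior:   {alpha_needed:.4f}")
print(f"posterior odds at 0.005:   {posterior_at_alpha_used:.1f}:1")
```

Rounding the threshold down to 0.005 is conservative: it yields posterior odds somewhat above the 9:1 target.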

P-values are difficult to interpret. Greenland and Rafi (2020) recommend an alternative representation called the S-value: the base-2 log transformation of the p-value, \log_2(\frac{1}{p}) = -\log_2(p). This rescaling expresses the information against the asserted test hypothesis in bits, and is known as the surprisal. Intuitively, it compares the observed p-value to a run of flips of a fair coin: an S-value of s is about as surprising as seeing s heads in a row. Very low p-values translate to high surprisal values, indicating that the data are surprising under the asserted model.

The important point: a surprisal of 0 asserts there is no information, not that the information supports the null.
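As a quick illustration of the transformation, here is a minimal Python sketch. The function name `s_value` is mine.

```python
import math


def s_value(p: float) -> float:
    """Surprisal in bits: log2(1 / p)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1 / p)


# p = 1 carries no information; p = 0.5 is one coin flip; p = 0.05 is ~4.3 bits.
for p in (1.0, 0.5, 0.05, 0.005):
    print(f"p = {p:<6} S = {s_value(p):.2f} bits")
```

A p-value of 1 maps to 0 bits, which is the "no information" case, not support for the null.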

The data reported in Figure 2 were manually extracted and entered into a spreadsheet (LibreOffice Calc). One-sided p-values can be recovered by the procedure described in Knapp, Hartung, and Sinha (p. 31) (link) and steps 1 and 2 of Altman (2011) (link).
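For readers who prefer code to a spreadsheet, the recovery step can be sketched as below. It assumes the reported intervals are 95% intervals based on a normal approximation (so the interval width is about 3.92 standard errors), and that the one-sided p-value is the lower tail, i.e. the predicted direction is negative. The function name is mine, and the example inputs are the first Hewett row.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def one_sided_p_from_ci(mean: float, lower: float, upper: float, level: float = 0.95):
    """Recover the standard error, Z score, and lower-tail p-value from a CI.

    Assumes a symmetric normal-approximation interval at the given level.
    """
    z_crit = STD_NORMAL.inv_cdf(0.5 + level / 2)   # about 1.96 for 95%
    std_error = (upper - lower) / (2 * z_crit)     # width / 3.92
    z = mean / std_error
    p_lower_tail = STD_NORMAL.cdf(z)               # one-sided, negative direction
    return std_error, z, p_lower_tail


# Example: Hewett, VDJ 3d IC (mean -8.11, interval -14.25 to -1.97).
se, z, p = one_sided_p_from_ci(-8.11, -14.25, -1.97)
print(f"SE = {se:.3f}, Z = {z:.3f}, one-sided p = {p:.4f}")
```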

The equally weighted Stouffer method will be used to combine the Z scores.

| Author | Number | Task | Factor | Mean | Sign | Lower | Upper | CI Diff | Std Error | Z score | 1-sided P | Study S-val | Log Bayes | Study Bayes’ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Raisanen | 1 | SLS | 2d peak | -7.71 | -1 | -18.09 | 2.67 | 20.76 | 5.296 | -1.456 | 0.0727 | 3.782 | -0.658 | 1.930 |
| Dingenen | 2 | SDJ | 2d peak | -0.7 | -1 | -8.19 | 6.79 | 14.98 | 3.821 | -0.183 | 0.4273 | 1.227 | -0.012 | 1.013 |
| Georger | 3 | VDJ | 3d IC | 1.65 | 1 | -2.56 | 5.86 | 8.42 | 2.148 | 0.768 | 0.7788 | 0.361 | 0.636 | 0.529 |
| Hewett | 4 | VDJ | 3d IC | -8.11 | -1 | -14.25 | -1.97 | 12.28 | 3.133 | -2.589 | 0.0048 | 7.698 | -2.662 | 14.319 |
| Crosshaug | 5 | VDJ | 3d IC | -0.5 | -1 | -2.01 | 1.01 | 3.02 | 0.770 | -0.649 | 0.2582 | 1.954 | -0.051 | 1.052 |
| Leppanen | 6 | VDJ | 3d IC | -2.7 | -1 | -6.15 | 0.75 | 6.9 | 1.760 | -1.534 | 0.0625 | 3.999 | -0.753 | 2.122 |
| Georger | 7 | VDJ | 3d Peak | 1.33 | 1 | -3.97 | 6.63 | 10.6 | 2.704 | 0.492 | 0.6886 | 0.538 | 0.359 | 0.698 |
| Hewett | 8 | VDJ | 3d Peak | -7.6 | -1 | -13.36 | -1.84 | 11.52 | 2.939 | -2.586 | 0.0049 | 7.687 | -2.655 | 14.226 |
| Nilstad | 9 | VDJ | 3d Peak | -0.65 | -1 | -5.55 | 4.25 | 9.8 | 2.500 | -0.260 | 0.3974 | 1.331 | -0.003 | 1.003 |
| Crosshaug | 10 | VDJ | MKD | -0.3 | -1 | -0.78 | 0.18 | 0.96 | 0.245 | -1.225 | 0.1103 | 3.181 | -0.414 | 1.513 |
| Leppanen | 11 | VDJ | MKD | 0.4 | 1 | -0.63 | 1.43 | 2.06 | 0.526 | 0.761 | 0.7767 | 0.365 | 0.628 | 0.533 |
| Numata | 12 | VDJ | MKD | -1.3 | -1 | -3.34 | 0.84 | 4.18 | 1.066 | -1.219 | 0.1114 | 3.166 | -0.409 | 1.505 |
| Sum | | | | | | | | | | -9.680 | | | -5.992 | 400.406 |

Stouffer Test: -2.794
p value: 0.003
Omnibus S-val: 8.587

Looking at the Z score column, the Stouffer method gives an aggregate Z of -2.794, which corresponds to a one-sided p-value of 0.003. Because this is below the pre-specified \alpha of 0.005, a Bayesian interpretation of the p-value combination suggests, under the assumptions above, that at least 1 study detected a true effect with \ge 90% posterior probability.
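The combination can be reproduced from the Z score column alone. The following is a sketch of the equally weighted Stouffer calculation in Python, using the rounded Z scores from the table, so the last digit may differ slightly from the spreadsheet.

```python
import math
from statistics import NormalDist

# Z scores copied from the table (rounded to 3 decimals).
z_scores = [-1.456, -0.183, 0.768, -2.589, -0.649, -1.534,
            0.492, -2.586, -0.260, -1.225, 0.761, -1.219]

# Equally weighted Stouffer: sum of Z scores divided by sqrt(k).
stouffer_z = sum(z_scores) / math.sqrt(len(z_scores))

# One-sided (lower-tail) p-value and its S-value in bits.
combined_p = NormalDist().cdf(stouffer_z)
omnibus_s = -math.log2(combined_p)

print(f"Stouffer Z    = {stouffer_z:.3f}")
print(f"one-sided p   = {combined_p:.4f}")
print(f"omnibus S-val = {omnibus_s:.2f} bits")
print("below pre-specified alpha of 0.005:", combined_p < 0.005)
```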

Alternatively, we could say that, conditional on the assumption that there was no effect in any study, these data provide about 8 bits of information against the null.

Later, I’ll describe how I’d use the log Bayes factor bound.


Thanks for this interesting comparison of summary techniques, Robert! One point that your table makes very clear about S-values is that the 8 bits of evidence accumulated thus far against the null can only increase with addition of more studies. (This has long been appreciated; my point is only that the additive nature of the S-value makes this especially clear.)

If you were employing S-values as advised by Rafi & Greenland (2020), you would be “displaying the compatibility of the observed data with various hypotheses of interest, rather than focusing on single hypothesis tests or interval estimates.” Do the underlying papers enable such deep examination of substantive biomechanics & sports-science theories?


Just a few points:

  1. What I call the Omnibus S-value is merely the inverse normal p-value (the cell above) re-scaled to bits of information. To combine the p-values directly, I’d use Fisher’s method (converting only to natural logs), get the p-value from the Chi-squared distribution, then take the base 2 log of that for better context. It shouldn’t matter which log transformation is used, but the Fisher method has long been studied, and it is easy to check with tables or other software.

AFAIK, the Fisher method has the best Bayesian characteristics, but the Stouffer method has a very high correlation with it, and it presents the information on a symmetrical scale, so you can’t miss an effect in the opposite direction (at least when combining one-tailed p-values).
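For comparison, here is a sketch of Fisher's method applied to the one-sided p-values in the table. To keep it free of dependencies it uses the closed-form upper tail of the Chi-squared distribution, which is valid only for even degrees of freedom. The helper name `chi2_sf_even_df` is mine. The result is not one I reported above, so treat it as a cross-check rather than part of the original analysis.

```python
import math

# One-sided p-values copied from the table.
p_values = [0.0727, 0.4273, 0.7788, 0.0048, 0.2582, 0.0625,
            0.6886, 0.0049, 0.3974, 0.1103, 0.7767, 0.1114]


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability of a Chi-squared variable with even df.

    Uses the identity sf(x; 2k) = exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


# Fisher's statistic: -2 * sum(ln p_i), Chi-squared with 2k degrees of freedom.
fisher_stat = -2 * sum(math.log(p) for p in p_values)
fisher_df = 2 * len(p_values)
fisher_p = chi2_sf_even_df(fisher_stat, fisher_df)
fisher_s = -math.log2(fisher_p)

print(f"Fisher statistic = {fisher_stat:.2f} on {fisher_df} df")
print(f"combined p       = {fisher_p:.4f}")
print(f"S-value          = {fisher_s:.2f} bits")
```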

To do what is suggested in that paper would require the individual study data, which is not available to me.

  2. The only “significant” change I’ve made to the standard use of these p-value combination methods is the introduction of the prior odds as recommended by Bayarri, Berger, and Sellke in that great paper.
    I’m wondering if it might be better to use it in a reverse Bayes fashion to derive the priors, i.e. the prior needed to move from skeptical to uncertain, and from skeptical to “beyond reasonable doubt” (posterior prob > 0.9).
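A rough sketch of what that reverse Bayes calculation could look like, reusing the combined power and \alpha from the original post. The function names are mine, and reading "uncertain" as posterior odds of 1:1 is my own assumption.

```python
def prior_odds_required(target_posterior_odds: float, power: float, alpha: float) -> float:
    """Prior odds that a rejection at `alpha` would lift to the target posterior odds."""
    return target_posterior_odds * alpha / power


def odds_to_probability(odds: float) -> float:
    return odds / (1 + odds)


combined_power = 1 - 0.6 ** 12   # 12 intervals, power 0.4 each
alpha = 0.005

# "Uncertain": posterior odds of 1:1 (an assumed reading of the term).
prior_for_uncertain = prior_odds_required(1.0, combined_power, alpha)

# "Beyond reasonable doubt": posterior odds of 9:1 (posterior prob 0.9).
prior_for_beyond_doubt = prior_odds_required(9.0, combined_power, alpha)

for label, odds in (("uncertain (1:1)", prior_for_uncertain),
                    ("beyond reasonable doubt (9:1)", prior_for_beyond_doubt)):
    print(f"{label}: prior odds {odds:.4f}, prior prob {odds_to_probability(odds):.4f}")
```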

By being extremely charitable to the authors in terms of assumptions, I think the analysis demonstrates the narrative section of the paper is incorrect.

  3. I tried looking up some of the studies. Most were unavailable, except the last one by Numata. The CI that the meta-analysis reports for that study is wrong. Numata reported a p-value of 0.006, using a different assessment than the one behind the strong findings by Hewett, which is a very surprising result. I’ll post the corrected data later in the week.

So, in their 12 reported intervals, 3 had p-values below 0.005, but their aggregation of estimates could not find any signal. It wasn’t directly stated, but I’m betting they used the typical DerSimonian-Laird method, which is known to have low power with the number of studies they used.

What I find remarkable – everyone places meta-analysis at the top of the “evidence pyramid”, but the way they interpret “statistical significance” for single studies is essentially inconsistent with the idea of combining studies to maximize power to detect effects.
