Language for communicating frequentist results about treatment effects

Except for misguided variable selection :frowning:

1 Like

Very nice and useful !

1 Like

This proposal to transform p values into logs is nearly identical to Fisher’s procedure to combine significance tests in a meta-analysis.

(The Wikipedia link uses natural logs, but Hedges and Olkin have it expressed in logs without necessarily clarifying if it is base 2 or not).

The team of Elena Kulinskaya, Stephan Morgenthaler, and Robert G. Staudte published a book in 2008 on Meta-analysis that takes the approach of defining “evidence” as the inverse normal transform of the p value \Phi^{-1}(1-p). Their techniques look very similar to a weighted Stouffer combined test which is an average of Z scores:

T_{1..k} = \frac{\Sigma{\surd{n_{1}}(S_1)}...{\surd{n_{k}}(S_k)}}{\Sigma{\surd{n_{1}...\surd{n_{k}}}}} where S is a variance stabilizing transform of the statistic used to calculate the p value.

Instead of using the statistic to “test” hypotheses, they point out that this transform can be used to quantify “evidence” of the alternative (without specifying any particular effect). In this framework, evidence is always a random quantity, with a standard error of \pm{1}.

‘Weak’ evidence for (unspecified) alternative is T = \pm 1.64 , that corresponds to a p = 0.05,
‘Moderate’ evidence is T= \pm 3.3, that corresponds to a p = 0.0004
‘Strong’ evidence is T= \pm 5, corresponding to a p value somewhere near 2.86 * 10^{-7}

In favor of this transform, I notice that some common p value errors can be avoided:

  1. absence of evidence is evidence of absence.
  2. can demonstrate Simpson’s paradox in relation to p values (ie. a number of ‘not significant’ results can in aggregate provide evidence for an effect, and vice versa).
  3. It can be useful in meta-analysis.

While it doesn’t solve all of the problems with P values, it seems to alleviate some of the most troubling ones.

My confusion stems from the caution expressed in Hedges and Olkin Statistical Methods for Meta-Analysis, where they state:

…an investigator may incorrectly conclude that because H_{0} is rejected, the treatment effects are greater than zero … Alternatively, an investigator might incorrectly conclude that the treatment showed consistent effect across studies…

It would seem that the authors above are making interpretations that are warned against by a widely cited text. Or are they both correct, in their own way?

Google preview of book:

Journal of Evolutionary Biology banner


Free Access

Is the weighted z ‐test the best method for combining probabilities from independent tests?


Theory has proven that no method is uniformly most powerful (Birnbaum, 1954). However, it is possible that under a certain situation, one particular method may outperform others. For example, through simulation Whitlock (Whitlock, 2005) has shown that when all the studies have the same effect sizes, the weighted z ‐test outperforms both regular z ‐test and Fisher method. In this paper, we use simulation to show that under the same situation, the generalized Fisher method is more powerful than the weighted z ‐test.

Long story short : jury is still out, and statisticians are still rising to the challenge of rainy afternoons by running simulation studies.


Thanks. I do know that each of these p value combination methods has a scenario where it works best.

I was looking for a transform that could be easily taught that prevents the lengthy list of cognitive errors statisticians have been complaining about for decades.

The Kulinskaya, Morgenthaler, and Staudte book persuaded me that the modified z score test, with weights based on \surd{n} from each study, is a very reasonable alternative scale that will prevent the “absence of evidence” fallacy, and can also show how a number of “insignificant” studies can in aggregate, demonstrate an effect.

I went back and used the original Stouffer procedure on the P values in the Greenland, Amrheim, and McShane article in Nature. Using the evidential interpretation, there was moderate evidence of a positive association even without the sample sizes.

Now I’d like to get back to single study interpretations, the original point of this discussion topic. Thanks.

A variant may be to conclude that the study did not achieve its primary aim, so “no effect”. For example, the GOLD study. What do you think about the interpretation of the result in the abstract?
Is it possible that in these large trials the language is a bad adaptation of the jargon of the regulatory agencies?

“The GOLD study did not meet its primary objective of showing a significant improvement in overall survival with olaparib in the overall or ATM-negative population of Asian patients with advanced gastric cancer.” is not bad but “significant” doesn’t help.

I don’t know if with 2 more patients the conclusion would have been the opposite, look above:
Overall survival did not differ between treatment groups in the overall patient population (median overall survival 8·8 months [95% CI 7·4–9·6] in the olaparib group vs 6·9 months [6·3–7·9] in the placebo group; HR 0·79 [97·5% CI 0·63–1·00]; p=0·026) or in the ATM-negative population (12·0 months [7·8–18·1] vs 10·0 months [6·4–13·3]; 0·73 [0·40–1·34]; p=0·25).

I’ve been following this discussion & it reminded me that 3 years ago I wrote a draft of a paper extending the concept of “fragility” to the derivation of diagnostic algorithms based on statistical metric thresholds. I wasn’t convinced of the value (or indeed correct interpretation) of what I’d done, so I’ve not tried to polish or publish it. Nevertheless, the concept of just how susceptible an algorithm is to the vagaries of chance recruitment of a small number of patients is something that is worth thinking about.
I’ve put the draft I wrote as a pre-print on ResearchGate:

That’s horrible!
I guess there is a discussion to be had about whether the survival benefit is worth it (it’s small, drug probably expensive, side effects) but just dismissing this as “no difference” seems to me not to be serving patients well.

Unfortunately, statisticians have allowed and even promoted the use of the p-value threshold as the exclusive determinant of intervention effectiveness. People that rely on these determinations are rightly concerned about results near the threshold, where the determination might have swung in the opposite direction if only one or two study participants had experienced a different outcome. That we now have the “fragility index” is a consequence of the unfortunate dominance of this approach to inference. I agree with @f2harrell that it’s a band-aid.