Except for misguided variable selection

# Language for communicating frequentist results about treatment effects

**R_cubed**#123

This proposal to transform p values into logs is nearly identical to Fisher’s procedure to combine significance tests in a meta-analysis.

(The Wikipedia link uses natural logs, but Hedges and Olkin have it expressed in logs without necessarily clarifying if it is base 2 or not).

The team of Elena Kulinskaya, Stephan Morgenthaler, and Robert G. Staudte published a book in 2008 on Meta-analysis that takes the approach of defining “evidence” as the inverse normal transform of the p value \Phi^{-1}(1-p). Their techniques look very similar to a weighted Stouffer combined test which is an average of Z scores:

T_{1..k} = \frac{\Sigma{\surd{n_{1}}(S_1)}...{\surd{n_{k}}(S_k)}}{\Sigma{\surd{n_{1}...\surd{n_{k}}}}} where S is a variance stabilizing transform of the statistic used to calculate the p value.

Instead of using the statistic to “test” hypotheses, they point out that this transform can be used to quantify “evidence” of the alternative (without specifying any particular effect). In this framework, evidence is always a random quantity, with a standard error of \pm{1}.

‘Weak’ evidence for (unspecified) alternative is T = \pm 1.64 , that corresponds to a p = 0.05,

‘Moderate’ evidence is T= \pm 3.3, that corresponds to a p = 0.0004

‘Strong’ evidence is T= \pm 5, corresponding to a p value somewhere near 2.86 * 10^{-7}

In favor of this transform, I notice that some common p value errors can be avoided:

- absence of evidence is evidence of absence.
- can demonstrate Simpson’s paradox in relation to p values (ie. a number of ‘not significant’ results can in aggregate provide evidence for an effect, and vice versa).
- It can be useful in meta-analysis.

While it doesn’t solve all of the problems with P values, it seems to alleviate some of the most troubling ones.

My confusion stems from the caution expressed in Hedges and Olkin Statistical Methods for Meta-Analysis, where they state:

Blockquote

…an investigator may incorrectly conclude that because H_{0} is rejected, the treatment effects are greater than zero … Alternatively, an investigator might incorrectly conclude that the treatment showed consistent effect across studies…

It would seem that the authors above are making interpretations that are warned against by a widely cited text. Or are they both correct, in their own way?

Google preview of book:

https://books.google.com/books?id=uHG7jZ8ZAJ8C

**RonanConroy**#124

(post withdrawn by author, will be automatically deleted in 24 hours unless flagged)

**RonanConroy**#125

SHORT COMMUNICATION

Free Access

## Is the weighted *z* ‐test the best method for combining probabilities from independent tests?

Theory has proven that no method is uniformly most powerful (Birnbaum, 1954). However, it is possible that under a certain situation, one particular method may outperform others. For example, through simulation Whitlock (Whitlock, 2005) has shown that when all the studies have the same effect sizes, the weighted

z‐test outperforms both regularz‐test and Fisher method. In this paper, we use simulation to show that under the same situation, the generalized Fisher method is more powerful than the weightedz‐test.

Long story short : jury is still out, and statisticians are still rising to the challenge of rainy afternoons by running simulation studies.

**R_cubed**#126

Thanks. I do know that each of these p value combination methods has a scenario where it works best.

I was looking for a transform that could be easily taught that prevents the lengthy list of cognitive errors statisticians have been complaining about for decades.

The Kulinskaya, Morgenthaler, and Staudte book persuaded me that the modified z score test, with weights based on \surd{n} from each study, is a very reasonable alternative scale that will prevent the “absence of evidence” fallacy, and can also show how a number of “insignificant” studies can in aggregate, demonstrate an effect.

I went back and used the original Stouffer procedure on the P values in the Greenland, Amrheim, and McShane article in Nature. Using the evidential interpretation, there was moderate evidence of a positive association even without the sample sizes.

**f2harrell**#127

Now I’d like to get back to single study interpretations, the original point of this discussion topic. Thanks.

**albertoca**#128

A variant may be to conclude that the study did not achieve its primary aim, so “no effect”. For example, the GOLD study. What do you think about the interpretation of the result in the abstract? https://www.sciencedirect.com/science/article/pii/S1470204517306824

Is it possible that in these large trials the language is a bad adaptation of the jargon of the regulatory agencies?

**f2harrell**#129

“The GOLD study did not meet its primary objective of showing a significant improvement in overall survival with olaparib in the overall or ATM-negative population of Asian patients with advanced gastric cancer.” is not bad but “significant” doesn’t help.

**albertoca**#130

I don’t know if with 2 more patients the conclusion would have been the opposite, look above:

Overall survival **did not differ** between treatment groups in the overall patient population (median overall survival 8·8 months [95% CI 7·4–9·6] in the olaparib group vs 6·9 months [6·3–7·9] in the placebo group; **HR 0·79 [97·5% CI 0·63–1·00]**; p=0·026) or in the ATM-negative population (12·0 months [7·8–18·1] vs 10·0 months [6·4–13·3]; 0·73 [0·40–1·34]; p=0·25).