Language for communicating frequentist results about treatment effects

I see what you’re saying. In my experience (mostly biostats), the type of model has already been described in some detail in the statistical methods section, and then you can refer to it as “the [multivariable] model” in the results section. I agree that the level of detail is almost never enough to reproduce the analysis without contacting the authors separately, and the varying terminology doesn’t make things any easier. Hopefully the push for publishing reproducible analysis scripts alongside papers will help clear up that issue.

But you can’t claim a hankering for brevity when the specific model (Cox, GLM, GEE, AFT, whatever) is less of a mouthful than “multivariable regression”; or just say ‘the model’. Forgive me, but it is people merely trying to make basic regression sound thorough and cogent, and I feel they are making our profession seem a little foolish and inexact.

My impression has always been that “multivariable” refers to models with more than one predictor, and that it came into use because people were incorrectly referring to models with only one response variable as “multivariate.”

Yes, as I noted above, and I think this reveals the exact motivation, i.e., it’s not an attempt to be informative or accurate (that’s beside the point); it’s just an eagerness to use highfalutin words. Ironically (and tellingly), the word is used in epidemiology, where no one ever fits univariate models, rather than in clinical trials where, as Frank Harrell pointed out on Twitter, univariate analysis is unfortunately common.

Except for misguided variable selection :frowning:

Very nice and useful!

This proposal to transform p values into logs is nearly identical to Fisher’s procedure to combine significance tests in a meta-analysis.

(The Wikipedia link uses natural logs, but Hedges and Olkin have it expressed in logs without necessarily clarifying if it is base 2 or not).
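For concreteness, here is a minimal sketch of Fisher’s combining procedure using natural logs. Under the global null, -2 \Sigma \ln p_i is referred to a chi-square distribution with 2k degrees of freedom. The function name and the example p values are illustrative assumptions, not taken from any study.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent p values with Fisher's method.

    Returns (statistic, combined_p). Under the global null,
    statistic = -2 * sum(ln p_i) follows a chi-square with 2k df.
    """
    k = len(p_values)
    if k == 0 or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("need at least one p value, each in (0, 1]")
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    # For even degrees of freedom (2k) the chi-square upper tail has a
    # closed form: exp(-x/2) * sum_{j<k} (x/2)^j / j!
    half = statistic / 2.0
    tail = math.exp(-half) * sum(half**j / math.factorial(j) for j in range(k))
    return statistic, min(1.0, tail)

# Hypothetical p values, for illustration only.
stat, p = fisher_combined_p([0.10, 0.08, 0.12])
print(f"X^2 = {stat:.2f} on 6 df, combined p = {p:.4f}")
```

With a single p value the procedure returns that p value unchanged, which is a quick sanity check on the transform.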

The team of Elena Kulinskaya, Stephan Morgenthaler, and Robert G. Staudte published a book on meta-analysis in 2008 that takes the approach of defining “evidence” as the inverse normal transform of the p value, \Phi^{-1}(1-p). Their techniques look very similar to a weighted Stouffer combined test, which is a weighted combination of Z scores scaled to keep unit variance:

T_{1..k} = \frac{\sqrt{n_1} S_1 + \dots + \sqrt{n_k} S_k}{\sqrt{n_1 + \dots + n_k}}, where S_i is a variance stabilizing transform of the statistic used to calculate the p value in study i.

Instead of using the statistic to “test” hypotheses, they point out that this transform can be used to quantify “evidence” of the alternative (without specifying any particular effect). In this framework, evidence is always a random quantity, with a standard error of \pm{1}.
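A minimal sketch of the weighted combination, assuming one-sided p values from independent studies and the standard Stouffer normalization (dividing by the square root of the summed squared weights, which is \sqrt{\Sigma n_i} for \sqrt{n} weights) so that the combined value keeps a standard error of 1. The function names and the study numbers are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def evidence_from_p(p_one_sided):
    """Inverse normal transform: T = Phi^{-1}(1 - p)."""
    return _STD_NORMAL.inv_cdf(1.0 - p_one_sided)

def weighted_stouffer(p_values, sample_sizes):
    """Combine one-sided p values with sqrt(n) weights.

    T = sum(sqrt(n_i) * T_i) / sqrt(sum(n_i)), so T has unit standard
    error whenever the T_i are independent with unit standard error.
    """
    if len(p_values) != len(sample_sizes) or not p_values:
        raise ValueError("need equally long, non-empty inputs")
    numerator = sum(math.sqrt(n) * evidence_from_p(p)
                    for p, n in zip(p_values, sample_sizes))
    return numerator / math.sqrt(sum(sample_sizes))

# Hypothetical studies (p values and sizes are made up for illustration).
t_combined = weighted_stouffer([0.20, 0.07, 0.11], [40, 120, 80])
print(f"combined evidence T = {t_combined:.2f}")
```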

‘Weak’ evidence for the (unspecified) alternative is T = \pm 1.64, which corresponds to p = 0.05,
‘Moderate’ evidence is T = \pm 3.3, which corresponds to p \approx 0.0005,
‘Strong’ evidence is T = \pm 5, which corresponds to a p value somewhere near 2.87 * 10^{-7}.
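The correspondence between evidence values and one-sided p values can be checked directly from the standard normal upper tail. This is just a small sketch; the helper name is my own.

```python
import math

def upper_tail_p(t):
    """One-sided p value for a standard normal evidence value t."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

for label, t in [("weak", 1.64), ("moderate", 3.3), ("strong", 5.0)]:
    print(f"{label:>8}: T = {t:<4} -> one-sided p = {upper_tail_p(t):.3g}")
```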

In favor of this transform, I notice that it avoids some common p value errors and has other uses:

  1. It avoids the “absence of evidence is evidence of absence” fallacy.
  2. It can demonstrate Simpson’s paradox in relation to p values (i.e., a number of ‘not significant’ results can in aggregate provide evidence for an effect, and vice versa).
  3. It can be useful in meta-analysis.

While it doesn’t solve all of the problems with P values, it seems to alleviate some of the most troubling ones.
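To illustrate how several ‘not significant’ results can add up, here is a small sketch using the unweighted Stouffer combination on three made-up one-sided p values of 0.10. The numbers are hypothetical and the studies are assumed independent.

```python
import math
from statistics import NormalDist

def combined_evidence(p_values):
    """Unweighted Stouffer combination of one-sided p values.

    Returns (T, one-sided p for T).
    """
    std_normal = NormalDist()
    z_scores = [std_normal.inv_cdf(1.0 - p) for p in p_values]
    t = sum(z_scores) / math.sqrt(len(z_scores))
    return t, 1.0 - std_normal.cdf(t)

# Three hypothetical studies, each 'not significant' at 0.05 on its own.
t, p = combined_evidence([0.10, 0.10, 0.10])
print(f"combined T = {t:.2f}, combined one-sided p = {p:.3f}")
```

Each study alone carries T of about 1.28, short of the ‘weak’ mark of 1.64, yet none of them is zero evidence, and together they clear that mark.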

My confusion stems from the caution expressed in Hedges and Olkin Statistical Methods for Meta-Analysis, where they state:

…an investigator may incorrectly conclude that because H_{0} is rejected, the treatment effects are greater than zero … Alternatively, an investigator might incorrectly conclude that the treatment showed consistent effect across studies…

It would seem that the authors above are making interpretations that are warned against by a widely cited text. Or are they both correct, in their own way?

Google preview of book:

Is the weighted z-test the best method for combining probabilities from independent tests? (Journal of Evolutionary Biology)

Theory has proven that no method is uniformly most powerful (Birnbaum, 1954). However, it is possible that under a certain situation, one particular method may outperform others. For example, through simulation Whitlock (Whitlock, 2005) has shown that when all the studies have the same effect sizes, the weighted z-test outperforms both regular z-test and Fisher method. In this paper, we use simulation to show that under the same situation, the generalized Fisher method is more powerful than the weighted z-test.

Long story short: the jury is still out, and statisticians are still rising to the challenge of rainy afternoons by running simulation studies.


Thanks. I do know that each of these p value combination methods has a scenario where it works best.

I was looking for a transform that could be easily taught that prevents the lengthy list of cognitive errors statisticians have been complaining about for decades.

The Kulinskaya, Morgenthaler, and Staudte book persuaded me that the modified z score test, with weights based on \surd{n} from each study, is a very reasonable alternative scale that will prevent the “absence of evidence” fallacy, and can also show how a number of “insignificant” studies can, in aggregate, demonstrate an effect.

I went back and used the original Stouffer procedure on the P values in the Greenland, Amrhein, and McShane article in Nature. Using the evidential interpretation, there was moderate evidence of a positive association even without the sample sizes.

Now I’d like to get back to single study interpretations, the original point of this discussion topic. Thanks.

A variant may be to conclude that the study did not achieve its primary aim, so “no effect”. For example, the GOLD study. What do you think about the interpretation of the result in the abstract?
Is it possible that in these large trials the language is a bad adaptation of the jargon of the regulatory agencies?

“The GOLD study did not meet its primary objective of showing a significant improvement in overall survival with olaparib in the overall or ATM-negative population of Asian patients with advanced gastric cancer.” is not bad but “significant” doesn’t help.

I don’t know whether, with 2 more patients, the conclusion would have been the opposite. Look at the reported results:
Overall survival did not differ between treatment groups in the overall patient population (median overall survival 8·8 months [95% CI 7·4–9·6] in the olaparib group vs 6·9 months [6·3–7·9] in the placebo group; HR 0·79 [97·5% CI 0·63–1·00]; p=0·026) or in the ATM-negative population (12·0 months [7·8–18·1] vs 10·0 months [6·4–13·3]; 0·73 [0·40–1·34]; p=0·25).
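As a rough back-calculation (my own sketch, assuming log(HR) is approximately normal and ignoring rounding in the published figures), the z statistic and two-sided p value for the overall population can be recovered from the hazard ratio and the upper limit of its 97·5% interval. The function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def z_from_hr_upper_limit(hr, upper, conf_level):
    """Approximate z statistic from a hazard ratio and its upper CI limit.

    Assumes log(HR) is roughly normal; rounding in the published
    numbers makes the result approximate. Returns (z, SE of log HR).
    """
    z_crit = STD_NORMAL.inv_cdf(1.0 - (1.0 - conf_level) / 2.0)
    se_log_hr = (math.log(upper) - math.log(hr)) / z_crit
    return math.log(hr) / se_log_hr, se_log_hr

def two_sided_p(z):
    """Two-sided normal p value for a z statistic."""
    return 2.0 * STD_NORMAL.cdf(-abs(z))

# Overall population in the quoted abstract: HR 0.79, 97.5% CI upper limit 1.00.
z, se = z_from_hr_upper_limit(0.79, 1.00, 0.975)
print(f"z = {z:.2f}, SE(log HR) = {se:.3f}, two-sided p = {two_sided_p(z):.3f}")
```

Because the rounded upper limit is 1·00, the back-calculated p value sits right at 0·025, the threshold a 97·5% interval implies, which fits with the reported p=0·026 only just missing it.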

I’ve been following this discussion & it reminded me that 3 years ago I wrote a draft of a paper extending the concept of “fragility” to the derivation of diagnostic algorithms based on statistical metric thresholds. I wasn’t convinced of the value (or indeed correct interpretation) of what I’d done, so I’ve not tried to polish or publish it. Nevertheless, the concept of just how susceptible an algorithm is to the vagaries of chance recruitment of a small number of patients is something that is worth thinking about.
I’ve put the draft I wrote as a pre-print on ResearchGate:

That’s horrible!
I guess there is a discussion to be had about whether the survival benefit is worth it (it’s small, drug probably expensive, side effects) but just dismissing this as “no difference” seems to me not to be serving patients well.

Unfortunately, statisticians have allowed and even promoted the use of the p-value threshold as the exclusive determinant of intervention effectiveness. People that rely on these determinations are rightly concerned about results near the threshold, where the determination might have swung in the opposite direction if only one or two study participants had experienced a different outcome. That we now have the “fragility index” is a consequence of the unfortunate dominance of this approach to inference. I agree with @f2harrell that it’s a band-aid.
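For readers who haven’t met it, the fragility index can be sketched as follows: keep switching one participant at a time from non-event to event in the arm with the lower event rate until a two-sided Fisher exact test is no longer below the threshold. This is a minimal illustration under that common definition; the function names and the trial counts are hypothetical.

```python
from math import comb

def fisher_exact_p(events_a, n_a, events_b, n_b):
    """Two-sided Fisher exact p value for a 2x2 table of events by arm."""
    total_events = events_a + events_b
    denom = comb(n_a + n_b, total_events)
    lo = max(0, total_events - n_b)
    hi = min(n_a, total_events)
    probs = {x: comb(n_a, x) * comb(n_b, total_events - x) / denom
             for x in range(lo, hi + 1)}
    observed = probs[events_a]
    # Sum all tables that are as probable as, or less probable than,
    # the observed one (small tolerance for floating point ties).
    return min(1.0, sum(p for p in probs.values() if p <= observed * (1 + 1e-9)))

def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Count non-event -> event switches in the lower-event-rate arm
    needed to bring the Fisher exact p value to alpha or above.

    Returns 0 if the result is not below alpha to begin with.
    """
    flips = 0
    lower_rate_is_a = events_a * n_b <= events_b * n_a
    while fisher_exact_p(events_a, n_a, events_b, n_b) < alpha:
        if lower_rate_is_a and events_a < n_a:
            events_a += 1
        elif not lower_rate_is_a and events_b < n_b:
            events_b += 1
        else:
            break  # nothing left to switch in that arm
        flips += 1
    return flips

# Hypothetical trial: 1/100 events vs 10/100 events.
print(fragility_index(1, 100, 10, 100))
```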

I wasn’t sure where to put a reference to this interesting paper by Michael Lew on the relationship between (one tailed) P values and likelihood. It discusses some of the complexities of when p values have to be adjusted (i.e., when used in the Neyman-Pearson decision framework, not in Fisher’s). This seemed to be the best thread.

I learned about it in this thread on Deborah Mayo’s post criticizing Royall’s philosophy of statistical evidence:

Anyone who attempts to counter Richard Royall must come well armed …

I found Royall’s evidential perspective via likelihood very helpful. Michael Lew does as well.

Mayo’s critique is that error control is missing from the analysis, and we need some design info for that. But it seems that error control is a function of the experimental design (before data point of view), where we fix the effect to a specific value. It does not seem relevant after data have been collected, as long as we assume the experiment was minimally informative (1-\beta > \alpha).

A post data view is interested in how the observed data supports a range of effects, above and below the value used in the design phase. This would seem to be related to your post on Bayesian power.

Perhaps it would be better to look at inference as precise estimation, rather than Mayo’s concept of “severe tests.”

I just thought Lew’s paper was a useful link between what people currently use and a better measure (likelihoods), and I hoped to get some expert input on that.

Exactly, and remember that she (and many others) uses the word “error” in a problematic way IMHO. She is speaking of the probability of making an assertion of an effect when the effect is exactly zero. In my view a true error probability is P(effect <= 0 | data) when you act from the data as if effect > 0.
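To make P(effect <= 0 | data) concrete, here is a minimal sketch under assumptions that are mine rather than anything stated above: a normal likelihood for the effect estimate and a normal prior, so the posterior is normal too. The function name, the default prior, and the numbers are hypothetical.

```python
from statistics import NormalDist

def prob_effect_not_positive(estimate, std_error, prior_mean=0.0, prior_sd=1e6):
    """P(effect <= 0 | data) under a normal likelihood and normal prior."""
    prior_prec = 1.0 / prior_sd**2
    data_prec = 1.0 / std_error**2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * estimate)
    return NormalDist(post_mean, post_var**0.5).cdf(0.0)

# Hypothetical estimate and standard error, for illustration only.
print(prob_effect_not_positive(0.30, 0.15))                 # near-flat prior
print(prob_effect_not_positive(0.30, 0.15, prior_sd=0.20))  # skeptical prior
```

With the near-flat prior the answer matches the one-sided p value numerically; a skeptical prior centered on zero pulls the posterior toward zero and raises the probability that acting as if effect > 0 is a mistake.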