Language for communicating frequentist results about treatment effects

I confess to being guilty of this. My question: would we even be able to say an N% bootstrap “confidence” interval has this intuitive interpretation? It is based on actual repeated sampling of the observed data, and via the ‘plug-in principle’ that justifies the bootstrap, I cannot see why the bootstrap interpretation would be an error.

Could someone clarify this distinction (if any) between the classically computed interval, and the bootstrap one?
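To make the comparison concrete, here is a minimal stdlib-only sketch (my own, not from any referenced paper) of the two intervals in question: a classic normal-approximation interval and a percentile bootstrap interval for the mean of a skewed sample.

```python
# Sketch: classic normal-approximation interval vs. percentile bootstrap
# interval for the mean of a skewed (exponential) sample.
import random
import statistics
from statistics import NormalDist

random.seed(1)
data = [random.expovariate(1.0) for _ in range(50)]  # skewed sample, true mean = 1
n = len(data)
xbar = statistics.mean(data)
se = statistics.stdev(data) / n ** 0.5
z = NormalDist().inv_cdf(0.975)

# The classically computed ("standard") interval: estimate +/- z * SE
classic = (xbar - z * se, xbar + z * se)

# Percentile bootstrap: resample the observed data with replacement,
# per the plug-in principle (treat the sample as the population).
B = 2000
boot_means = sorted(statistics.mean(random.choices(data, k=n)) for _ in range(B))
bootstrap = (boot_means[int(0.025 * B)], boot_means[int(0.975 * B) - 1])

print(classic)
print(bootstrap)
```

Both intervals are constructed from the same data; the repeated-sampling interpretation attaches to the procedure in either case.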

Update: Relevant paper by Efron and DiCiccio on bootstrap intervals. They allude to problems with intervals derived from asymptotic theory that can be addressed using the bootstrap.

Equation 1.1 refers to:
\hat{\theta} \pm z^{(\alpha)}\hat{\sigma}

The trouble with standard intervals is that they are based on an asymptotic approximation that can be quite inaccurate in practice. The example below illustrates what every applied statistician knows, that (1.1) can considerably differ from exact intervals in those cases where exact intervals exist. Over the years statisticians have developed tricks for improving (1.1), involving bias-corrections and parameter transformations. The bootstrap confidence intervals that we will discuss here can be thought of as automatic algorithms for carrying out these improvements without human intervention.
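Not from the paper, but a tiny stdlib-only illustration of the kind of inaccuracy described above: the standard form (1.1) applied to a binomial proportion (the Wald interval) can produce impossible limits in small samples.

```python
# Sketch: the Wald (normal-approximation) interval for a binomial
# proportion, i.e. form (1.1) with p-hat and its estimated SE.
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)

def wald_interval(successes, n):
    p = successes / n
    se = (p * (1 - p) / n) ** 0.5
    return p - z * se, p + z * se

# 1 success out of 20: the lower limit falls below zero,
# which is impossible for a probability.
lo, hi = wald_interval(1, 20)
print(lo, hi)
```

This is exactly the kind of case where exact or transformed intervals (or the bootstrap, as the quote suggests) improve on (1.1).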

Update 2: Empirical evaluation of classic vs bootstrap intervals from statisticians at U.S. Forest Service

Scanning the data, I found the differences in coverage to be quite small, though the classic intervals were generally better.

Related to the prediction interval in this thread:

I’ve been driving myself mad trying to figure out a mathematical justification for this result. The best I can come up with so far is:

Upper bound for two independent random 95% intervals:
0.95^2 = 0.9025

Lower bound for a particular interval from a planned experiment = CI \times power, where CI = 0.95 and power = 0.8, giving the probability of the future interval covering the parameter:
0.95 \times 0.8 = 0.76

Averaging the upper and lower bounds:

\frac{1}{2}(0.9025 + 0.76) = 0.83125
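The arithmetic can be checked in a few lines (a sanity check of the numbers only, not a justification of the argument):

```python
# Checking the back-of-the-envelope calculation above.
upper = 0.95 ** 2        # two independent random 95% intervals
lower = 0.95 * 0.8       # CI coverage times power
average = (upper + lower) / 2
print(round(upper, 4), round(lower, 4), round(average, 5))
```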

I tried looking up the paper, but do not have access to it at the moment. Has someone read it? Did they give a mathematical justification there? Thanks.


I agree with Frank. I think either is better than the original, the second being most aligned with what the results really represent. I’ve been struggling with this myself as I slowly try to nudge the writers in my group in this direction. If you don’t mind, I’d sure like to borrow these as a template.


This is a fair comment, but just to clarify in these particular examples the “multivariable” models were mentioned in the sentences just prior to those I quoted above, e.g. by listing all of the variables that were adjusted for.

Thanks also @f2harrell and @Mike_Babyak for your input. There were many other places in the manuscript where I made minor wording changes along these lines, but these two got some push-back because I think they wanted to go down the p > 0.05 \Rightarrow \text{no association} route. But I got an email this morning that they had agreed to go with the second suggestion in each case, which I’m happy with. I’ll be interested to see if any reviewers take issue with either of them.

Feel free to use them however you like, @Mike_Babyak, and good luck with your nudging :slight_smile:


but it doesn’t tell the reader anything; it could be Cox or GLM or many other things. It only irks me because people (i.e. epidemiologists) started adopting language to make basic calculations seem sophisticated, e.g. “power analysis!” and “effect modifier!” I started hearing these some years ago. Here’s a paper showing differences between the language used by epidemiologists and biostatisticians: “You say ‘to-ma-to,’ I say ‘to-mah-to,’ you say ‘covariate,’ I say…” I can’t help but think of Dawkins’s Law: “Obscurantism in an academic subject expands to fill the vacuum of its intrinsic simplicity.”

I see what you’re saying. In my experience—mostly biostats—the type of model has already been described in some level of detail in the statistical methods section, and then you can refer to it as “the [multivariable] model” in the results section. I agree that the level of detail is almost never enough to reproduce the analysis without separately contacting the authors, and the differing terminology doesn’t make things any easier. Hopefully the push for publishing reproducible analysis scripts alongside papers will help clear up that issue.

but you can’t claim a hankering for brevity when the specific model (Cox, GLM, GEE, AFT, whatever) is less of a mouthful than “multivariable regression”; or just say ‘the model’. Forgive me, but it is people merely trying to make basic regression sound thorough and cogent, and I feel they are making our profession seem a little foolish and inexact.

My impression has always been that “multivariable” refers to models with more than one predictor in a model, and that it came into use because people were incorrectly referring to models with only one response variable as “multivariate.”


yes, as I noted above, and I think this reveals the exact motivation: it’s not an attempt to be informative or accurate, that’s beside the point; it’s just an eagerness to use highfalutin words. Ironically (and tellingly), the word is used in epidemiology, where no one ever fits univariate models, rather than in clinical trials, where, as Frank Harrell pointed out on Twitter, univariate analysis is unfortunately common.

Except for misguided variable selection :frowning:


Very nice and useful!


This proposal to transform p values into logs is nearly identical to Fisher’s procedure to combine significance tests in a meta-analysis.

(The Wikipedia link uses natural logs; Hedges and Olkin express it in logs without clarifying the base.)
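For reference, a minimal sketch of Fisher’s procedure: under the null, X = -2 \sum \ln p_i follows a chi-squared distribution with 2k degrees of freedom, and because the degrees of freedom are even the tail probability has a closed form. The example p values are made up, purely to show aggregation of individually “not significant” results.

```python
# Sketch of Fisher's method for combining k independent p values.
import math

def fisher_combine(pvalues):
    k = len(pvalues)
    x = -2.0 * sum(math.log(p) for p in pvalues)
    # Chi-squared survival function for even df = 2k (closed form):
    # P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    half = x / 2.0
    tail = math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(k))
    return x, tail

# Three studies, none individually "significant" at 0.05:
x, p_combined = fisher_combine([0.08, 0.10, 0.12])
print(x, p_combined)  # combined p is below 0.05
```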

The team of Elena Kulinskaya, Stephan Morgenthaler, and Robert G. Staudte published a book in 2008 on Meta-analysis that takes the approach of defining “evidence” as the inverse normal transform of the p value \Phi^{-1}(1-p). Their techniques look very similar to a weighted Stouffer combined test which is an average of Z scores:

T_{1..k} = \frac{\sqrt{n_1}\,S_1 + \cdots + \sqrt{n_k}\,S_k}{\sqrt{n_1 + \cdots + n_k}}, where S_i is a variance-stabilizing transform of the statistic used to calculate the i-th p value.

Instead of using the statistic to “test” hypotheses, they point out that this transform can be used to quantify “evidence” of the alternative (without specifying any particular effect). In this framework, evidence is always a random quantity, with a standard error of 1.

‘Weak’ evidence for an (unspecified) alternative is T = \pm 1.64, corresponding to p = 0.05;
‘Moderate’ evidence is T = \pm 3.3, corresponding to p = 0.0004;
‘Strong’ evidence is T = \pm 5, corresponding to a p value near 2.86 \times 10^{-7}.
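A minimal sketch of the evidence transform and the weighted combination, assuming the weighted Stouffer form with weights \sqrt{n_i} (the example p values and sample sizes below are hypothetical):

```python
# Sketch: inverse-normal "evidence" transform T = Phi^{-1}(1 - p),
# combined across studies with weights sqrt(n_i) (weighted Stouffer).
import math
from statistics import NormalDist

nd = NormalDist()

def evidence(p):
    """Map a one-sided p value to the evidence scale; p = 0.05 -> ~1.64."""
    return nd.inv_cdf(1.0 - p)

def combined_evidence(pvalues, sizes):
    num = sum(math.sqrt(n) * evidence(p) for p, n in zip(pvalues, sizes))
    return num / math.sqrt(sum(sizes))

print(round(evidence(0.05), 2))  # 'weak' evidence threshold, ~1.64

# Three individually 'not significant' studies of equal size aggregate
# to evidence beyond the 'weak' threshold:
T = combined_evidence([0.08, 0.10, 0.12], [100, 100, 100])
print(round(T, 2))
```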

In favor of this transform, I notice that some common p value errors can be avoided, and that it has other advantages:

  1. It blocks the “absence of evidence is evidence of absence” fallacy, since evidence for the alternative is quantified on a continuous scale.
  2. It can demonstrate a Simpson’s-paradox-like behaviour of p values (i.e. a number of ‘not significant’ results can in aggregate provide evidence for an effect, and vice versa).
  3. It is useful in meta-analysis.

While it doesn’t solve all of the problems with P values, it seems to alleviate some of the most troubling ones.

My confusion stems from the caution expressed in Hedges and Olkin Statistical Methods for Meta-Analysis, where they state:

…an investigator may incorrectly conclude that because H_{0} is rejected, the treatment effects are greater than zero … Alternatively, an investigator might incorrectly conclude that the treatment showed consistent effect across studies…

It would seem that the authors above are making interpretations that are warned against by a widely cited text. Or are they both correct, in their own way?

Google preview of book:

From the Journal of Evolutionary Biology:



Is the weighted z-test the best method for combining probabilities from independent tests?


Theory has proven that no method is uniformly most powerful (Birnbaum, 1954). However, it is possible that under a certain situation, one particular method may outperform others. For example, through simulation Whitlock (Whitlock, 2005) has shown that when all the studies have the same effect sizes, the weighted z-test outperforms both regular z-test and Fisher method. In this paper, we use simulation to show that under the same situation, the generalized Fisher method is more powerful than the weighted z-test.

Long story short: the jury is still out, and statisticians are still rising to the challenge of rainy afternoons by running simulation studies.


Thanks. I do know that each of these p value combination methods has a scenario where it works best.

I was looking for a transform that could be easily taught that prevents the lengthy list of cognitive errors statisticians have been complaining about for decades.

The Kulinskaya, Morgenthaler, and Staudte book persuaded me that the modified z score test, with weights based on \sqrt{n} from each study, is a very reasonable alternative scale that will prevent the “absence of evidence” fallacy, and can also show how a number of “insignificant” studies can, in aggregate, demonstrate an effect.

I went back and used the original Stouffer procedure on the p values in the Greenland, Amrhein, and McShane article in Nature. Using the evidential interpretation, there was moderate evidence of a positive association even without the sample sizes.

Now I’d like to get back to single study interpretations, the original point of this discussion topic. Thanks.

A variant is to conclude that, because the study did not achieve its primary aim, there was “no effect”. For example, the GOLD study: what do you think about the interpretation of the result in the abstract?
Is it possible that in these large trials the language is a bad adaptation of the jargon of the regulatory agencies?

“The GOLD study did not meet its primary objective of showing a significant improvement in overall survival with olaparib in the overall or ATM-negative population of Asian patients with advanced gastric cancer.” is not bad but “significant” doesn’t help.

I don’t know whether, with two more patients, the conclusion would have been the opposite:
Overall survival did not differ between treatment groups in the overall patient population (median overall survival 8·8 months [95% CI 7·4–9·6] in the olaparib group vs 6·9 months [6·3–7·9] in the placebo group; HR 0·79 [97·5% CI 0·63–1·00]; p=0·026) or in the ATM-negative population (12·0 months [7·8–18·1] vs 10·0 months [6·4–13·3]; 0·73 [0·40–1·34]; p=0·25).

I’ve been following this discussion & it reminded me that 3 years ago I wrote a draft of a paper extending the concept of “fragility” to the derivation of diagnostic algorithms based on statistical metric thresholds. I wasn’t convinced of the value (or indeed correct interpretation) of what I’d done, so I’ve not tried to polish or publish it. Nevertheless, the concept of just how susceptible an algorithm is to the vagaries of chance recruitment of a small number of patients is something that is worth thinking about.
I’ve put the draft I wrote as a pre-print on ResearchGate:

That’s horrible!
I guess there is a discussion to be had about whether the survival benefit is worth it (it’s small, drug probably expensive, side effects) but just dismissing this as “no difference” seems to me not to be serving patients well.

Unfortunately, statisticians have allowed and even promoted the use of the p-value threshold as the exclusive determinant of intervention effectiveness. People that rely on these determinations are rightly concerned about results near the threshold, where the determination might have swung in the opposite direction if only one or two study participants had experienced a different outcome. That we now have the “fragility index” is a consequence of the unfortunate dominance of this approach to inference. I agree with @f2harrell that it’s a band-aid.