Language for communicating frequentist results about treatment effects


Yuck. Not a fan. The only reason I’m not pushing back against it is that I haven’t seen it referenced that much yet, but maybe we need to get ahead of this before it gets too popular.



The MD MPH types like it, it seems. I guess they seek a solution but don’t want to read about or master something more sophisticated and time-consuming (forgive me for saying so). I was once asked to review a paper about the fragility index. Suffice it to say, the paper was rejected…



I once did a thorough search of the web looking for a more critical view of the fragility index, and came across your blog post. I believe it was the only piece I found, though…!



The father of clinical epidemiology, Alvan Feinstein, proposed a similar fragility approach 25 years ago. I sort of liked it at the time. But it was seldom used.



There is something offensive to me about contemplating: “What would the results be if we obtained data that we did not obtain?” Statisticians who treat data as sacrosanct ought to have a visceral reaction when non-biostatisticians wish to rejig data, treating a patient who is alive as dead and claiming that this is a small change (it is the largest change one could possibly make), and in doing so merely insinuating something that was already understood about the uncertainty of the results.



I mainly agree. I do feel that sensitivity analysis is occasionally warranted but I think that more often than not, sensitivity analysis creates doubt and opportunity for inserting opinions that are really motivated by confirmation bias, rather than sticking with a “best principled analysis”.



Here’s a beauty I encountered:

After correction for selection bias, the OR … (5·7, 95% CI 3·5–9·4) was similar to the original OR (5·7, 4·4–7·5). However, the CI for the corrected OR was wider than that for the original OR, suggesting a wider range of possible values for the true OR with 95% certainty.

So these researchers make a probabilistic statement about the parameter based on data alone. Remarkable, right? :grimacing::zipper_mouth_face:



I have to say the craze of putting CIs behind every number makes articles harder to read as well.

I would rather see authors

  • make all data available (no data means they might as well have invented everything. And yes, I know it was a lot of work to gather the data, but no one ever promised you a rose garden, OK? :grinning:)
  • put important findings in tables, and include standard errors. I am perfectly able to multiply these by some number (let’s say 1.96 :grin:) to come up with interesting CIs
  • just give the point estimate in the text. Yes, I know we have all sorts of variability and uncertainty. No need to rub that in with every single number.
  • and please stop suggesting that parameter values within a CI are more likely than those outside the CI.

Thanks, authors! :+1:



One minor thing: most compatibility intervals, when computed correctly, are asymmetric. So I don’t prefer the use of standard errors.
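
To make the asymmetry concrete, here is a minimal Python sketch. It assumes a Wald-type interval built on the log odds ratio scale and then exponentiated; the estimate and standard error are invented for illustration, and `or_interval` is just a hypothetical helper name.

```python
import math


def or_interval(log_or, se, z=1.96):
    """Wald interval computed on the log-odds scale, then exponentiated."""
    lower = math.exp(log_or - z * se)
    upper = math.exp(log_or + z * se)
    return lower, math.exp(log_or), upper


# Hypothetical inputs for illustration only: OR = 0.5, SE of log(OR) = 0.35
lower, estimate, upper = or_interval(math.log(0.5), 0.35)

print(f"OR {estimate:.2f}, 95% interval {lower:.2f} to {upper:.2f}")
print(f"distance below estimate: {estimate - lower:.2f}")
print(f"distance above estimate: {upper - estimate:.2f}")
```

On the log scale the interval is still symmetric around the estimate, so a standard error can reconstruct it, but only if the reader knows which scale it was computed on.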



I’m currently negotiating with some clinical collaborators about the wording to be included alongside some odds ratios and their associated significance tests. Two sections in particular have become sticking points, and I was hoping to get some feedback or suggestions. NB X and Y are different things in the two Sections.

Section 1


Nevertheless, there remained no statistically significant difference in the multivariable analysis, despite a trend for decreased Y when X was used (p = 0.06; OR 0.53 [95% CI 0.27 – 1.02])

My suggestion #1:

Multivariable analysis indicated that the use of X was associated with a lower rate of Y, although we could not rule out a small adverse association at the 5% significance level (p = 0.06; OR 0.53 [95% CI 0.27 – 1.02]).

My suggestion #2:

We observed a lower rate of Y in patients with X, but the multivariable analysis returned a wide plausible range for the magnitude of the association and we could not exclude the possibility that there is no association or a small adverse effect at the 5% significance level (p = 0.06; OR 0.53 [95% CI 0.27 – 1.02]).

Section 2:


We found no significant relationship between the addition of X and an increased incidence of Y (p = 0.4; OR 1.61 [95% CI 0.53 – 4.90]).

My suggestion #1:

Although the addition of X was associated with an increased incidence of Y in our data, we could not rule out the possibility of no difference, or a large protective effect (p = 0.4; OR 1.61 [95% CI 0.53 – 4.90]).

My suggestion #2:

Although there was a slightly higher rate of Y in patients who received X in our data, the wide 95% confidence interval indicates that we cannot draw a strong conclusion about the strength or direction of the true association (p = 0.4; OR 1.61 [95% CI 0.53 – 4.90]).

I’d be grateful for anyone’s thoughts.
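
As a side check on the numbers quoted in both sections, the sketch below backs out the standard error of the log odds ratio from each reported 95% interval and recomputes a two-sided p-value. It assumes the intervals are Wald intervals on the log scale with a normal reference distribution, which is a guess about how they were produced; `wald_p_from_ci` and `normal_cdf` are hypothetical helper names.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def wald_p_from_ci(or_hat, lower, upper, z=1.96):
    """Back out SE(log OR) from a 95% interval; return (se, two-sided p)."""
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    z_stat = math.log(or_hat) / se
    p = 2 * (1 - normal_cdf(abs(z_stat)))
    return se, p


# Reported values from the two sections (assumed to be Wald intervals)
reported = {
    "Section 1": (0.53, 0.27, 1.02),
    "Section 2": (1.61, 0.53, 4.90),
}

results = {}
for label, (or_hat, lower, upper) in reported.items():
    results[label] = wald_p_from_ci(or_hat, lower, upper)
    se, p = results[label]
    print(f"{label}: SE(log OR) = {se:.2f}, two-sided p = {p:.2f}")
```

Under that assumption the recomputed p-values agree with the reported 0.06 and 0.4 to the precision given, so the p-value adds nothing that the estimate and interval do not already carry.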



I really like your presenting alternate wordings here, and hope that others respond also. I like the last suggestion of each section. I do prefer compatibility interval to confidence interval, but you have to choose how many battles to fight in the war.



I have to say it, I really hate the expression “multivariable analysis”. They were saying multivariate but that was technically incorrect, and at some point they switched to “multivariable”. But every analysis is multivariable, thus it’s not a useful description, and it is better to specify the model used. Regarding the rest of it, I’d let the reader interpret the results rather than impose an interpretation according to the p-value criterion, to the extent that you can get away with it, which might not be much; you might say it was “suggestive of an association”. Unlike every other stato, I don’t care so much about the p-value strife.



I confess to being guilty of this. My question: would we even be able to say an N% bootstrap “confidence” interval has this intuitive interpretation? It is based on actual repeated sampling of the observed data, and via the ‘plug-in principle’ that justifies the bootstrap, I cannot see why the bootstrap interpretation would be an error.

Could someone clarify this distinction (if any) between the classically computed interval, and the bootstrap one?

Update: Relevant paper by Efron and DiCiccio on bootstrap intervals. They allude to problems with intervals derived from asymptotic theory that can be addressed using the bootstrap.

Equation 1.1 refers to:
$$\hat{\theta} \pm z^{(\alpha)}\hat{\sigma}$$

The trouble with standard intervals is that they are based on an asymptotic approximation that can be quite inaccurate in practice. The example below illustrates what every applied statistician knows, that (1.1) can considerably differ from exact intervals in those cases where exact intervals exist. Over the years statisticians have developed tricks for improving (1.1), involving bias-corrections and parameter transformations. The bootstrap confidence intervals that we will discuss here can be thought of as automatic algorithms for carrying out these improvements without human intervention.
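
For a feel of how the two constructions differ mechanically, here is a minimal Python sketch comparing the standard interval (1.1) with the simplest percentile bootstrap interval for a mean. The skewed sample, the seed, and the number of resamples are all made up for illustration, and the percentile method is only the most basic bootstrap interval, not necessarily the refined versions the paper goes on to discuss.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical skewed sample: 30 draws from an exponential distribution
sample = [random.expovariate(1.0) for _ in range(30)]
n = len(sample)

theta_hat = statistics.mean(sample)
sigma_hat = statistics.stdev(sample) / math.sqrt(n)  # estimated standard error

# Standard interval (1.1): theta_hat +/- z * sigma_hat
z = 1.96
standard = (theta_hat - z * sigma_hat, theta_hat + z * sigma_hat)

# Percentile bootstrap: resample with replacement, recompute the mean each time,
# then read off the 2.5th and 97.5th percentiles of the resampled means
B = 2000
boot_means = sorted(
    statistics.mean(random.choices(sample, k=n)) for _ in range(B)
)
percentile = (boot_means[B * 25 // 1000], boot_means[B * 975 // 1000 - 1])

print(f"estimate:             {theta_hat:.3f}")
print(f"standard interval:    {standard[0]:.3f} to {standard[1]:.3f}")
print(f"percentile bootstrap: {percentile[0]:.3f} to {percentile[1]:.3f}")
```

Both intervals are computed from the observed data alone; the bootstrap changes how the sampling distribution is approximated, not what kind of object the interval is.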

Update 2: Empirical evaluation of classic vs bootstrap intervals from statisticians at U.S. Forest Service

Scanning the data, the differences in coverage were really small, but classic intervals were generally better.

Related to the prediction interval in this thread:

I’ve been driving myself mad trying to figure out a mathematical justification for this result. The best I can come up with so far is:

Upper bound on 2 random 95% intervals is
0.95^2 = 0.9025

Lower bound for particular interval from planned experiment = CI*Power, where:
CI = 0.95, Power = 0.8

Probability of future interval covering parameter
0.95 * 0.8 = 0.76

Averaging UB and LB:

$$\frac{1}{2}(0.9025 + 0.76) = 0.83125$$

I tried looking up the paper, but do not have access to it at the moment. Has someone read it? Did they give a mathematical justification there? Thanks.
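
For comparison, the sketch below reproduces the averaging heuristic above and sets it beside one possible closed-form calculation. The closed form rests on assumptions that are mine, not taken from the paper: that the result in question is the chance that a 95% interval from one study contains the point estimate of an independent replication of the same size, with normal sampling and a known standard error. Under those assumptions the difference between the two estimates has variance twice that of a single estimate, so the chance is 2Φ(1.96/√2) − 1.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# Heuristic from the post: average of the proposed upper and lower bounds
upper_bound = 0.95 ** 2   # two independent 95% intervals both covering
lower_bound = 0.95 * 0.8  # coverage times power
heuristic = 0.5 * (upper_bound + lower_bound)

# Assumed setting (not from the post): the replication estimate minus the
# original estimate is normal with variance 2 * SE^2, so it falls inside
# the original interval when |Z| < 1.96 / sqrt(2)
z = 1.96
capture = 2 * normal_cdf(z / math.sqrt(2)) - 1

print(f"averaging heuristic:  {heuristic:.5f}")
print(f"closed-form capture:  {capture:.3f}")
```

If that reading of the result is right, the agreement with the averaging heuristic looks like a coincidence of the chosen power value rather than a derivation.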



I agree with Frank. I think either is better than the original, the second being most aligned with what the results really represent. I’ve been struggling with this myself as I slowly try to nudge the writers in my group in this direction. If you don’t mind, I’d sure like to borrow these as a template.



This is a fair comment, but just to clarify, in these particular examples the “multivariable” models were described in the sentences just prior to those I quoted above, e.g. by listing all of the variables that were adjusted for.

Thanks also @f2harrell and @Mike_Babyak for your input. There were many other places in the manuscript where I made minor wording changes along these lines, but these two got some push-back because I think they wanted to go down the $p > 0.05 \Rightarrow \text{no association}$ route. But I got an email this morning that they had agreed to go with the second suggestion in each case, which I’m happy with. I’ll be interested to see if any reviewers take issue with either of them.

Feel free to use them however you like, @Mike_Babyak, and good luck with your nudging :slight_smile:



But it doesn’t tell the reader anything; it could be Cox or GLM or many other things. It only irks me because people (i.e. epidemiologists) started adopting language to make basic calculations seem sophisticated, e.g. “power analysis!” and “effect modifier!”; I started hearing these some years ago. Here’s a paper showing differences between the language used by epidemiologists and biostatisticians: You say “to-ma-to,” I say “to-mah-to,” you say “covariate,” I say… I can’t help but think of Dawkins’s Law: “Obscurantism in an academic subject expands to fill the vacuum of its intrinsic simplicity.”



I see what you’re saying. In my experience (mostly biostats), the type of model has already been described in some level of detail in the statistical methods section, and then you can refer to it as “the [multivariable] model” in the results section. I agree that the level of detail is almost never enough to reproduce the analysis without separately contacting the authors, and the differing terminology does not make things any easier. Hopefully the push for publishing reproducible analysis scripts alongside papers will help clear up that issue.



But you can’t claim a hankering for brevity when the specific model (Cox, GLM, GEE, AFT, whatever) is less of a mouthful than “multivariable regression”; or just say “the model”. Forgive me, but it is people merely trying to make basic regression sound thorough and cogent, and I feel they are making our profession seem a little foolish and inexact.



My impression has always been that “multivariable” refers to models with more than one predictor, and that it came into use because people were incorrectly referring to models with only one response variable as “multivariate.”



Yes, as I noted above, and I think this reveals the exact motivation, i.e. it’s not an attempt to be informative or accurate, that’s beside the point; it’s just an eagerness to use highfalutin words. Ironically (and tellingly), the word is used in epidemiology, where no one ever fits univariate models, instead of in clinical trials, where, as Frank Harrell pointed out on Twitter, univariate analysis is unfortunately common.