Language for communicating frequentist results about treatment effects



Is this expressing a belief that most hypothesised causal effects cannot be exactly zero?

I don’t think it’s necessary to talk about causality, but I do agree that associations (in particular in the life sciences) are unlikely to be zero. That is very important, because it would mean that the focus on the presence or absence of an association is mistaken. Rather, we should be talking about the direction and magnitude, as Gelman suggests with his type S and M errors.

This topic is getting very long, however, so I might raise it as a new topic to see if anyone has good arguments against it.

Great idea!


I agree, but this was not my point: Any compatibility interval covering the null value (and hence p > 0.05), and any interval not covering the null value, “sheds some light on the magnitude of the effect.” An interval covering the null only shows that the null hypothesis is one of many different hypotheses that are compatible with the data, given the model.

Indeed, I would say that any compatibility interval that is not covering the entire range of possible values “sheds some light on the magnitude of the effect.” How useful the obtained information is depends, among other things, on where and how wide the interval is relative to the range of effect sizes we would judge interesting or important.
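To make the point about reading an interval against the range of interesting effects concrete, here is a minimal sketch. The function name `describe_interval` and the `important` threshold argument are hypothetical, chosen only for illustration. The numbers come from the blood pressure example raised in this thread (interval -13 to 5, with 15 mmHg taken as the smallest important difference). It assumes effects are measured relative to a null value of zero.

```python
def describe_interval(lower, upper, important):
    """Describe where a compatibility interval sits relative to the null (0)
    and to a threshold for effects judged important (a hypothetical input)."""
    notes = []
    if lower <= 0 <= upper:
        notes.append("the null is one of many hypotheses compatible with the data")
    else:
        notes.append("the null is not among the compatible hypotheses")

    # Largest and smallest effect magnitudes inside the interval
    largest = max(abs(lower), abs(upper))
    smallest = 0.0 if lower <= 0 <= upper else min(abs(lower), abs(upper))

    if largest < important:
        notes.append("all compatible effects are smaller than the threshold")
    elif smallest >= important:
        notes.append("all compatible effects are at least as large as the threshold")
    else:
        notes.append("compatible effects fall on both sides of the threshold")
    return notes


for note in describe_interval(-13, 5, important=15):
    print(note)
```

Both statements are conditional on the model behind the interval, and the threshold is a judgment call that readers may set differently.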


Not assuming a zero effect would probably change the language used for the results, at least sometimes. And probably inferences sometimes too. I’ll set up a ‘zero effect’ topic soon.

But I think mentioning causality would also be useful. Most research questions are causal, including any that assess ‘treatment effects’, because ‘effect’ implies a causal mechanism. And people seem to understand the world in terms of cause and effect. For example, perhaps the only way we can understand a non-causal association between variables in a study is by thinking of other causes that could produce the association, either through confounding or collider bias.

And the word ‘direction’ in Gelman’s suggestion likewise would apply to a causal question, because associations have no direction … actually, I’ve not interpreted the word ‘direction’ accurately. The word Gelman used was ‘sign’, which is probably what you meant by ‘direction’. He does use the term ‘direction’ in his article with Carlin (DOI: 10.1177/1745691614551642), but he means up or down (positive or negative). On the other hand, Gelman and Carlin do talk about ‘direction of effects’, suggesting they are, at least, thinking of causes.
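For readers who have not met type S (sign) and type M (magnitude) errors, the following is a rough sketch of how they can be computed for an assumed true effect and standard error. It is not Gelman and Carlin's own code: it uses a normal approximation rather than a t distribution, the function name `sign_and_magnitude_errors` is made up for this post, and the inputs (true effect 2, standard error 8.1) are hypothetical values for illustration only.

```python
import random
from statistics import NormalDist


def sign_and_magnitude_errors(true_effect, se, alpha=0.05, n_sims=100_000, seed=1):
    """Return (power, type S rate, exaggeration ratio) under a normal model.

    Assumes true_effect > 0 and an estimate distributed N(true_effect, se^2).
    """
    std = NormalDist()
    z = std.inv_cdf(1 - alpha / 2)

    # Probability of a 'significant' estimate with the correct / wrong sign
    p_correct_sign = 1 - std.cdf(z - true_effect / se)
    p_wrong_sign = std.cdf(-z - true_effect / se)
    power = p_correct_sign + p_wrong_sign
    type_s = p_wrong_sign / power

    # Type M: average magnitude of significant estimates relative to the truth,
    # approximated by simulation
    rng = random.Random(seed)
    draws = (rng.gauss(true_effect, se) for _ in range(n_sims))
    significant = [abs(d) for d in draws if abs(d) > z * se]
    type_m = (sum(significant) / len(significant)) / true_effect
    return power, type_s, type_m


power, type_s, type_m = sign_and_magnitude_errors(true_effect=2.0, se=8.1)
print(f"power={power:.3f}  type S={type_s:.3f}  exaggeration ratio={type_m:.1f}")
```

With a small true effect and a noisy estimate, the significant results are the ones most likely to have the wrong sign or a badly exaggerated magnitude, which is why sign and magnitude deserve the attention rather than presence versus absence.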


Yep, that’s what I meant.


Those are good suggestions.


Fascinating discussion and a well-spent hour to start the morning. It made me wonder if it would be at all reasonable to consider a type of commentary manuscript in which one simply rewrote the discussions from published articles in which the authors incorrectly interpreted their statistics. Or perhaps a style of review article in which, rather than simply summarizing the conclusions made by authors, there was actually an attempt to critically reinterpret their reported data to make more accurate statements about the evidence supporting B versus A.


This gets at the role of journals vs. blogs, wikis, and discussion boards. I see online resources as more dynamic and up to date, and better encouraging discussion. One thought is to convert my original post into a wiki, which is implemented in the software we are using here; I just haven’t had experience with such a conversion.


In a clinical setting this statement makes sense. The problem is that statistics is now being applied to a vast array of issues in a far broader context: the rise of “data science” and of reporting and reasoning on public issues has led to a huge interest in the statistical analysis of data and its interpretation. This exposes statistical terms to people other than clinicians, researchers and statisticians. These people already have a firmly established idea of what significant means–and it doesn’t match the statistical definition. Better education would solve the problem, but I don’t see it happening.

What’s needed is a word whose common meaning corresponds better to the statistical definition, and it had better be very close, because subtle differences in meaning are arguably worse. The closest I’ve come up with is ‘discernable’. Think about how the following statements are perceived by someone with no statistical training: “The study found no discernable difference, but may have been underpowered to find any.” “The study found a discernable difference between the two groups.” “At n=XXX, this study will be powered to discern a difference of XXX between groups, assuming variance is XXX or less.”

“It’s what they say they heard.” --Barry Nussbaum

Then the phrases “statistically discernable” and “clinically significant” don’t seem to lead to as much confusion.


As a clinician researcher, I struggle with some of the suggested interpretations here. E.g. if a clinical trial is adequately powered for what is deemed to be a minimal clinically important difference and the results are found to be “not statistically significant,” why would it be incorrect to say that there is no significant difference between treatments? In the BP example, if a meaningful difference was deemed to be 15mmHg, and the results demonstrated -4 [95%CI: -13, 5], have you not established that your study does not support the use of the new treatment?

Sure, there could be a smaller difference in favour of one of the interventions, and a Bayesian re-analysis could provide you with the probability of there being any difference, but if those differences are smaller than what is clinically meaningful, why should we be emphasizing their importance? This seems counter to the movement towards defining results based on MCID.


Your excellent questions raise a number of important issues. Here are some thoughts.

  • Power is a pre-study issue and should be largely ignored post-study. I’ve never been involved in a power calculation that wasn’t gamed in some way, usually modifying the effect to detect (MCID) to meet a budget constraint. The easiest way to see that power should be ignored post-study is when the standard deviation used in the calculation turns out to be much smaller than the observed SD at the end.
  • In some fields, e.g. orthopedics and sports medicine, the way MCIDs are derived is a joke.
  • Most researchers look at the compatibility interval by itself to make judgments about clinical usefulness without reference to the MCID, because readers often have a different value in mind for MCID from the one used by the study’s budget director. But you’re correct that whoever has an MCID in mind can see if the compatibility interval includes it.
  • “Not statistically significant” and “no significant difference between treatments” on their own convey no useful information. If they did, you would not need to randomize more than 2 patients. These terms also invite the absence of evidence error. Putting these in the context of power would have helped had power been useful post-study.
  • IMHO we need to move to Bayesian primary analysis, not re-analysis. Under a given agreed-upon prior distribution for the treatment effect, you can compute the posterior probability that the SBP difference is in the right direction, and that it is more impressive than -2, -4, -8, -10, etc. To me this is getting much more directly at the question. Compatibility intervals set long-run frequencies and backsolve for effect limits. Bayesian posterior probabilities let the researcher set any number of effect sizes and compute probabilities of those.
  • All of your questions at their core reflect our severe allergy to formal decision analysis. If all pertinent parties could agree on the utility function and the utility function included SBP, side effects, and cost, one can make the formal Bayes decision that maximizes expected utility. That is really what should drive everything else, especially to Bayesian/parameter space folks like me. Since we are not brave enough to go on record with a utility function, we can at least base the analysis on the inputs to the expected utility calculation—the posterior distribution for treatment effect.
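As a rough illustration of the posterior probabilities described in the list above, here is a sketch using the numbers from the blood pressure question (estimate -4, 95% interval -13 to 5). It rests on strong simplifying assumptions that are mine, not part of the original posts: the standard error is back-calculated from the interval under a normal approximation, and a flat prior stands in for the "agreed-upon prior", so the posterior is just a normal distribution centered on the estimate. A real analysis would use a prior the parties actually agree on. The function name `prob_more_impressive_than` is hypothetical.

```python
from statistics import NormalDist

# Numbers from the blood pressure example in this thread
estimate, lower, upper = -4.0, -13.0, 5.0

# Back-calculate the standard error from the 95% interval (normal approximation)
z = NormalDist().inv_cdf(0.975)
se = (upper - lower) / (2 * z)

# ASSUMPTION: flat prior + normal likelihood, so posterior is N(estimate, se^2)
posterior = NormalDist(mu=estimate, sigma=se)


def prob_more_impressive_than(threshold):
    """Posterior probability that the SBP difference is below (more negative
    than) the threshold, i.e. a larger reduction."""
    return posterior.cdf(threshold)


for threshold in (0, -2, -4, -8, -10, -15):
    p = prob_more_impressive_than(threshold)
    print(f"P(difference < {threshold:>3}) = {p:.3f}")
```

Under these assumptions, one set of data yields a probability for every effect size a reader cares about, including their own MCID, instead of a single yes/no statement.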


Pure gold! Crystalized thus, this principle seems right-sized for uptake by clinicians. I can see this being used on teaching rounds to great effect.


I know this is meant as a shortcut, but it is very important to remind users, in each statement, that interpretation is conditional on an extended model, which contains a hypothesis and many assumptions about experimental design, measurements, estimators, and so on. So in this case, the statement might be:

“We found limited evidence against our extended model, which includes the hypothesis that A=B (p=0.4)”.

This is crucial, because the p value in many situations has very little relationship to the hypothesis, but is strongly influenced by one or more assumptions. Trivial but common cases include using inappropriate statistical tests and estimators not robust to outliers. I don’t think it is a good idea to focus exclusively on small sample sizes as an area of improvement.


Well said. My only suggestion is to drop the word ‘extended’ but make it clear that the model includes several aspects, including the form of the mean, how dispersion is captured and whether dispersion varies with type of subject, sometimes the normality of residuals, patient outcomes being independent of each other, absence of interactions …


If you write an article about these language guidelines (which would be very useful), then a paragraph would be devoted to explaining the meaning of “model” to avoid repetitions. This aspect is mostly ignored in my research fields, where mindless dichotomania is the norm. The best description of the “extended model” I’ve come across is from Paul Meehl.


Great post!

You write “0.95 of such varying confidence intervals”.

Wouldn’t it be clearer to write “95% of such varying confidence intervals”?


Some people prefer to multiply probabilities by 100; some don’t. I’m in the latter category as discussed here. Note that probabilities are always between 0 and 1, so if you discuss 100 x probabilities you need another word, to be picky.
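Whichever way it is written, the quantity in question is a long-run proportion of intervals. A small simulation can make that concrete. This is only a sketch under simple assumptions of my own choosing: normally distributed data with a known standard deviation, a z-based interval for the mean, and arbitrary values for the sample size and number of repetitions. The function name `coverage` is hypothetical.

```python
import random
from statistics import NormalDist


def coverage(true_mean=0.0, sigma=1.0, n=20, n_intervals=10_000, level=0.95, seed=0):
    """Proportion of simulated intervals for the mean that contain the true mean.

    Assumes normal data with known sigma, so the interval is xbar +/- z*sigma/sqrt(n).
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * sigma / n ** 0.5

    hits = 0
    for _ in range(n_intervals):
        xbar = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        if xbar - half_width <= true_mean <= xbar + half_width:
            hits += 1
    return hits / n_intervals


print(f"proportion of intervals covering the true mean: {coverage():.3f}")
```

The result is a number between 0 and 1, close to 0.95 when the model assumptions hold; multiplying by 100 gives the percentage form.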


What about the increasing use of the ‘fragility index’ for communicating frequentist results? I don’t see biostatisticians pushing back against this, but it seems, instinctively, an awful idea.


You can’t keep putting a bandaid on a deep wound and hoping for a cure …


I see. It’s just that this specific wording looks unfamiliar to me (e.g., writing “0.2 of the US population” to mean “20% of the US population”). Not a bad idea, but not standard language to me.

But I understand the point against percentages. I’ve also complained against them.


I agree about the Fragility Index. I wrote some thoughts about it here

(originally posted on 24th June 2016, hence the comment at the start; since moved to a different site).