Language for communicating frequentist results about treatment effects



Is this expressing a belief that most hypothesised causal effects cannot be exactly zero?

I don’t think it’s necessary to talk about causality, but I do agree that associations (in particular in the life sciences) are unlikely to be zero. That is very important, because it would mean that the focus on the presence or absence of an association is mistaken. Rather, we should be talking about the direction and magnitude, as Gelman suggests with his type S and M errors.
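To make the Type S (sign) and Type M (magnitude) idea concrete, here is a minimal Python sketch of a design calculation in the spirit of the Gelman and Carlin article cited later in this thread. The function name `retrodesign_sketch`, the normal sampling model, and the example inputs (a true effect of 2 with a standard error of 8.1) are assumptions chosen for illustration, not values from this discussion.

```python
import random
from statistics import NormalDist


def retrodesign_sketch(true_effect, se, alpha=0.05, n_sims=100_000, seed=1):
    """Approximate power, Type S error rate, and exaggeration ratio.

    Assumes the estimate is normally distributed around a positive
    true_effect with known standard error se (an illustrative model).
    """
    if true_effect <= 0 or se <= 0:
        raise ValueError("this sketch expects true_effect > 0 and se > 0")
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = true_effect / se

    # Probability of a "significant" estimate with the correct (positive) sign
    p_sig_positive = 1 - std_normal.cdf(z_crit - shift)
    # Probability of a "significant" estimate with the wrong (negative) sign
    p_sig_negative = std_normal.cdf(-z_crit - shift)
    power = p_sig_positive + p_sig_negative
    type_s = p_sig_negative / power

    # Type M (exaggeration ratio): average size of the significant
    # estimates relative to the true effect, approximated by simulation.
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, se)
        if abs(estimate) > z_crit * se:
            significant.append(abs(estimate))
    exaggeration = (sum(significant) / len(significant)) / true_effect
    return power, type_s, exaggeration


# Hypothetical inputs: a small true effect measured with a large standard error.
power, type_s, exaggeration = retrodesign_sketch(true_effect=2.0, se=8.1)
print(f"power={power:.3f}, type S={type_s:.3f}, exaggeration={exaggeration:.1f}")
```

When power is low, the estimates that reach significance have a real chance of carrying the wrong sign and tend to overstate the magnitude, which is why sign and magnitude deserve more attention than presence or absence.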

This topic is getting very long, however, so I might raise it as a new topic to see if anyone has good arguments against it.

Great idea!


I agree, but this was not my point: any 95% compatibility interval, whether it covers the null value (so that p > 0.05) or not, “sheds some light on the magnitude of the effect.” An interval covering the null only shows that the null hypothesis is one of many different hypotheses that are compatible with the data, given the model.

Indeed, I would say that any compatibility interval that is not covering the entire range of possible values “sheds some light on the magnitude of the effect.” How useful the obtained information is depends, among other things, on where and how wide the interval is relative to the range of effect sizes we would judge interesting or important.


Not assuming a zero effect would probably change the language used for the results, at least sometimes. And probably inferences sometimes too. I’ll set up a ‘zero effect’ topic soon.

But I think mentioning causality would also be useful. Most research questions are causal, including any that assess ‘treatment effects’, because ‘effect’ implies a causal mechanism. And people seem to understand the world in terms of cause and effect. For example, perhaps the only way we can understand a non-causal association between variables in a study is by thinking of other causes that could produce the association, either through confounding or collider bias.

And the word ‘direction’ in Gelman’s suggestion likewise would apply to a causal question, because associations have no direction … actually, I’ve not interpreted the word ‘direction’ accurately. The word Gelman used was ‘sign’, which is probably what you meant by ‘direction’. He does use the term ‘direction’ in his article with Carlin (DOI: 10.1177/1745691614551642), but he means up or down (positive or negative). On the other hand, Gelman and Carlin do talk about ‘direction of effects’, suggesting they are, at least, thinking of causes.


Yep, that’s what I meant.


Those are good suggestions.


Fascinating discussion and a well-spent hour to start the morning. It made me wonder if it would be at all reasonable to consider a type of commentary manuscript in which one simply rewrote the discussions from published articles in which the authors incorrectly interpreted their statistics. Or perhaps a style of review article in which, rather than simply summarizing the conclusions made by the authors, there was actually an attempt to critically reinterpret their reported data to make more accurate statements about the evidence supporting B versus A.


This gets at the role of journals vs. blogs, wikis, and discussion boards. I see online resources as more dynamic and up to date, and better encouraging discussion. One thought is to convert my original post into a wiki, which is implemented in the software we are using here; I just haven’t had experience with such a conversion.


In a clinical setting this statement makes sense. The problem is that the applicability of statistics to a vast array of issues is being realized in a far broader context today. The rise of “data science” and of reporting and reasoning on public issues has led to huge interest in the statistical analysis of data and its interpretation. This exposes statistical terms to people beyond clinicians, researchers and statisticians. That wider audience already has a firmly established idea of what “significant” means, and it doesn’t match the statistical definition. Better education would solve the problem, but I don’t see it happening.

What’s needed is a word whose common meaning corresponds better to its statistical definition, and the match had better be very close, because subtle differences in meaning are arguably worse. The closest I’ve come up with is ‘discernable’. Think about how the following statements would be perceived by someone with no statistical training: “The study found no discernable difference, but may have been underpowered to find any.” “The study found a discernable difference between the two groups.” “At n = XXX, this study will be powered to discern a difference of XXX between groups, assuming the variance is XXX or less.”

“It’s what they say they heard.” --Barry Nussbaum

Then the phrases “statistically discernable” and “clinically significant” don’t seem to lead to as much confusion.


As a clinician researcher, I struggle with some of the suggested interpretations here. E.g. if a clinical trial is adequately powered for what is deemed to be a minimal clinically important difference (MCID) and the results are found to be “not statistically significant,” why would it be incorrect to say that there is no significant difference between treatments? In the BP example, if a meaningful difference was deemed to be 15 mmHg, and the results demonstrated -4 [95% CI: -13, 5], have you not established that your study does not support the use of the new treatment?

Sure, there could be a smaller difference in favour of one of the interventions, and a Bayesian re-analysis could provide you with the probability of there being any difference, but if those differences are smaller than what is clinically meaningful, why should we be emphasizing their importance? This seems counter to the movement towards defining results based on the MCID.
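As a rough illustration of the BP example, the Python sketch below reads the reported interval of -4 [-13, 5] against both the null and a 15 mmHg MCID. It assumes a normal approximation and that a negative difference means lower blood pressure on the new treatment; the helper names `implied_standard_error`, `p_value`, and `describe_interval` are made up for this sketch.

```python
from statistics import NormalDist


def implied_standard_error(lower, upper, level=0.95):
    """Back out the standard error from a symmetric normal-theory interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (upper - lower) / (2 * z)


def p_value(estimate, se, hypothesised):
    """Two-sided p-value for any hypothesised effect, not only zero."""
    z = (estimate - hypothesised) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def describe_interval(lower, upper, mcid, null=0.0):
    """Report which reference values lie inside the compatibility interval."""
    return {
        "null_inside": lower <= null <= upper,
        # Assumption: benefit is a reduction, so the MCID sits at -|mcid|.
        "benefit_mcid_inside": lower <= -abs(mcid) <= upper,
    }


estimate, lower, upper, mcid = -4.0, -13.0, 5.0, 15.0
se = implied_standard_error(lower, upper)
summary = describe_interval(lower, upper, mcid)

print(summary)
print(f"implied SE: {se:.2f} mmHg")
for hypothesis in (0.0, -10.0, -mcid):
    print(f"p-value against a difference of {hypothesis}: "
          f"{p_value(estimate, se, hypothesis):.3f}")
```

Under these assumptions the interval contains zero but not a 15 mmHg reduction, so “the data are not compatible with a benefit as large as the MCID” is a more informative statement than “no significant difference.” Reductions of up to 13 mmHg remain compatible with the data.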


Your excellent questions raise a number of important issues. Here are some thoughts.

  • Power is a pre-study issue and should be largely ignored post-study. I’ve never been involved in a power calculation that wasn’t gamed in some way, usually modifying the effect to detect (MCID) to meet a budget constraint. The easiest way to see that power should be ignored post-study is when the standard deviation used in the calculation turns out to be much smaller than the observed SD at the end.
  • In some fields, e.g. orthopedics and sports medicine, the way MCIDs are derived is a joke.
  • Most researchers look at the compatibility interval by itself to make judgments about clinical usefulness without reference to the MCID, because readers often have a different value in mind for MCID from the one used by the study’s budget director. But you’re correct that whoever has an MCID in mind can see if the compatibility interval includes it.
  • “Not statistically significant” and “no significant difference between treatments” on their own convey no useful information. If they did, you would not need to randomize more than 2 patients. These terms also invite the absence of evidence error. Putting these in the context of power would have helped had power been useful post-study.
  • IMHO we need to move to Bayesian primary analysis, not re-analysis. Under a given agreed-upon prior distribution for the treatment effect, you can compute the posterior probability that the SBP difference is in the right direction, and that it is more impressive than -2, -4, -8, -10, etc. To me this is getting much more directly at the question. Compatibility intervals set long-run frequencies and backsolve for effect limits. Bayesian posterior probabilities let the researcher set any number of effect sizes and compute the probabilities of those.
  • All of your questions at their core reflect our severe allergy to formal decision analysis. If all pertinent parties could agree on the utility function and the utility function included SBP, side effects, and cost, one can make the formal Bayes decision that maximizes expected utility. That is really what should drive everything else, especially to Bayesian/parameter space folks like me. Since we are not brave enough to go on record with a utility function, we can at least base the analysis on the inputs to the expected utility calculation—the posterior distribution for treatment effect.
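The posterior probabilities mentioned in the list above can be sketched in Python with a conjugate normal model. Everything here beyond the reported -4 [-13, 5] is an assumption for illustration: the normal approximation to the likelihood, the function name `normal_posterior`, and especially the Normal(0, 10) prior in mmHg, which is a placeholder and not an agreed-upon prior.

```python
from statistics import NormalDist


def normal_posterior(estimate, se, prior_mean, prior_sd):
    """Combine a normal prior with a normal likelihood (known se)."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = 1 / se ** 2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * estimate)
    return NormalDist(post_mean, post_var ** 0.5)


# Reported result: difference in SBP of -4 with 95% interval (-13, 5).
estimate = -4.0
se = (5.0 - (-13.0)) / (2 * NormalDist().inv_cdf(0.975))

# Hypothetical prior centred on no effect with an SD of 10 mmHg.
posterior = normal_posterior(estimate, se, prior_mean=0.0, prior_sd=10.0)

# Probability that the difference is more impressive than each cutoff
# (more negative, assuming lower SBP is the beneficial direction).
for cutoff in (0, -2, -4, -8, -10):
    print(f"P(difference < {cutoff}) = {posterior.cdf(cutoff):.3f}")
```

The same posterior would be the input to an expected utility calculation if the parties could agree on a utility function. Trying a few different priors is a cheap way to see how much the conclusions depend on that choice.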


Pure gold! Crystalized thus, this principle seems right-sized for uptake by clinicians. I can see this being used on teaching rounds to great effect.