Yes, I’m fully in agreement with that!
The compatibility interpretation of P-values and their derived “confidence” intervals goes way back. I haven’t done a historical study, but I think D. R. Cox mentioned it long ago somewhere (probably where I got it), and the idea, if not the word, can be found in his, Good’s, and some of Fisher’s writings on the meaning of “significance levels” (P-values, not alpha levels). “Compatibility” was thus used a fair bit in
Greenland, S., Senn, S.J., Rothman, K.J., Carlin, J.B., Poole, C., Goodman, S.N., and Altman, D.G. (2016). Statistical tests, confidence intervals, and power: A guide to misinterpretations. The American Statistician, 70, online supplement 1 available free at
which was reprinted in the European Journal of Epidemiology 2016, 31, 337-350, and available free from that website as well. The evidential interpretation of compatibility using the Shannon information measure (S-value) s = log2(1/p) = -log2(p) is discussed on p. 642 of
Greenland, S. (2017). The need for cognitive science in methodology. American Journal of Epidemiology, 186, 639-645, available free at
- see also p. 10 in this forthcoming article (posted earlier):
Amrhein, V., Trafimow, D., and Greenland, S. (2018). Inferential statistics as descriptive statistics: There is no replication crisis if we don’t expect replication. The American Statistician, 72, available free at
Further details about compatibility, refutational information, and S-values are in this forthcoming article:
Greenland, S. (2018). Some misleading criticisms of P-values and their resolution with S-values. The American Statistician, 72, which will be free as well.
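The S-value transform cited above is a one-liner to compute. Here is a minimal Python sketch (the function name and the coin-toss framing in the comments are mine, following the usual surprisal interpretation, not code from any of the cited papers):

```python
import math

def s_value(p):
    """Shannon (surprisal) transform of a P-value:
    s = -log2(p), the bits of refutational information
    against the test hypothesis supplied by p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)

# p = 0.05 carries about 4.3 bits of information against the test model,
# roughly as surprising as 4-5 heads in a row from a fair coin.
print(round(s_value(0.05), 2))  # 4.32
print(s_value(0.5))             # 1.0 -- one coin-toss worth of surprise
```

A nice feature for communication is that, unlike the P-value scale, equal differences in S-values represent equal amounts of information.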
Then I should hope you both are in full agreement with renaming “confidence intervals” to something less likely to evoke posterior intervals, like “compatibility intervals”.
More generally, there needs to be an appreciation of how ill-considered some of these traditions and terms have turned out to be, and less hesitance about using reformed concepts and terms (while making clear the mapping from the traditional to the reformed).
Definitely, “confidence interval” is a terrible term and is probably a major cause of misunderstanding.
Sander: I’m not so sure. I’m assuming the term “confidence” was originally intended to stay away from probability, chance, or anything like that. Well, that didn’t work!
I’m afraid people will always interpret interval estimates as probability statements, no matter what they’re called. While better terminology is certainly important, I believe that better intervals are more important. In my opinion, B/2 +/- 1.96 SE/sqrt(2) is a lot better than B +/- 1.96 SE, at least for regression coefficients (other than the intercept) in the life sciences.
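For concreteness, the two intervals being compared can be sketched as follows, assuming a regression coefficient estimate B with standard error SE and Wald-type 95% limits (the function names are mine, and the shrunk interval is this commenter’s proposal, not a standard method):

```python
import math

def wald_interval(b, se, z=1.96):
    """Conventional 95% interval: B +/- 1.96*SE."""
    return (b - z * se, b + z * se)

def shrunk_interval(b, se, z=1.96):
    """The commenter's proposal: halve the estimate and
    shrink the margin by sqrt(2): B/2 +/- 1.96*SE/sqrt(2)."""
    half = z * se / math.sqrt(2)
    return (b / 2 - half, b / 2 + half)

b, se = 4.0, 2.0  # hypothetical coefficient and standard error
print(tuple(round(x, 2) for x in wald_interval(b, se)))    # (0.08, 7.92)
print(tuple(round(x, 2) for x in shrunk_interval(b, se)))  # (-0.77, 4.77)
```

The proposal amounts to shrinking the point estimate toward zero and narrowing the margin of error, in the spirit of correcting the tendency of published effect estimates to overstate true effects.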
Unclear: You say “I’m not so sure” and then “Well, that didn’t work!”
There is no way to be sure about any of this, but it seems pretty damn clear that just trying to get people to use the traditional terms properly (which for “confidence” has been underway since at least Rothman’s editorials in the 1970s, when I was a student!) has failed utterly, spectacularly, and miserably, and that a surefire way to prevent improvement is to stay that course.
This blog page is about ways to move forward in verbal description of basic concepts, so do you have a terminological (not methodological) suggestion that you can argue is better than replacing “confidence” with “compatibility”?
Unclear: You say “I’m not so sure” and then “Well, that didn’t work!”
I meant to say I’m unsure if using the term “compatibility” will improve matters. I’m quite sure the term “confidence” did not work out well!
This blog page is about ways to move forward in verbal description of basic concepts, so do you have a suggestion that you can argue is better than replacing “confidence” with “compatibility”?
I don’t have a better term. I do think I have a better interval, but that may be off-topic.
I agree with this statement. It seems most of the focus has been on translating results into language. It may also be helpful to provide a guide that does the opposite, taking commonly used (or misused) linguistic statements and providing examples of the corresponding statistical quantities. Even if commonly used language does not actually have any meaning in frequentist statistics, pointing this out will still help to educate and avoid error.
I suspect that many journal article readers will want the interpretation/judgement of the study authors, and preferably a judgement that takes into account existing knowledge, because the authors would often know more about the research question than the reader.
Also, just stating the results may instead leave readers making the interpretation they would have made had it been their results, such as “no association” if a CI includes zero.
As a suggestion, perhaps if study authors were persuaded to consider alternative explanations for their results then the language would take care of itself? If there were sufficient incentives to report at least one or two, which would also ensure that they thought of one or two, it would seem unlikely that a strongly confident conclusion could be presented alongside an alternative explanation based on very plausible confounding.
Tim: Removing positive/negative judgments does not mean avoiding any kind of judgement by the study authors. It just means removing dichotomous decisions. The study authors should discuss “how strong they judge their evidence not only based on the width of an interval estimate, but also in view of possible shortcomings of their study, of their prior knowledge about other studies (…), and of possible costs of their interpretation for the health of the patients” (quote from Amrhein, Trafimow & Greenland cited above).
The risk that readers make the ‘absence of evidence is not evidence of absence’ error should be reduced by explicitly stating (and discussing!) the lower and upper limits of the compatibility interval, instead of just giving values in brackets. I suggest extending Frank’s “Safe Statement of Results Agnostic to Positive/Negative” as follows:
“Treatment B was observed in our sample of n subjects to have a 4mmHg lower mean SBP than treatment A with a 0.95 2-sided confidence interval of [-5,13]. The confidence interval indicates a wide range of plausible true treatment effects.”
Version reducing the ‘absence of evidence is not evidence of absence’ error risk:
“Treatment B was observed in our sample of n subjects to have a 4mmHg lower mean SBP than treatment A. Possible differences between treatments that are highly compatible with our data, given our model, ranged from -5 (treatment B having a 5mmHg lower mean SBP than treatment A) to 13 (treatment B having a 13mmHg higher mean SBP than treatment A; values are limits of the 95% compatibility interval).”
(Not sure I interpreted the effect sizes correctly; should not the point estimate be B-A = -4mmHg? Then the interval [-5,13] looks a bit weird).
Thanks Valentin. Yes, language is a tricky thing. I agree that making strong yes/no decisions based on a P-value cutoff is often poor science.
However, I would suggest that a judgement about the existence of a treatment effect will either be that:
a. the study found no useful evidence
b. the evidence supports the existence of the treatment effect - even if the support is very weak it might still be labelled ‘positive’
c. the evidence supports the non-existence of the treatment effect - even if the support is very weak it might still be labelled ‘negative’
It was in this sense that I interpreted the words positive and negative, with Frank seeming to suggest that such decisions could be left to the reader.
But I’m not sure what judgement your ‘Version reducing the ‘absence of evidence is not evidence of absence’ error risk’ is suggesting? It seems to be just stating the numbers with no explicit judgement?
I have done a poor job regarding consistency of language and the direction of the B-A subtraction of means. I’ll edit my earlier posts to use -4mmHg as the point estimate of efficacy, reflecting that B is better than A in the observed data.
A problem remains: if the compatibility interval is extremely wide, the more correct answer is “a. the study found no useful evidence” but we would see most authors instead clinging to b. or c.
Tim, thanks for your answer, but my above suggestion for a formulation explicitly making use of the compatibility interval was for the results section, not for the discussion. As a zoologist who did not author this study on a medical treatment, I’ve no idea about possible shortcomings of the study, about prior information on the treatments, or about possible costs for patients.
Whether the evidence is considered ‘useful’ depends on all those factors, and on many more. If I were a patient with a serious need for treatment, if this were the only study available, and if the study generally looked well done, what would I infer from a point estimate of 4 (leaving aside my confusion about the sign of the estimate) and from a compatibility interval of -5 to 13? I would probably take the point estimate as our best bet for what we want to know, look how bad a treatment effect of -5 would possibly be, and hope for an effect of 4 or even up to 13.
My point is:
Let us free our “negative” results by allowing them to be positive contributions to knowledge.
Let us stop using single studies to speculate on the mere existence of an effect.
If we can infer anything at all from single studies, then we should do it by speculating about the size of an effect.
References on those points can be found in the following review, in which we also tried some formulations about different compatibility intervals (which, oddly enough, we used to call confidence intervals at that time):
Indeed, I’m not sure if I’ve ever seen a study with a ‘no useful evidence found’ conclusion.
An extremely wide compatibility interval would certainly suggest to me that just about any answer to the research question was possible from the data at hand. Hence, ‘no useful evidence was found’ is probably the best answer. But I suspect this would be embarrassing to publish, partly because it suggests either a lack of data, pointing to poor study planning, or a very inappropriate method of analysis. I’m yet to come across such wide compatibility intervals, however, so my relevant experience may be lacking.
But if the compatibility interval is not wide, yet the P-value is high, would you still opt for ‘no useful evidence was found’?
As alluded to earlier by Sander, a substantial part of the problem lies in how our brains work. However, I want to first note that while I’m currently finishing my biostatistics PhD thesis, I’m yet to publish anything, which makes me a somewhat unknown source of information. Nevertheless, in my thesis I make some use of the cognitive science literature, and (the difficulties of generating evidence in a field like psychology notwithstanding) there seems to be plenty of support for the idea that people have a built-in, and often not conscious, preference or need for causal explanations. From an evolutionary point of view: in order to respond appropriately to our environment, it needs to make sense to us.
This may partly explain why many researchers believe a high P-value is best explained by “the null hypothesis of no effect (or no difference) is true”. It is, perhaps, the only ‘obvious’ explanation for why a treatment effect, or difference in treatment effects, appeared close to zero. Hence, getting researchers to think of other plausible reasons that might ‘explain’ the result may be the best way to encourage a greater sense of uncertainty.
Thanks for clearing that up. My focus was on the abstract, and especially the final conclusion.
I agree with everything in your post, though I will need to read through your papers to get a broader understanding of the points you are making.
Yes it would be embarrassing because it would be true. The study then becomes a pilot study in effect.
I wouldn’t say “no useful evidence” but something more like “While the study provided some evidence against the supposition of no treatment effect, it did not shed much light on the magnitude of the effect.”
Not at all - we desperately need ideas from those not “tainted” by decades of practice.
Frank, just a gentle correction: I think it should be “While the study provided little evidence against the supposition of no treatment effect, it did shed some light on the magnitude of the effect.”
Sorry; I meant my comment to apply to the “low p-value wide interval” case.
Is this expressing a belief that most hypothesised causal effects cannot be exactly zero? It’s a view that has been raised by only a few authors in the literature (e.g., Nester, Tukey, Gelman, and yourself), and I also find it compelling.
This topic is getting very long, however, so I might raise it as a new topic to see if anyone has good arguments against it.