I don’t think we need guidelines so much as a few resources that discuss, in more depth, the problematic usage and history of the commonly used terms and phrases.
I do wonder if Ron would be open to a collaboration (it doesn’t have to be ASA-related) to create a resource where common usage of phrases/terms is discussed.
I appreciate the discussion here and believe much of it has been incredibly insightful, but I believe it’s also incredibly restricted and only visited by those who are already interested in proper statistics. Some of the problems that have been brought up here, such as the damage done by usage of the terms “significance” and “confidence”, have not really been brought up elsewhere.
For example, the ASA statement and its supplements, including @Sander’s, discuss some of the problems with dichotomization and misinterpretations, and Sander’s recent papers on P-value behavior also discuss some of the discrepancies in the definition of “significance” etc. But it doesn’t seem that an argument (similar to the ones in this forum by @f2harrell or @Sander) for why usage of these terms is detrimental to statistics and science has been fleshed out in any sort of paper that can actually be referenced.
If there is one, I would appreciate a link. I think this is something worth thinking about because if I were to go on a universal resource like Wikipedia and attempt to make some edits to add the phrase “compatibility interval” to the entry on confidence intervals, it is very likely that my suggestions would get rejected without some sort of authoritative reference.
I can get behind these suggestions and completely avoid the term “significance” (even in a hypothesis testing framework) by sticking to “xx hypothesis is/is not rejected by our prespecified criterion” and behind “compatibility intervals.” I’ve edited my posts near the top of the forum to reflect these thoughts.
I’m very late to this party, but still want to offer my 2 cents.
95% confidence does not seem to be such a useful property because it refers to the procedure rather than the observed data. So it’s difficult to fault researchers for interpreting confidence intervals as if they are credibility (or posterior) intervals. What else are they supposed to do with them?
Better terminology is certainly important, but I think it’s even more important to make better intervals.
Not at all with my collaborations over the years. A fundamental reason why there is no 1-1 equivalence between Bayes and frequentists in many situations is that there is no agreement among frequentists in how to handle multiplicity. Areas in which 1-1 equivalence is impossible or at least very difficult are in sequential testing and other situations with stochastic N, and multiple outcomes. For multiple outcomes, e.g., a Bayesian posterior probability that a treatment makes a major improvement on two specific outcomes, a moderate improvement on one other target, or any improvement in mortality, it’s easy to say that there is an equivalence but no one has shown how to derive it.
That’s not correct in my opinion. The big new assumption of Bayes is the prior, and the prior is in plain sight and criticizable. You are avoiding decision making and not recognizing how concordant Bayesian posterior inference is with decision making. Sample space considerations are very hard to be transparent about, and often very hard to take into account. The idea that frequentists can make coherent decisions with calculations about data is a hard sell. This whole argument is caused by fundamental problems in indirect reasoning.
I like everything else you said.
Let’s stop the Bayesian debate at this point and try to come to the best language for communicating and interpreting frequentist results that might be used in the medical literature. Your feedback on the “more agnostic” edits I made yesterday would help us all move forward in getting more reasonable language.
The Bayes-frequentist debate enters the terminology debate especially when people start doing uncritical promotion of one or the other, because as I wrote before the dichotomy is disconnected from the details. The claims you put forth not only ignore abundant frequentist criticisms, but also disregard the vast diversity and divergences of methods and “philosophies” (or narrow dogmatisms) falling under the headings of “frequentist” and “Bayesian”. Fisherian frequentism in its most likelihoodist form is for all practical purposes hardly distinguishable from reference Bayes, and both of those hardly resemble the extremes of Neymanian hypothesis testing or operational subjective Bayesianism. Furthermore, the latter extremes are united by many connections (e.g. admissibility of Bayes estimators).
I’ll continue to state that various claims of this or that being superior are misleading if they are not circumscribed by the purpose and context of the analysis. Claims that Bayesianisms (whatever their form) are uniformly desirable are like claiming sports cars are simply the best cars; fine if you are on the autobahn, but see how far that gets you on a rutted dirt road. Likewise, where a Bayesian method and a frequentist method diverge, it can be the Bayesian method which is the one leading us into disastrous error - that was noted by Good long ago even though he was philosophically Bayesian.
What some of us know now is that Bayes can break down without the prior being wrong in any obvious public sense - see the literature on sparse-data inference; as I’ve cited before, there is a lovely summary on pages 635-6 of Ritov, Bickel et al., JASA 2014. It’s all just a manifestation of the NFLP (No-Free-Lunch-Principle): Every advantage you list (when real) is bought by some disadvantage elsewhere.
So again, regarding terminology, both “frequentist” and “Bayesian” need to be reoriented to talking about methodologic tools, not philosophies or dogmas or dictates. When we do that we can find that we can use P-values to check Bayesian analyses, Bayesian posteriors to check compatibility intervals, etc. By doing both we can see when conventional P-values aren’t close to posterior probabilities and their corresponding compatibility intervals aren’t close to posterior intervals, because we have them both before us at once to compare. And we can then see why we should reject highly subjective posterior-evoking terms like “confidence interval” or “uncertainty interval” to refer to the range between level points on a P-curve, and why we should reorient frequentist terms and descriptions to emphasize the complementarity of frequentist and Bayesian tools.
Zad: Researchers tend to interpret CIs as if they’re credibility intervals. And that’s correct if one assumes the uniform (flat) prior. So for this reason, and because of certain invariance properties, the uniform prior is considered by many to be objective or non-informative. The trouble is that the uniform prior is actually very informative both for the sign and magnitude of the parameter. When estimating a regression coefficient (other than the intercept), that’s a real problem because zero has the special importance of “no effect”.
If the parameter you want to estimate is a regression coefficient (other than the intercept) in a bio-medical context, then that’s prior information in itself. I sampled 50 random papers from MEDLINE to estimate a prior that can be used as a default in that situation.
Now suppose you have an (approximately) normal and unbiased estimate B of a regression coefficient with standard error se. The default prior that I got from MEDLINE gives B/2 +/- 1.96*se/sqrt(2) as a suitable 95% default credible interval.
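One way to read that formula, offered as a sketch and not necessarily how the MEDLINE-based prior was derived: under the textbook normal-normal conjugate update with a prior centered at zero, the posterior is exactly B/2 +/- 1.96*se/sqrt(2) when the prior standard deviation equals se. As the prior gets very wide, the same update approaches the flat-prior case, B +/- 1.96*se. The function names and the numbers b = 1.0, se = 0.5 below are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def normal_posterior(b, se, prior_sd, prior_mean=0.0):
    """Conjugate normal-normal update; returns (posterior mean, posterior sd)."""
    w = prior_sd**2 / (prior_sd**2 + se**2)  # weight given to the estimate
    post_mean = w * b + (1.0 - w) * prior_mean
    post_sd = sqrt(w) * se
    return post_mean, post_sd


def central_interval(mean, sd, level=0.95):
    """Central interval of a normal distribution with the given mean and sd."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return mean - z * sd, mean + z * sd


b, se = 1.0, 0.5  # hypothetical estimate and standard error

# Prior sd equal to se: the estimate is shrunk halfway to zero.
m, s = normal_posterior(b, se, prior_sd=se)
shrunk = central_interval(m, s)

# Very wide prior: approaches the flat-prior case, B +/- 1.96*se.
m_flat, s_flat = normal_posterior(b, se, prior_sd=1e6 * se)
flat = central_interval(m_flat, s_flat)

print("prior sd = se:  ", shrunk)
print("very wide prior:", flat)
```

The shrunken interval is both narrower and pulled toward zero, which is the practical difference between the two formulas.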
The example of Prof. Harrell has a difference of mean blood pressure, which can be viewed as a regression coefficient in a bio-medical context. The p-value is 0.4 and the 0.95 CI is [-5, 13]. So, the estimate is 4 and its standard error is 4.5. The default credible interval I’m proposing would be [-4.2, 8.2]. The default posterior probability that the true difference is positive would be 1 - pnorm(0,B/2,se/sqrt(2))=0.73.
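For anyone who wants to check the arithmetic, here is a minimal sketch of the calculation above using only the Python standard library. It takes the 0.95 CI [-5, 13] and the rounded standard error 4.5 from the post; the variable names are mine. Note that backing the standard error out of the CI gives roughly 4.6 rather than 4.5, so the last digit of each result depends on which value is used.

```python
from math import sqrt
from statistics import NormalDist

z975 = NormalDist().inv_cdf(0.975)  # about 1.96

ci_low, ci_high = -5.0, 13.0  # reported 0.95 CI
B = (ci_low + ci_high) / 2.0  # point estimate (midpoint of the CI)
se_from_ci = (ci_high - ci_low) / (2.0 * z975)  # SE implied by the CI
se = 4.5  # rounded value used in the post

# Two-sided P-value for the hypothesis of zero difference.
p_value = 2.0 * (1.0 - NormalDist().cdf(abs(B) / se))

# Proposed default posterior: normal with mean B/2 and sd se/sqrt(2).
posterior = NormalDist(mu=B / 2.0, sigma=se / sqrt(2.0))
cred_low = posterior.mean - z975 * posterior.stdev
cred_high = posterior.mean + z975 * posterior.stdev
prob_positive = 1.0 - posterior.cdf(0.0)

print(f"estimate = {B}, SE implied by CI = {se_from_ci:.2f}")
print(f"P-value = {p_value:.2f}")
print(f"default credible interval = [{cred_low:.1f}, {cred_high:.1f}]")
print(f"P(true difference > 0) = {prob_positive:.3f}")
```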
I would say it isn’t correct (see earlier posts). The interpretation is different even if the limits of the interval are the same. Things don’t become the same just by being the same size. My rabbit weighs the same as my cat.
Simon: In my experience, researchers tend to interpret the “0.95” of a CI as a degree of certainty in much the same way they would interpret the “0.5” of a fair coin toss. Don’t you think so?
I agree that people interpret it that way. I’m just saying that the interpretations of frequentist confidence intervals and Bayesian credible intervals are not the same, even if they are the same size.
True. But my point is that when people interpret CIs that way, then they’re (implicitly or explicitly) assuming the flat prior. However, that prior is often completely unrealistic. In particular, this is the case when estimating a regression coefficient (other than the intercept) in the context of the life sciences.
Since people are always going to interpret interval estimates as probability statements, I think that those intervals should be based on reasonable priors.
The compatibility interpretation of P-values and their derived “confidence” intervals goes way back. I haven’t done a historical study but I think DR Cox mentioned it long ago somewhere (probably where I got it) and the idea if not the word can be found in his, Good’s, and some of Fisher’s writings on the meaning of “significance levels” (P-values, not alpha levels). “Compatibility” was thus used a fair bit in
Greenland, S., Senn, S.J., Rothman, K.J., Carlin, J.C., Poole, C., Goodman, S.N., and Altman, D.G. (2016). Statistical tests, confidence intervals, and power: A guide to misinterpretations. The American Statistician, 70, online supplement 1 available free at http://amstat.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108/suppl_file/utas_a_1154108_sm5368.pdf
which was reprinted in the European Journal of Epidemiology 2016, 31, 337-350 and available free from that website as well. The evidential interpretation of compatibility using the Shannon information measure (S-value) s = log2(1/p) = -log2(p) is discussed on p. 642 of
Greenland, S. (2017). The need for cognitive science in methodology. American Journal of Epidemiology, 186, 639-645, available free at
see also p. 10 in this forthcoming article (posted earlier):
Amrhein, V., Trafimow, D., and Greenland, S. (2018). Inferential statistics as descriptive statistics: There is no replication crisis if we don’t expect replication. The American Statistician, 72, available free at https://peerj.com/preprints/26857/
Further details about compatibility, refutational information, and S-values are in this forthcoming article:
Greenland, S. (2018). Some misleading criticisms of P-values and their resolution with S-values. The American Statistician, 72, which will be free as well.
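The S-value transform mentioned above, s = log2(1/p) = -log2(p), is simple enough to sketch in a few lines. The function name s_value and the example P-values are illustrative choices, not taken from the cited papers. The result is in bits: s bits of information against the tested model is about as surprising as getting s heads in a row from a coin assumed to be fair.

```python
from math import log2


def s_value(p):
    """Shannon information (in bits) against the test model, given P-value p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return -log2(p)


# Illustrative P-values only.
for p in (0.4, 0.05, 0.005):
    print(f"p = {p}: s = {s_value(p):.2f} bits")
```

On this scale p = 0.05 is a little over 4 bits, i.e. not much more surprising than four heads in a row.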
Then I should hope you both are in full agreement with renaming “confidence intervals” to something less likely to evoke posterior intervals, like “compatibility intervals”.
More generally, there needs to be an appreciation of how ill-considered some of these traditions and terms have turned out to be, and less hesitance about using reformed concepts and terms (while making clear the mapping from the traditional to the reformed).
Sander: I’m not so sure. I’m assuming the term “confidence” was originally intended to stay away from probability, chance or anything like that. Well, that didn’t work!
I’m afraid people will always interpret interval estimates as probability statements, no matter what they’re called. While better terminology is certainly important, I believe that better intervals are more important. In my opinion B/2 +/- 1.96 SE/sqrt(2) is a lot better than B +/- 1.96 SE. At least for regression coefficients (other than the intercept) in the context of the life sciences.
Unclear: You say “I’m not so sure” and then “Well, that didn’t work!”
There is no way to be sure about any of this, but it seems pretty damn clear that just trying to get people to use the traditional terms properly (which for “confidence” has been underway since at least Rothman’s editorials in the 1970s, when I was a student!) has failed utterly, spectacularly, and miserably, and that a surefire way to prevent improvement is to stay that course.
This blog page is about ways to move forward in verbal description of basic concepts, so do you have a terminological (not methodological) suggestion that you can argue is better than replacing “confidence” with “compatibility”?
I agree with this statement. It seems most of the focus has been on translating results into language. It may also be helpful to provide a guide that does the opposite, by taking commonly used (or misused) linguistic statements and providing examples of corresponding statistical modalities. Even if commonly used language does not actually have any meaning in frequentist statistics, pointing this out will still help to educate and avoid error.