Language for communicating frequentist results about treatment effects


Sander, what language would you prefer to use in research settings where the primary purpose of the analysis is to make a decision based on a comparison of observed p to a justified α?

In much of cell biology (basic biomedical but far from clinical), researchers state alpha up front but then do not seem to actually use it as a hard decision rule. Comparing contemporary papers with those from before 2000 (and especially before 1990), the things that seem to guide decision making are no different today than 30 years ago. The difference is that in today’s papers results are more often quantified in a way that supports a t-test or ANOVA (like the results of various blots), and there are 100s of reported n.s., *, **, and *** (and even ****) per paper, compared to no asterisks and no t-tests (or maybe 1 or 2) 30 years ago. Even today, the “decisions” that the researchers are making do not seem to be guided much by the statistics. Basically, if it looks like an effect (based on the plot of the data), and the effect is in the direction predicted by their working model of how the system works, they probe a little deeper with different experiments.

So even though these researchers state an alpha, in a sense, the practice is a bit more like Fisher than N-P (since they are not using p-values as hard decision rules) but instead of reporting p-values, they’ve quart- or quint-chotomized the p-value into n.s., *, **, ***, and ****.
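For concreteness, that quantization can be sketched as a simple mapping (the thresholds assumed here are the common 0.05/0.01/0.001/0.0001 cutoffs used by packages such as GraphPad Prism; individual journals may differ):

```python
# Conventional asterisk coding of a continuous p-value into five coarse
# bins, illustrating the "quint-chotomization" described above.
def asterisk_code(p: float) -> str:
    """Map a p-value to the usual n.s./*/**/***/**** labels."""
    if p < 0.0001:
        return "****"
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "n.s."

for p in [0.20, 0.04, 0.004, 0.0004, 0.00004]:
    print(p, asterisk_code(p))
```

The point of the mapping is also its problem: every p-value between 0.01 and 0.05, say, is reported identically, so the reader cannot recover p from the asterisks.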

Back to the original question - most of this is irrelevant in cell biology because a typical paper reports > 100 p-values or sets of asterisks and for the most part, they are simply reported and not explicitly interpreted. So how could the reporting in cell biology papers be improved?

  1. drop the initial statement of alpha
  2. report p-value and not asterisk
  3. drop all uses of the word “significant”


Bayesian-frequentist mapping is only available in specific situations, and even when the mapping is simple I don’t like to think this way because frequentists do not like the Bayesian interpretation.

Frequentist methods cannot handle this in general, because hierarchical models do not cover the full spectrum of models. Compare for example with copula multivariate models for multiple outcomes of different types, where the types cannot be connected by a common random effects distribution without at least having scaling issues.

I think our fundamental disagreement, which will not be fixable no matter how long the discussion, is that you feel you are in a position to question any choice of prior distribution, and you are indirectly stating that a prior must be “right”. I only believe that a prior must be agreed upon, not “right”, and I believe that even a prior that is a bit off to one reviewer will most often provide a posterior distribution that is defensible and certainly is more actionable.

Your point about Bayesian analysis depending on prior + model is of course correct. But Bayes in total makes fewer assumptions than frequentist. That’s because frequentist requires a whole other step to translate evidence about data to optimal decisions about parameters. This requires many assumptions about sample space that are more difficult to consider than changing a Bayesian prior. But let’s stick to optimizing frequentist language here for now.



I’d walk this back even further. Whether the prior is “agreed upon” is far less important than that it constitutes objective knowledge which is therefore criticizable. That is, an explicit prior enriches and enlarges the critical, scientific discussion we can have together. I suspect that one reason this conversation is not moving toward resolution is that it has been framed through a question about alternative ways to make statements that have precisely the opposite character: statements about statistical ‘results’ that are meant to close off critical discourse!



This is brilliant and extremely useful. Just one suggestion: in the negative case, you could make the correct statement more general by expanding the sentence
“More data will be needed.”
to
“More data, more accurate data, or more relevant data will be needed.”

  • accurate data to emphasise measurement precision
  • relevant data to emphasise the link between data and the hypothesis being tested


David, I agree with that for the majority of cases, but not all. I think that evidence that is intended to convince a skeptic can validly use a skeptical prior that incorporates no prior information. This is essentially Spiegelhalter’s approach. I do want to heartily endorse the criticizable part.

For sure - and I think that claiming ‘statistical significance’ or lack thereof in a medical paper is a great way to shut down thinking!



Zad: In that situation how about reporting the decision rule with α in the methods section, then the P-value and decision in the results, as recommended (for example) by Lehmann in the mid-20th century bible of NP theory, Testing Statistical Hypotheses? Many authors in that (my grandparents!) generation used the term “significance” to describe p (Fisher, Cox) and others (like Lehmann) used it to describe α, so presciently enough Neyman when I knew him avoided using the word for either. I avoid it, Frank avoids it, and anyone can: p is the P-value, α is the alpha-level, and the decision is reported as “rejected by our criterion” or “not rejected by our criterion” (please, not “accepted”!). Even in that decision situation, Lehmann, Cox and many others advised reporting p precisely, so that a reader could apply their own α or see how sensitive the decision was to α.

As for “uncertainty interval”, sorry but no, because:

  1. that is already used by some Bayesians to describe posterior intervals, and rightly so because
  2. it’s yet another misuse of a word: “uncertainty” is a subjective observer state, whereas the P-value and whether it is above or below α (as shown by the interval for multiple points) is just a computational fact, regardless of anyone’s uncertainty; in fact
  3. no interval estimate I see in my field captures anywhere near the uncertainty warranted (e.g., warranted by calibration against a credible model) - they leave out tons of unknown parameters. That means those “left out” parameters have been set to default (usually null) values invisibly to the user and consumer. (In my work there is typically around 9 unknown measurement-error parameters per association; only on rare occasions do even a few get modeled.)

At best a compatibility (“confidence”) interval is just a summary of uncertainty due to random error alone, which is to say the conditional uncertainty left after assuming the model used to compute the interval is correct (or wrong only in harmless ways). But that’s an assumption I would never believe or rely on in my work, nor in any example from Gelman’s area.

My conclusion is that using “uncertainty” in place of “confidence” is just substituting a bad choice for a horrific one.



So I’m assuming from your response that you’d be fine with describing them as compatibility intervals?



Yes, until somebody convinces me that term is misleading for nontechnical readers relative to the ordinary English meaning of “compatibility” (compared to the extremely misleading evocations of “confidence” and “uncertainty”).



We need some sort of ASA guidelines on the language for reporting/interpreting statistical results.



The “specific situations” cover every single problem I encounter in med and health controversies. Every linear, logistic and Cox regression I see has at least a partial-Bayes analog. Conversely every single Bayesian analysis is directly mappable to a random-parameter frequency model, which can be used to calibrate any proposed posterior computation via simulation from that model.
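A minimal sketch of that calibration idea, under assumed toy settings (a normal-normal conjugate model with known variances, not anyone’s actual analysis): treat the prior as a frequency distribution for the parameter, simulate (parameter, data) pairs from that random-parameter model, and check that nominal 95% posterior intervals cover the parameter about 95% of the time.

```python
# Toy calibration of a posterior computation by simulation from the
# corresponding random-parameter frequency model (normal-normal example).
import math
import random

random.seed(1)
tau, sigma, n_sims = 1.0, 1.0, 20_000  # prior SD, data SD, replications
covered = 0
for _ in range(n_sims):
    theta = random.gauss(0.0, tau)   # parameter drawn from the prior
    y = random.gauss(theta, sigma)   # one observation given theta
    # conjugate posterior: N(post_mean, post_var), prior mean 0
    post_var = 1.0 / (1.0 / tau**2 + 1.0 / sigma**2)
    post_mean = post_var * y / sigma**2
    post_sd = math.sqrt(post_var)
    lo, hi = post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd
    covered += (lo <= theta <= hi)

coverage = covered / n_sims
print(round(coverage, 3))  # should land near the nominal 0.95
```

When the posterior computation is correct for the model, the simulated coverage matches the nominal level; a miscoded or misapplied posterior shows up as a coverage discrepancy.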

My impression is that, for some purely psychosocial reasons, extremists on both sides ignore or forget these mappings, which again apply to all the situations I think anyone here (struggling with mere terminology) encounters in real research (not flashy specialty apps in JASA or Biometrika).

My impression is that this failure on both sides is what split statistics into this crazy frequentist/Bayes divide. And that this split has been beyond destructive to thinking, writing, and teaching clearly (inseparable goals) about applied statistics. If the study is at all important and its goal is clear communication of what was observed and what it might portend, both frequentist and Bayesian interpretations need to be brought up. In doing so my ideal (not always attainable perhaps) is to develop an analysis model that can be defended as a useful summary of what is known, both for information extraction and prediction, and thus be acceptable from broad frequentist and Bayesian views. Meaning I listen closely to extremist criticisms of the other side even though I don’t buy their claims of owning the only correct methodology…

But Bayes in total makes fewer assumptions than frequentist. That’s because frequentist requires a whole other step to translate evidence about data to optimal decisions about parameters. This requires many assumptions about sample space that are more difficult to consider than changing a Bayesian prior.
That’s all just flat wrong in my view. Top frequentists I know say just the opposite, e.g., that Bayesian methods add crucial assumptions in the form of their priors and hide all the sampling-model complexities from the consumer in a black box of posterior sampling. Plus to top it off, Bayesian conditioning discards what may be crucial information that the model is wrong or hopelessly oversimplified (see Box onward).

I see all those problems, and I have seen some Bayesians in denial about or blind to them. Of course, I have seen some frequentists in denial about or blind to real problems like the subjective nature of typical sampling models in applications; that makes uncommitted use of frequentist methods as subjunctive as uncommitted use of Bayesian methods (subjunctive, as in hypothetical, conditional, and heavily cautioned).

I think our fundamental disagreement, which will not be fixable no matter how long the discussion, is that you feel you are in a position to question any choice of prior distribution, and you are indirectly stating that a prior must be ‘right’. I only believe that a prior must be agreed upon, not ‘right’, and I believe that even a prior that is a bit off to one reviewer will most often provide a posterior distribution that is defensible and certainly is more actionable.
Sigh, you may be right it isn’t ‘fixable’ but I see that as stemming from the fact that I am not saying all that (I’ve heard it as you have - but not from me). I thus don’t see you as understanding what I see myself as trying to communicate (on the topic of this page): That I see the bad language framing and exclusivity of thought that has plagued all of statistics from some of the highest theory (e.g., Neyman, DeFinetti) to the lowliest TA for a “Stat 1 for nonmath people” course, as well as publications.

You noted these issues go deep - well it’s deeper than the frequentist-Bayes split, which I think has become an artificial obsession based on exclusivist thinking in mid-20th century “thought leaders.” One can only wonder what the field would have been like if (for all sides) common practice had stemmed from Good, instead of from Neyman or Fisher for frequentists and from Jeffreys or DeFinetti for Bayesians. Well thankfully one applied area did have Box to talk sense: Both sides talk about the same models of the world; the only split is that one is obsessing on Pr(data|model) and the other on Pr(model|data). But there are plenty of applications (e.g., all of mine) where you need to consider both (albeit usually at different stages of planning and analysis).

Now, every stakeholder has a right to question your prior and your sampling model (your total model), as well as whatever loss function underlies a decision (explicit or implied, there is always one behind every decision) and whatever technique was used to merge those ingredients to produce P-values, posterior probabilities, or whatever. Isn’t our job as statisticians to help make all those ingredients as precise, transparent, and clear as possible, so that the criticisms can be anticipated and addressed? And to work with our collaborators to present a compelling rationale for our choices - ideally we’d try to make it compelling to anyone, frequentist, Bayes, or other interested party. And we’d do so knowing we may have made some choices that may look poor in light of facts we did not realize, leaving our results open to criticism on these grounds. We only err when we don’t take those criticisms seriously, for that is when we fail to walk back our inferences or decisions when this new information would call for that under the methodologic rules we claim to follow (e.g., deductive consistency).

For better or worse, however, to err is human and we also have to mitigate many errors handed down to us from authorities, like the misuse of English (e.g., using “null” for any hypothesis instead of a zero hypothesis), and the belief that the terms “frequentist” and “Bayesian” should refer to anything other than techniques (as opposed to philosophies) when applying statistics (as opposed to arguing about philosophy).



Try to get a campaign going with Ron Wasserstein, ASA Director (whom I’ve found very open to such issues and ideas). You can tell him I thought that would be a good use of ASA resources if the goal was to improve terminology, not to defend what’s in place.



I agree that P-values should be computed for several hypotheses and, as David Cox agrees, we can compute the probability that the P-value would exceed (or be smaller than) the observed value under varying alternatives. Of course this is the basis for a severity assessment, but it doesn’t matter what it’s called.
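A minimal sketch of that computation, under assumed toy settings (a one-sided z-test with a hypothetical observed statistic z_obs; the alternatives are standardized effects delta): under an alternative with effect delta, Z ~ N(delta, 1), so Pr(p ≤ p_obs | delta) = Pr(Z ≥ z_obs | delta) = 1 − Φ(z_obs − delta).

```python
# Probability of a p-value at least as small as the one observed, under a
# range of alternatives, for a one-sided z-test. delta = 0 recovers p_obs.
from math import erf, sqrt

def Phi(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

z_obs = 1.8  # hypothetical observed test statistic
for delta in [0.0, 0.5, 1.0, 1.8, 2.5]:
    prob = 1.0 - Phi(z_obs - delta)
    print(f"delta={delta:3.1f}  Pr(p <= p_obs) = {prob:.3f}")
```

Scanning this probability across alternatives is what lets a reader judge which effect sizes the observed result discriminates against, whatever name one gives the exercise.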



I think the ASA should encourage all distinct views to weigh in and not set itself up as giving pronouncements on interpretations, especially when the input comes from a select group with certain aims, and particularly as there’s non-trivial disagreement (as is seen here). I agree with David Cox who maintains that the ASA (and RSS) should be discussion forums, encouraging many different perspectives, rather than dictating a single interpretation or position.



So you will keep P-values but banish attained level of statistical significance? (Erich Lehmann at times calls the P-value the “critical level,” and someplace, I believe, the “significance probability,” which would really confuse.) For Cox, the (observed) level of statistical significance (the P-value) reflects the level of incompatibility or inconsistency with the null hypothesis. Nor does he think it necessary to add “observed”.



You mean banish the term “attained significance level”? The quantity is useful (e.g., for decision rules) but Fisher had a less misleading term for it: P-value. So yes, like “the aether” and “consumption” (for tuberculosis), any statistical term with “significance” needs to be retired to historical notes.

Those that institutionalized this destructive “significance” terminology were brilliant conceptually and mathematically gifted, and their students were at least the latter (it was a tiny elite that made it into their tutelage). Those authorities could be pretty sloppy with words and still not be misled by the methods, since they were good at mapping between math and context. For example, Fisher and Neyman knew that any decision parameter like alpha should depend on context.

But the vast majority of users and readers have only the words to hang on to - the math is and always will be a contentless abstraction for them. Our statistical heroes past, no matter how brilliant, were simply unaware of this problem (and really, how could they be?). That does not mean there is no problem: There obviously is a crisis in understanding the meaning of statistics. In dealing with it we need to retire bad traditions, including bad terminology, and not replace them with bad new ones. That’s part of statistics reform.

As for “observed P-value”, “observed” is needed to distinguish the one-dataset Fisherian P-value p that Cox calls “significance level” (a conditional data probability) from the Neymanian P-value P which is a uniform random variable (e.g., see Kuffner & Walker, TAS 2018). Like “significance level”, another example of different authorities using the same term for different concepts. No wonder researchers began mixing up p and alpha, and some statisticians forgot that a P-value degrades information if not calibrated to (uniform under) the assumptions it is computed from (e.g., posterior predictive P-values).
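That calibration property is easy to check by simulation (a toy sketch, assuming a two-sided z-test with all of its assumptions holding): when the P-value is computed from the same model that generated the data, it behaves as a Uniform(0,1) random variable.

```python
# Simulate two-sided z-test p-values under the null and check rough
# uniformity: mean near 0.5, and about 5% of p-values below 0.05.
import random
from math import erf, sqrt

random.seed(2)

def Phi(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

ps = [2.0 * (1.0 - Phi(abs(random.gauss(0.0, 1.0)))) for _ in range(50_000)]

mean_p = sum(ps) / len(ps)                    # uniform => about 0.5
frac05 = sum(p < 0.05 for p in ps) / len(ps)  # uniform => about 0.05
print(round(mean_p, 3), round(frac05, 3))
```

A P-value computed under a model that does not generate the data (the posterior predictive case mentioned above is one example) fails this uniformity check, which is the sense in which it degrades information.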



I don’t think we need guidelines as much as a few resources that discuss this in more depth, covering the problematic usage and history of the commonly used terms/phrases.

I do wonder if Ron would be open to a collaboration (it doesn’t have to be ASA-related) to create a resource where common usage of phrases/terms is discussed.

I appreciate this discussion and believe much of it has been incredibly insightful, but it’s also incredibly restricted, visited only by those who are already interested in proper statistics. Some of the problems that have been brought up here, such as the damage done by the terms “significance” and “confidence”, have not really been brought up elsewhere.

For example, the ASA statement and its supplements, including @Sander’s, discuss some of the problems with dichotomization and misinterpretations, and Sander’s recent papers on P-value behavior also discuss some of the discrepancies in the definition of “significance” etc., but it doesn’t seem that an argument (similar to the one made in this forum by @f2harrell or @Sander) on why usage of these terms is detrimental to statistics and science has been fleshed out in any sort of paper that can actually be referenced.

If there is one, I would appreciate a link. I think this is something worth thinking about because if I were to go on a universal resource like Wikipedia and attempt to make some edits to add the phrase “compatibility interval” to the entry on confidence intervals, it is very likely that my suggestions would get rejected without some sort of authoritative reference.



I can get behind these suggestions and completely avoid the term “significance” (even in a hypothesis testing framework) by sticking to “xx hypothesis is/is not rejected by our prespecified criterion” and behind “compatibility intervals.” I’ve edited my posts near the top of the forum to reflect these thoughts.



I’m very late to this party, but still want to offer my 2 cents.

95% confidence does not seem to be such a useful property because it refers to the procedure rather than the observed data. So it’s difficult to fault researchers for interpreting confidence intervals as if they are credibility (or posterior) intervals. What else are they supposed to do with them?

Better terminology is certainly important, but I think it’s even more important to make better intervals.



Not at all with my collaborations over the years. A fundamental reason why there is no 1-1 equivalence between Bayes and frequentists in many situations is that there is no agreement among frequentists on how to handle multiplicity. Areas in which 1-1 equivalence is impossible, or at least very difficult, are sequential testing and other situations with stochastic N, and multiple outcomes. For multiple outcomes, e.g., a Bayesian posterior probability that a treatment makes a major improvement on two specific outcomes, a moderate improvement on one other target, or any improvement in mortality, it’s easy to say that there is an equivalence but no one has shown how to derive it.
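For concreteness, the Bayesian side of such a compound statement is mechanically simple given joint posterior draws. This is only a hedged sketch with made-up effect distributions and (unrealistically) independent draws; real analyses would use correlated draws from MCMC, and the thresholds and the reading of the compound event (taken here as a conjunction) are purely illustrative.

```python
# Posterior probability of a compound event as the fraction of posterior
# draws satisfying it (hypothetical effects and thresholds throughout).
import random

random.seed(3)
n_draws = 10_000
hits = 0
for _ in range(n_draws):
    eff1 = random.gauss(0.6, 0.3)  # hypothetical posterior draw, outcome 1
    eff2 = random.gauss(0.5, 0.3)  # outcome 2
    eff3 = random.gauss(0.3, 0.2)  # outcome 3
    mort = random.gauss(0.1, 0.2)  # mortality effect
    major = eff1 > 0.5 and eff2 > 0.5  # "major" improvement on both
    moderate = eff3 > 0.25             # "moderate" on the third
    hits += (major and moderate and mort > 0.0)

prob = hits / n_draws
print(round(prob, 3))  # posterior probability of the compound event
```

The asymmetry is the point: nothing this direct exists on the frequentist side for such a compound claim, which is why asserting an equivalence is easier than deriving one.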

That’s not correct in my opinion. The big new assumption of Bayes is the prior, and the prior is in plain sight and criticizable. You are avoiding decision making and not recognizing how concordant Bayesian posterior inference is with decision making. Sample space considerations are very hard to be transparent about, and often very hard to take into account. The idea that frequentists can make coherent decisions with calculations about data is a hard sell. This whole argument is caused by fundamental problems in indirect reasoning.

I like everything else you said.

Let’s stop the Bayesian debate at this point and try to come to the best language for communicating and interpreting frequentist results that might be used in the medical literature. Your feedback on the “more agnostic” edits I made yesterday would help us all move forward in getting more reasonable language.



I’d suggest you reread some of the posts in this discussion.