Language for communicating frequentist results about treatment effects

That’s my take. Since frequentists don’t believe that probabilities are associated with model parameters, whether you can find special cases of equivalence doesn’t matter very much. You have to go “all in” with Bayes in order to get the interpretation clinicians seek, e.g. the probability is 0.95 that the unknown treatment effect is between -5 and 13 mmHg. Also note that the special case breaks down entirely if the sample size is not fixed in advance or if there is more than one look at the data. Bayesians have neither a way nor a need to adjust for multiplicity of this kind.

Simon, I’m so glad to be reminded about this super applicable paper.

For a 95% CI, how about the straightforward interpretation of a prediction interval:

“There is an 83% chance a replication with the same N will obtain a result within this interval”

To my mind, this is the most useful interpretation. Note the capture rate is not 95% because both estimates are subject to sampling error. I think that means that as the sample size for the replication grows towards infinity, the expected capture rate grows toward the confidence level.
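The 83% figure can be checked with a short calculation. This is a minimal sketch, assuming a normally distributed estimate with known standard error; the function name `capture_probability` and the `n_ratio` parameter (replication sample size divided by original sample size) are illustrative and not taken from the cited paper.

```python
from math import sqrt
from statistics import NormalDist

def capture_probability(level=0.95, n_ratio=1.0):
    """Chance that a replication estimate lands inside the original CI.

    Assumes both estimates are normal around the same true value.
    n_ratio = (replication N) / (original N); hypothetical parameter.
    """
    nd = NormalDist()
    z = nd.inv_cdf(0.5 + level / 2)          # e.g. 1.96 for a 95% CI
    # The difference between the two estimates has variance
    # se^2 * (1 + 1/n_ratio), so the CI half-width z*se is rescaled.
    scaled = z / sqrt(1 + 1 / n_ratio)
    return 2 * nd.cdf(scaled) - 1

print(round(capture_probability(), 3))               # same-size replication
print(round(capture_probability(n_ratio=1e6), 3))    # very large replication
```

With a same-size replication the result is about 0.83, and it approaches the nominal 0.95 as `n_ratio` grows, which matches the conjecture above.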

Reference is Cumming & Maillardet, 2006.

I’m curious if you object to these or not. I’m also curious if you think the same statements can be made about Bayesian credible intervals. I can’t quite wrap my head around it, but I think Bayesian credible intervals are not prediction intervals.

I don’t feel that will be quite satisfying for the majority of collaborations. And I think that evidentiary statements need to be based on the parameter space, not the data space.

For Bayes this is quite clean: draw samples from the posterior predictive distribution. But again, the data space is not really what’s called for IMHO.

I’m sorry, but I’m not sure if I understand the response.

For myself, as a scientist, I am quite happy to use CIs as prediction intervals: they help calibrate my expectations of what will happen if my lab (or another) replicates the procedure. When the CI is very long, I must sadly admit that the replication could provide almost any outcome and that my current work provides almost no guidance in what to expect. When the CI is very short I can make a relatively precise prediction that is helpful for planning the replication and interpreting the outcome. I guess every scientist is different, but I want nothing more than to properly calibrate my expectations about future research outcomes.

Could you expand a bit on the Bayesian credible interval as a prediction interval? I know Bayesians don’t define probability in terms of frequency. So does a credible interval have any meaningful interpretation in terms of expected frequencies of replication outcomes? Do the qualities of a credible interval as a prediction interval depend strongly on the prior?

I think that’s a very good use of the prediction interval. But I don’t see it as useful for making an inference about the unknown effect.

Right, Bayesian methods don’t use the frequency interpretation of probability very often. I wasn’t speaking of using a credible interval for a prediction interval. A credible interval is a posterior distribution interval, which is a statement about an unknown parameter. On the other hand, you can compute the Bayesian posterior predictive distribution to simulate new data or new group means based on the current signal about the unknown parameters, and the prior.
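The distinction between a posterior interval for the parameter and a posterior predictive distribution for new data can be sketched in a few lines. This is only an illustration, assuming a single normal mean with known sampling SD and a conjugate normal prior; the function name and all numeric inputs are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev

def posterior_predictive_means(ybar, sigma, n, prior_mean, prior_sd,
                               n_new, draws=10000, seed=1):
    """Simulate means of a future group of size n_new (normal model, sigma known)."""
    rng = random.Random(seed)
    # Conjugate normal posterior for the unknown mean mu.
    prior_prec = 1 / prior_sd ** 2
    data_prec = n / sigma ** 2
    post_var = 1 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * ybar)
    sims = []
    for _ in range(draws):
        mu = rng.gauss(post_mean, sqrt(post_var))        # parameter uncertainty
        sims.append(rng.gauss(mu, sigma / sqrt(n_new)))  # noise in the new sample
    return sims

# Hypothetical inputs: observed mean 3 mmHg, sigma 10, n = 50, vague prior.
sims = posterior_predictive_means(3.0, 10.0, 50, 0.0, 100.0, n_new=50)
print(round(mean(sims), 2), round(stdev(sims), 2))
```

The simulated new means are more spread out than the posterior for the parameter itself, because they carry both the parameter uncertainty and the sampling noise of the future group.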

This is a very insightful and helpful thread. I wanted to jump in with a quick question which I often encounter. In the drug trial for BP reduction, the effect for B-A was, say, 3 mmHg with a 0.95 confidence interval of (-1, 6) and a p-value of 0.06. How should we interpret this? Does the interpretation that “we did not find evidence against the null hypothesis of A=B” still hold in this case? What if the point estimate was 6 mmHg, 0.95 CI (-1, 13) and p-value 0.06? Thank you!

For the record, I disagree with a lot of what is given here.
To start, I don’t think this statement is perfectly accurate:
“We were unable to find evidence against the hypothesis that A=B (p=0.4) with the current sample size.”
The statement claims zero evidence against the tested hypothesis, whereas I would say instead the p=0.4 constitutes almost no evidence against that hypothesis (by the Shannon surprisal measure I use, p=0.4 supplies s = log2(1/0.4) = 1.3 bits of information against the hypothesis - barely more information than in one coin toss). The sample size addition is unnecessary if harmless, as p combines (some would say confounds) sample size and estimate size and a larger estimate would have likely produced a smaller p at the same sample size (only “likely” because exceptions can occur when the standard error inflates faster than the estimate at a given sample size).

Conversely, I think it is misleading to say “The study found evidence against the hypothesis that A=B (p=0.02)” because there is no magic point at which P-values start to “find evidence” against hypotheses. Instead one could say “The study found log2(1/.02) = 5.6 bits of information against the hypothesis that A=B.” The ultimate goal is to get away from dichotomies like “negative” and “positive” trials - those are all experimental tests of the hypothesis, and their P-values measure only the amount of information each supplies against that hypothesis.
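The S-value arithmetic quoted above is easy to reproduce. A minimal sketch; the function name `s_value` is illustrative.

```python
from math import log2

def s_value(p):
    """Shannon surprisal (bits of information against the tested hypothesis)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -log2(p)

for p in (0.4, 0.05, 0.02):
    print(p, round(s_value(p), 1))
```

A value of s bits corresponds to the surprise of seeing s heads in a row from a coin assumed to be fair, which is why p=0.4 reads as barely more than one coin toss.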

I might agree with the sentiment behind “As the statistical analysis plan specified a frequentist approach, the study did not provide evidence of similarity of A and B”, but it also seems wrong as stated because

  1. it seems to confuse the pre-specification problem with the fact that P-values do not (in isolation, at least) measure support, period, only degree of conflict (incompatibility) between models and data, and
  2. a frequentist can assess similarity indirectly by specifying a similarity (equivalence) interval and seeing whether there is more than a given amount of evidence against the true difference being outside that interval. Thus I would restate it as something like “As the statistical analysis plan did not specify an interval of equivalence, we did not assess similarity of A and B.”
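The indirect similarity assessment described in point 2 is commonly carried out as two one-sided tests. The sketch below assumes a normal approximation and a symmetric equivalence interval (-margin, +margin); the margin is a clinical choice that must be pre-specified, and the numbers used here are hypothetical.

```python
from statistics import NormalDist

def equivalence_p_value(estimate, se, margin):
    """P-value for the hypothesis that the true difference lies outside
    (-margin, +margin), using two one-sided tests (normal approximation)."""
    nd = NormalDist()
    p_lower = 1 - nd.cdf((estimate + margin) / se)  # tests difference <= -margin
    p_upper = nd.cdf((estimate - margin) / se)      # tests difference >= +margin
    return max(p_lower, p_upper)

# Hypothetical: estimated difference 3 mmHg, SE 2, equivalence margin 5 mmHg.
print(round(equivalence_p_value(3.0, 2.0, 5.0), 3))
```

A small value indicates much information against non-equivalence; as elsewhere in this thread, it is better reported as a continuous quantity than compared with a cutoff.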

I also think this statement commits a potentially important if standard mistake of omission:
“Assuming the study’s experimental design and sampling scheme, the probability is 0.4 that another study would yield a test statistic for comparing two means that is more impressive than what we observed in our study, if treatment B had exactly the same true mean as treatment A.” It needs amendment to something like
“Assuming the study’s sampling scheme, experimental design, and analysis protocol, the probability is 0.4 that another study would yield a test statistic for comparing two means that is as or more impressive than what we observed in our study, if treatment B had exactly the same true mean as treatment A and all statistical modeling assumptions used to get p are correct or harmless.” That’s because almost all statistics I see in my areas are derived from regression models (whether for the outcome, treatment, or both). Sorry if the added conditions seem nit-picky, but it is not hard to find examples where their failure has a non-negligible effect.

Next, I think this statement is commonly believed and simply wrong from an information perspective: “we cannot interpret the interval except for its properties in the long run.” No: As noted just below that statement, the 95% interval can be interpreted for this single data analysis as the interval of all parameter values for which p>0.05. Thus the interval shows the parameter values for which the data supply less than log2(1/0.05) = 4.3 bits of information against them, given the model used to compute the interval. This interpretation does rely on the repeated-sampling property that, given the model, the random P across studies is uniform at the correct value; this property ensures that the S-value captures the refutational information in the test statistic (note that posterior predictive P-values are not uniform and thus do not sustain this interpretation).
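The single-study reading of the interval can be made concrete by evaluating the P-value at several hypothesized parameter values. This is a sketch under a normal approximation; the estimate and standard error are hypothetical, and the function names are illustrative.

```python
from statistics import NormalDist

def p_value(estimate, se, hypothesized):
    """Two-sided P-value for a hypothesized parameter value (normal model)."""
    z = abs(estimate - hypothesized) / se
    return 2 * (1 - NormalDist().cdf(z))

def interval(estimate, se, alpha=0.05):
    """All hypothesized values with p > alpha, i.e. the 1 - alpha interval."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * se, estimate + z * se

# Hypothetical: estimated difference 3 mmHg with standard error 2.
lo, hi = interval(3.0, 2.0)
for value in (lo, 0.0, 3.0, hi, hi + 1):
    print(round(value, 2), round(p_value(3.0, 2.0, value), 3))
```

Values inside the limits have p above 0.05 (fewer than 4.3 bits against them), the limits themselves have p equal to 0.05, and values outside have less.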

Finally, if not clear from the above, I disagree that P-values are wisely dispensed with in favor of confidence intervals. Confidence intervals invite the distortion of 0.05-level dichotomization. The problem with P-values is that almost no one computes them for more than the null hypothesis. They should instead be given not only for the targeted “null” but also at least for one more hypothesis, including the protocol-specified alternative used to compute power or sample size.


I believe @Sander raises some important points, especially this,

The common way to mention p is to discuss the assumption of the null hypothesis being true, but few definitions mention that every model assumption used to calculate p must be correct, including assumptions about randomization (assignment, sampling), chance alone, database errors, etc.

and also this point,

I don’t believe abandoning p values is really helpful. Sure they are easy to misinterpret, but that doesn’t mean we abandon them. Perhaps instead, we can encourage the following guidelines:

  • thinking of them as continuous measures of compatibility between the data and the model used to compute them. Larger p = higher compatibility with the model, smaller p = less compatibility with the model

  • converting them into S values to find how much information is embedded in the test statistic computed from the model, which supplies information against the test hypothesis

  • calculating the p value for the alternative hypothesis, and the S value for that too

I also can’t say I agree with the ideas here,

If we are to be honest, the concepts of statistical power, statistical significance, and hypothesis testing are likely not going to disappear in favor of estimation, especially in many experimental fields, so I don’t believe it’s very helpful that we abandon the usage of these phrases.

Better that we emphasize teaching what statistical significance means, and how it differs from something clinically significant etc. The meanings and interpretations of statistical significance and non-significance should be taught rather than abandoned out of fear of misinterpretation. Attempting to avoid such usage may actually backfire in the long run.

The ideas that p-values are redeemable are not convincing to me. They cause far more damage than benefit. So we’ll have to strongly disagree about this. I agree with many of Sander’s other points, but this one, though technically correct, is not as helpful to the researcher as it appears because it is not phrased in terms of the underlying inference/research question:

This comment is not quite on the mark in the original context because I was meaning to address the most common situation in which researchers and journals are misusing efficacy tests and do not have a similarity or non-inferiority test in mind:

The statement Sander makes with which I agree the most is the idea that there is no cutoff for when evidence is present vs. absent. I need to change the language to reflect this.

Over the next days I plan to edit the original post to reflect some, but not all, of the criticisms. This will be an incremental process as we seek to find acceptable wording that is short enough to actually be included in an article’s abstract.


“The ideas that p-values are redeemable are not convincing to me. They cause far more damage than benefit.” No, the misuse of P-values as “significance tests” is what causes the damage, especially when they are applied to only one hypothesis (this misuse was an “innovation” of Karl Pearson and Fisher; before them “significance” referred to a directional Bayesian tail probability, not a data-tail probability).

Neymanian confidence limits show the two points at which p=0.05, which is fine if one can picture how p smoothly varies beyond and between those points. But few users seem able to do that. Plus 95% confidence intervals have the disadvantage of relying on the misleading 0.05 dichotomy whereas at least P-values can be presented with no mention of it. No surprise then that in the med literature I constantly see confidence intervals used only as means to mis-declare presence or absence of an association. This abuse is aggravated by the overconfidence produced by the horrendous adjective “confidence” being used to describe what is merely a summary of a P-value function, and thus only displays compatibility; for that abuse of English, Neyman is as culpable for the current mess as Fisher is for abuse of the terms “significance” and “null”.

The point is that confidence intervals are just another take on the same basic statistical model, and as subject to abuse as P-values. Thus if one can’t “redeem” P-values then one should face up to reality and not “redeem” confidence intervals either. That’s exactly what Trafimow & Marks did (to their credit) when they banned both p-values and confidence intervals from their journal. Unfortunately for such bold moves, the alternatives are just as subject to abuse, and some like Bayesian tests based on null-spiked priors only make the problems worse.

The core problems are, however, created by erroneous and oversimplifying tutorials, books (especially of the “made easy” sort, which seems a code for “made wrong”), teachers, users, reviewers, and editors. Blaming the methods is just evading the real problems, which are the profound limits of human competency and the perverse incentives for analysis and publication. Without tackling these root psychosocial problems, superficial solutions like replacing one statistical summary with another can have only very limited impact (I say that even though I still promote the superficial solution of replacing P-values with their Shannon information transform, the S-value -log2(p), which you, Frank, continue to ignore in your responses).


I can’t disagree, but in my opinion when foundational problems with a method lead to the majority of those using it to misuse it, I don’t care so much whether it’s the idea or the people. This is related to Don Berry’s story on why he became Bayesian after failing to teach bright statistics graduate students what a confidence interval really means. He concluded that the foundation is defective.

An excellent point. Richard McElreath in his book Statistical Rethinking made the point by always using 0.88 or some such credible intervals, instead of 0.95. We use 0.95 as a knee-jerk default and don’t think often enough of its arbitrariness. At least for normal-theory methods you can derive an interval for any confidence level if you know the 0.95 interval. I caution researchers to not use “does the interval contain zero” thinking, but the temptation is real.
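The normal-theory conversion between confidence levels works by recovering the standard error from the known limits and rescaling. A minimal sketch, assuming a symmetric interval on the scale where the estimate is approximately normal; `rescale_interval` is an illustrative name.

```python
from statistics import NormalDist

def rescale_interval(lower, upper, new_level, old_level=0.95):
    """Convert a symmetric normal-theory interval to another confidence level."""
    nd = NormalDist()
    center = (lower + upper) / 2
    # Recover the standard error from the width of the known interval.
    se = (upper - lower) / (2 * nd.inv_cdf(0.5 + old_level / 2))
    z_new = nd.inv_cdf(0.5 + new_level / 2)
    return center - z_new * se, center + z_new * se

# The 0.95 interval (-1, 13) from the question earlier in the thread.
for level in (0.5, 0.8, 0.95, 0.99):
    lo, hi = rescale_interval(-1.0, 13.0, level)
    print(level, round(lo, 2), round(hi, 2))
```

Printing several levels side by side is one way to weaken the pull of the single 0.95 cutoff.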

As you know I’d go with Bayesian posterior inference, but I think your statement goes too far. Confidence intervals have problems but are pragmatic solutions that at least move in the right direction and contain far more information than p-values.

Very true, but IMHO does not fully cover the foundational pitfalls.

I don’t dislike this in any sense—I’m just not used to it enough yet. I use Shannon information when teaching the pitfalls of dichotomization of measurements (e.g., “hypertension” has 1 bit of information, and systolic blood pressure as usually measured has 4).

Thanks for the thought provoking interchange.


Frank, as an example of what I mean, take a look at the article by Brown et al. JAMA 2017;317:1544, whose abstract reports the following results:
“adjusted HR [Cox-model hazard ratio] 1.59 [95% CI, 1.17-2.17]). After inverse probability of treatment weighting based on the high-dimensional propensity score, the association was not significant (HR, 1.61 [95% CI, 0.997-2.59])”
from which it offers in the conclusion:
“in utero serotonergic antidepressant exposure compared with no exposure
was not associated with autism spectrum disorder in the child.”
I see this kind of nonsense all the time in leading medical journals. I conclude that confidence intervals are no better than significance tests for stopping serious statistics abuse. And I have seen Bayesian tests and intervals in these sorts of venues get abused in exactly the same sort of way. This is a psychosocial issue that statisticians have shown themselves ill-equipped to deal with in trying to shift the problem to some sort of foundational question. Why? Because it’s not a statistical issue, it’s a psychosocial one stemming from people and the system they’ve created. That’s why changing methods will only move this inference dustpile around under the statistical rug - people will still find ways to distort information using biased narrative descriptions (whether to hide unwanted results or trumpet dubious discoveries). Mitigation will take better reviewing and monitoring of narratives (not just methods) than is currently the norm (and that will be fought hard by those most committed to past distortions, like JAMA).
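How close the two reported results are can be seen by backing out approximate P-values and S-values from the published estimates and limits. This sketch assumes the intervals are symmetric on the log hazard-ratio scale and ignores rounding in the reported figures, so the results are approximations; the function name is illustrative.

```python
from math import log, log2
from statistics import NormalDist

def p_from_ratio_ci(estimate, lower, upper, level=0.95, null=1.0):
    """Approximate two-sided P-value and S-value for a ratio measure,
    reconstructed from its point estimate and confidence limits."""
    nd = NormalDist()
    se = (log(upper) - log(lower)) / (2 * nd.inv_cdf(0.5 + level / 2))
    z = (log(estimate) - log(null)) / se
    p = 2 * (1 - nd.cdf(abs(z)))
    return p, -log2(p)

# Figures quoted from the abstract above.
for hr, lo, hi in ((1.59, 1.17, 2.17), (1.61, 0.997, 2.59)):
    p, s = p_from_ratio_ci(hr, lo, hi)
    print(hr, round(p, 4), round(s, 1))
```

Under these assumptions the first result gives p of roughly 0.003 (about 8.3 bits) and the second p of roughly 0.05 (about 4.3 bits) against HR = 1, with nearly identical point estimates. Neither supports a claim of no association.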


There is only one thing that I disagree with in this post: “p-values may be dispensed with when frequentist analysis is used”. Since the easiest way to calculate what matters, the (minimum) false positive risk, is from the p value, if the p value is not stated, one has to extract it from the confidence interval. That’s easy but irritating.

That is indeed a wonderful teaching example (and a terrible example of statistical thinking). This speaks to your emphasis on untying confidence intervals from whether they contain the null. I could use one of your earlier arguments to conclude that it’s not necessarily the foundations but the poor use in practice that we’re seeing in this example (but I believe that both are happening here).

I do feel that changing the narrative makes things better. Sometimes you have to change a system to get people to wake up. If going Bayes, though, we need to go all the way in this sense: as you’ve said, a 0.95 credible interval still requires the arbitrary 0.95 choice. Bayesians prefer to draw the entire posterior distribution. My own preference is to show the posterior density superimposed with subjective probabilities of clinical interest: probability of any efficacy, probability of clinical similarity, probability of “big” efficacy. Except for the probability of any efficacy, these require clinical judgments (separate from priors, data model, and other posterior probabilities) to define values to request posterior probabilities for.
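The three probabilities mentioned above are simple to read off once a posterior is available. The sketch below assumes, purely for illustration, that the posterior for the treatment effect is approximately normal and that positive values mean benefit; the similarity margin and the “big” threshold are hypothetical clinical judgments.

```python
from statistics import NormalDist

def clinical_probabilities(post_mean, post_sd, similarity_margin, big_effect):
    """Posterior probabilities of clinical interest under a normal posterior."""
    post = NormalDist(post_mean, post_sd)
    return {
        "any efficacy": 1 - post.cdf(0.0),
        "similarity": post.cdf(similarity_margin) - post.cdf(-similarity_margin),
        "big efficacy": 1 - post.cdf(big_effect),
    }

# Hypothetical posterior: mean 3 mmHg, SD 2; margin 2 mmHg; "big" = 5 mmHg.
for name, prob in clinical_probabilities(3.0, 2.0, 2.0, 5.0).items():
    print(name, round(prob, 3))
```

With posterior draws from MCMC instead of a closed-form posterior, the same quantities are just the proportions of draws falling in each region.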


Just one addition: Sander’s teaching example is discussed in our new paper (Amrhein, Trafimow & Greenland)


Frank: Bear in mind that a frequentist can parallel every presentation and narrative maneuver of a Bayesian, and vice versa. Embedding both in a hierarchical framework there is a straight mapping between them (Good, Am Stat 1987;41: 92). In your description you could instead show the corresponding penalized-likelihood function or penalized P-value (“confidence”) function with the points of clinical interest called out. Either would achieve the same goal of helping change the narrative toward a continuous one, away from the dichotomania that has dominated so much of the research literature since Fisher’s heyday. This shows that the issue is not foundational or mathematical, it is presentational and thus cognitive.

The resistance I have encountered in medical-journal venues to such change is staggering, with editors attempting to back-rationalize every bad description and practice they have published (often appealing to anonymous “statistical consultants” for authoritative validation). Perhaps that’s unsurprising since to see the problem would be to admit that their policies have led to headlining of false statements (and thus harming of patients) for generations. The only idea I have for effecting change is to begin prominent publicization of each specific error (such as the Brown example) in the hopes that external outrage and protest accumulates enough to start having an impact.


Sander, it’s probably best to keep that disagreement to another topic. Just one example to show the depth of the disagreement: frequentists would have a very hard time doing the equivalent of computing the posterior probability that at least 3 of 5 outcomes are improved by a treatment, where ‘improved’ means a positive difference in means between treatments.
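On the Bayesian side this computation is a counting exercise over joint posterior draws. The sketch below is illustrative only: the draws are simulated from independent normals with made-up means and SDs, whereas a real analysis would use draws from the fitted joint posterior, which would also carry the correlations among outcomes.

```python
import random

def prob_at_least_k_improved(draws, k=3):
    """Share of posterior draws in which at least k differences are positive."""
    hits = sum(1 for d in draws if sum(x > 0 for x in d) >= k)
    return hits / len(draws)

def simulate_draws(means, sds, n=20000, seed=1):
    """Stand-in for MCMC output: independent normal draws per outcome."""
    rng = random.Random(seed)
    return [[rng.gauss(m, s) for m, s in zip(means, sds)] for _ in range(n)]

# Hypothetical posterior means and SDs for five treatment differences.
draws = simulate_draws([1.0, 0.5, 0.0, -0.2, 0.8], [1.0] * 5)
print(round(prob_at_least_k_improved(draws, k=3), 3))
```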

I don’t doubt this in the least. But I still want to resist it. One highly regarded statistician (I forget who) stated that he will not allow his name to be on a paper that uses the phrase “statistical significance”. I take various strong stands as a frequent reviewer for medical journals, and find myself frequently disagreeing with clinician reviewers on statistical issues. For Bayes it’s a cart or the horse issue sometimes. Until it’s used more often, journal editors will resist.

Besides publicizing errors the usual way, a little Twitter shame can help, plus journal club blogs. I think the low-hanging fruit is the use of cutoffs for anything, and the ‘absence of evidence is not evidence of absence’ error.


Frank: This disagreement is central to the topic heading here as well as to its sister topic of communicating Bayesian results. In all this I am surprised by your failure to see the frequentist-Bayes mapping, which is key to proper use and description of both types of statistics. Please read Good (1987 at least; it’s one page).

If we treat the hierarchical model used for your Bayesian odds (prior+conditional data model) as a two-stage (random-parameter) sampling model, we can compute a P-value for the hypothesis that the treatment improves at least 3 of 5 outcomes. This computation can use the same estimating function as in classical Bayesian computation, where the log-prior is treated as a penalty function added to the loglikelihood. The penalized-likelihood-ratio P-values are then Laplace-type approximations to posterior probabilities (and excellent ones in regular GLMs). But even better the mapping provides checks on standard overconfident Bayesian probabilities such as “the” posterior probability that at least 3 of 5 outcomes are improved by treatment.

First, the mapping provides a cognitive check in the form of a sampling narrative: If treating the prior as a parameter sampler (or its log as a frequentist penalty) looks fishy, we are warned that maybe our prior doesn’t have a good grounding in genuine data. Such narrative checks are essential; without them we should define a Bayesian statistician as a fool who will base analyses on clinical opinions from those whose statistical understanding he wouldn’t trust for a nanosecond and whose beliefs have been warped by misreporting of the type seen in Brown et al. (which was headlined in Medscape as showing “no link”).

Going “full-on Bayes” is dangerous, not only when it fails to check its assumptions within a contextual sampling narrative, but when it misses Box’s point that there are frequentist diagnostics for every Bayesian statement. In your example a posterior odds of at least 3:2 on improvement would call for diagnostics on the (prior+data) model used to make that claim, including a P-value for the compatibility of the prior with the likelihood (not a “posterior predictive P-value”, which is junk in frequentist terms).
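One simple form such a prior-versus-likelihood diagnostic can take is a prior predictive check in a normal-normal model. This is an assumption made for illustration, not necessarily the exact diagnostic intended above: if the estimate is normal around the true effect with standard error `se`, and the prior is normal with mean `prior_mean` and SD `prior_sd`, then under the hierarchical model the estimate minus the prior mean has variance `se**2 + prior_sd**2`.

```python
from math import sqrt
from statistics import NormalDist

def prior_data_p_value(estimate, se, prior_mean, prior_sd):
    """Two-sided P-value for compatibility of a normal prior with the data
    estimate, treating the prior as a sampling model for the parameter."""
    z = (estimate - prior_mean) / sqrt(se ** 2 + prior_sd ** 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical: estimate 3 (SE 1) against a prior centered at 0 with SD 1.
print(round(prior_data_p_value(3.0, 1.0, 0.0, 1.0), 3))
```

A small value flags conflict between prior and data under the assumed model, which is a reason to examine both before reporting any posterior probability.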

Frequentist results are chronically misrepresented and hence miscommunicated. Yes their Bayesian counterparts can be phrased more straightforwardly, but that’s not always an advantage because their overconfident descriptions are harder to see as misleading. “The posterior probability” is a prime example because there is no single posterior probability, there is only a posterior probability computed from a possibly very wrong hierarchical (prior+data) model. Similarly there is no single P-value for a model or hypothesis.

How is this relevant to proper interpretation and communication? Well, it would improve frequentist interpretations if they recognized that the size of a computed P-value may reflect shortcomings of the model used to compute it other than falsity of the targeted hypothesis. The hierarchical mapping tells us that Bayesian interpretations would also be improved if they recognized that the size of a posterior probability may only reflect shortcomings of the underlying hierarchical (prior+sampling) model used to compute it rather than a wise bet about the effect.

In my view, providing only a posterior distribution (or P-value function) without such strong conditioning statements is an example of uncertainty laundering (Gelman’s term), just as is calling P-values “significance levels” or intervals summarizing them “confidence intervals.” And I think it will lead to much Bayesian distortion of the medical and health literature. Even more distortion will follow once researchers learn how to specify priors to better ensure the null ends up with high or low posterior probability (or in or out of the posterior interval). Hoping to report the biggest effect estimate you can? Use a reference prior or Gelman’s silly inflated-t(1)/Cauchy prior on that parameter. Hoping instead to report a “null finding”? Use a null spike or a double exponential/Lasso prior around the null. Via the prior, Bayes opens up massive flexibility for subtly gaming the analysis - in addition to the flexibility in the data model that was already there (and which Brown et al. exploited along with dichotomania in switching from a Cox model to HDPS for framing their foregone null conclusions).


I’m glad you pushed forward on Frank’s response. I was about to chime in with a less elegant response, though underscoring the gravity of the queries you present as they have real life consequences on patients.