# Language for communicating frequentist results about treatment effects

Editorial Notes

• As Sander Greenland has so well stated below, this initial draft was still too “dichotomous” in that it was written assuming we would have different language for “positive” vs. “negative” studies. This implies a threshold for “positive” which is what we’re trying to get away from. So I’m trying to develop more “positive/negative agnostic” language. One initial stab at this is below in the Generic Interpretation section. The examples demonstrating incorrect and confusing statements appearing in the first draft still stand.
• I have edited posts to be clear that a negative estimate for the treatment B minus treatment A blood pressure is beneficial, and used an example point estimate of -4mmHg in some cases.
• I have incorporated Sander’s nomenclature instead of “confidence interval”. To be able to use new terminology that is less confusing, it will be necessary to educate journal editors and reviewers. A section making an initial attempt to do so appears below.

As detailed here, there are many advantages of using a Bayesian approach for stating and interpreting statistical results of an efficacy comparison in a randomized clinical trial. But within the frequentist framework, what is the most honest, accurate, and transparent language to use? We know that p-values are greatly overused and often misinterpreted, and perhaps a majority of statisticians believe that confidence intervals are preferable to p-values if frequentist methods are used. The primary reason for this belief is that a confidence interval contains all the information contained in a p-value, but also contains the point estimate and a measure of uncertainty in that estimate. This allows a confidence interval to be used, at least indirectly, to bring evidence in favor of small effects. On the other hand, because of the “absence of evidence is not evidence of absence” problem, a p-value conveys essentially no information when it’s large. A p-value may be large because the effect is small or because the uncertainty is large, for example due to a small sample size. Ronald Fisher’s advice for interpreting a large p-value was “get more data”.

There are two types of non-wrong language that may be used to describe frequentist statistical results: exact and verbose, or approximate, pragmatic, and brief. Both are used below. Note that “statistical significance” or “non-significance” is nowhere to be found in any of the preferred language.

Because language below challenges the status quo in medical articles, we start with proposed text to include in the manuscript submission cover letter.

# Text for Cover Letter

The manuscript submitted for your review used classical (frequentist) statistical methods. In classical statistics, p-values and significance tests have been misinterpreted and misused in biomedical research to such an extent that the American Statistical Association found it necessary to issue a formal expression of concern with a recommendation to not use arbitrary thresholds for significance and to not attempt to use the p-value by itself to present evidence from data about underlying effects. On a related issue, the problem of medical articles attempting to infer lack of effects from large p-values, the so-called “absence of evidence is not evidence of absence” error, is widespread. The terms “statistical significance” and “statistical insignificance” have come to mean very little, and we believe that if any cutoffs for evidence are to be used, these cutoffs should be left to readers and decision makers. So in our submitted manuscript we avoid the words “significance” and “insignificance” (in the statistical, not the clinical sense), and we avoid p-value cutoffs (α level).

Confidence limits have been suggested as alternatives or at least supplements to p-values, and we agree they add value. But confidence intervals have been badly misinterpreted throughout their history as providing an interval that contains the unknown effect with a stated probability. The word “confidence” was meant to soften the probability interpretation, but misunderstandings have persisted. A confidence interval is a random interval that enjoys good long-term operating characteristics but makes no statement from the dataset at hand. To avoid such confusion, we use the term compatibility interval within the text. This term is more consonant with what the interval really means, since a 0.95 (for example) confidence interval is the set of all effect values that, if hypothesized to be the true unknown value of a parameter (e.g., a treatment effect), would not be rejected by the statistical test of that “null” hypothesis at the 0.05 (for example) level. Thus the compatibility interval (confidence interval) is the set of all values of the true effect that are compatible with the data in the sense of not being rejected as hypotheses.

# Interpretations of So-called “Negative” Trial

Consider the examples from here. Suppose that the primary endpoint is systolic blood pressure (SBP) and that the effect of treatment B in comparison with treatment A is summarized by the B-A difference in mean SBP. Suppose that treatment A had a sample mean SBP of 134 mmHg and treatment B had 130 mmHg so that the observed B-A difference is -4 mmHg. Suppose that the two-sided p-value for this comparison is 0.4, and the 0.95 compatibility interval is [-13, 5].
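
As a rough check on how the numbers in this example fit together, here is a minimal Python sketch. It assumes a normal approximation (so the interval is the estimate plus or minus 1.96 standard errors); that assumption and the variable names are mine, not part of the example. Under it, the stated interval implies a standard error of about 4.6 mmHg and a two-sided p-value of roughly 0.38, in line with the rounded 0.4 above.

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

estimate = -4.0              # observed B-A difference in mean SBP (mmHg)
lower, upper = -13.0, 5.0    # 0.95 compatibility limits from the example
z_crit = 1.959964            # 0.975 quantile of the standard normal

# Assumption: a symmetric normal-theory interval, estimate +/- z_crit * SE.
se = (upper - lower) / (2.0 * z_crit)
z = estimate / se
p_two_sided = 2.0 * (1.0 - normal_cdf(abs(z)))

print(f"SE = {se:.2f}, z = {z:.2f}, two-sided p = {p_two_sided:.2f}")
```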

## Brutally Honest Statement

The money was spent.

## Less Brutal Honest Statement That Admits Arbitrariness of Sample Size

The presumption of no difference is not yet overcome by data.

## Incorrect Statement

Treatment B does not improve SBP when compared to A (p=0.4)

## Confusing Statement

Treatment B was not significantly different from treatment A (p=0.4). (If this statement allowed the reader to conclude anything, why did the study need to randomize more than 2 patients?)

## Accurate Statement

We were unable to find evidence against the hypothesis that A=B (p=0.4) with the current sample size. More data will be needed. As the statistical analysis plan specified a frequentist approach, the study did not provide evidence of similarity of A and B (but see the compatibility interval).

## Accurate Statement That is More Direct

Assuming the study’s experimental design and sampling scheme, the probability is 0.4 that another study would yield a test statistic for comparing two means that is more impressive than what we observed in our study, if treatment B had exactly the same true mean SBP as treatment A. (This statement is best at pointing out the limitations of how p-values can be interpreted.)

## Shorter Statement

The study did not contradict the supposition that treatments A and B yield the same mean SBP (p=0.4).

## Incorrect Interpretation of Compatibility/Confidence Interval

There is a 0.95 chance that the true difference in mean SBP is between -13 and +5.

## Accurate Interpretation of Compatibility Interval

The observed B-A difference in means was -4mmHg with a 0.95 confidence interval of [-13, 5]. If this study could be indefinitely replicated and the same approach used to compute a confidence interval each time, 0.95 of such varying confidence intervals would contain the unknown true difference in means. Based on the current study, the probability that the true difference is within [-13, 5] is either zero or one, i.e., we cannot interpret the interval except for its properties in the long run.
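
The long-run property described above can be illustrated by simulation. The following is only a sketch under assumptions of my own choosing: normally distributed SBP, a known common standard deviation of 20 mmHg, 40 subjects per arm, and a true difference of -4 mmHg. None of these values come from the example; they simply make the replication idea concrete.

```python
import math
import random
import statistics

def simulate_coverage(true_diff=-4.0, sd=20.0, n_per_arm=40,
                      n_trials=2000, seed=1):
    """Fraction of simulated 0.95 intervals that contain true_diff."""
    rng = random.Random(seed)
    z_crit = 1.959964  # 0.975 quantile of the standard normal
    # Assumption: SD is treated as known, so the z-interval is exact.
    se = sd * math.sqrt(2.0 / n_per_arm)
    covered = 0
    for _ in range(n_trials):
        a = [rng.gauss(134.0, sd) for _ in range(n_per_arm)]
        b = [rng.gauss(134.0 + true_diff, sd) for _ in range(n_per_arm)]
        diff = statistics.fmean(b) - statistics.fmean(a)
        # Each replicate gets its own interval; only the procedure is fixed.
        if diff - z_crit * se <= true_diff <= diff + z_crit * se:
            covered += 1
    return covered / n_trials

coverage = simulate_coverage()
print(f"long-run coverage: {coverage:.3f}")
```

The proportion should land near 0.95. Note that it describes the procedure across replicates, not any single interval such as [-13, 5].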

Other good language, paraphrased from Richard Morey:
If the true SBP mean difference < -13, we’d almost never (0.025 of the time) expect a difference as large as we observed. Thus differences < -13 are contradicted by the data. If the true difference > 5, we’d almost never (0.025 of the time) expect a difference as small as we observed. Thus differences > 5 are contradicted by the data.

## Slightly Inaccurate but Useful Succinct Interpretation of Compatibility Interval

The data are consistent with a true difference of means between -13 and 5. (This statement is accurate in the sense that a 0.95 compatibility interval is defined as the set of all values of the true mean difference that if null-hypothesized would not be rejected at the 0.05 level.)
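
The definition in parentheses can be made concrete by inverting the test over a grid of hypothesized differences. This is a sketch assuming a normal approximation with the standard error backed out of the stated interval; the grid spacing and function name are illustrative choices, not part of the original example.

```python
import math

def two_sided_p(estimate: float, se: float, hypothesized: float) -> float:
    """Two-sided normal-theory p-value for a hypothesized true difference."""
    z = (estimate - hypothesized) / se
    return math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))

estimate = -4.0
se = (5.0 - (-13.0)) / (2.0 * 1.959964)  # assumed: normal-theory interval

# Hypothesized differences from -20 to 15 mmHg in steps of 0.1.
grid = [i / 10.0 for i in range(-200, 151)]
compatible = [d for d in grid if two_sided_p(estimate, se, d) > 0.05]

print(f"values not rejected at 0.05: {min(compatible)} to {max(compatible)}")
```

The surviving values run from about -13 to about 5 (up to the grid resolution), which is the compatibility interval itself.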

# What About a “Positive” Trial?

Compatibility intervals do not depend on single null hypotheses, and are interpretable no matter how small or large is the p-value. In a more “positive” study the compatibility interval may be shifted to the right. But the above interpretations stand. What wording should be used pertaining to the p-value? Some examples are below.

## Incorrect Statement

Treatment B is better than treatment A (p=0.02). (That is true for the sample data, and may be true in general, but p-values don’t prove anything.)

## Confusing Statement

There was a statistically significant lowering of SBP by treatment B (p=0.02).

## Accurate Statement

The study found evidence against the hypothesis that A=B (p=0.02). (It is not exactly correct to say that the study provided evidence in favor of positive efficacy, but that would be only a slight stretch of what frequentist tests do.)

## Accurate Statement That More Directly Defines the p-value

Assuming the study’s experimental design and sampling scheme, the probability is 0.02 that another study would yield a test statistic for comparing two means that is more impressive than what we observed in our study, if treatment B had exactly the same true mean SBP as treatment A.

# Generic Interpretation

The safest way to state results is to remove positive/negative judgments and leave the decision to the reader or someone else who possesses a greater totality of evidence or who wants to insert their own utility/loss/cost function into their decisions regarding medical treatments. So we start with that in the next subsection. But that will not provide medical journals the headlines they so vigorously seek.

## Safe Statement of Results Agnostic to Positive/Negative

Treatment B was observed in our sample of n subjects to have a 4mmHg lower mean SBP than treatment A with a 0.95 2-sided compatibility interval of [-13,5]. If absolutely necessary to state a p-value: … compatibility interval of [-13, 5], p=0.11 without declaring significance or insignificance. The compatibility interval indicates a wide range of plausible true treatment effects.

## Agnostic Statement That Alludes to a Degree of Evidence

Treatment B was observed in our sample of n subjects to have a 4mmHg lower mean SBP than treatment A with a 0.95 2-sided compatibility interval of [-13, 5], indicating a wide range of plausible true treatment effects. The degree of evidence against the null hypothesis that the treatments are interchangeable is p=0.11. The smaller the p-value the greater the evidence against the null.

Note: Had the p-value been “small” and the compatibility interval been [-10, -1], the phrase “indicating a wide range of plausible true treatment effects” could remain. Had the compatibility interval been [-8, -6] one could say “indicating a narrow range for plausible true treatment effects in the absence of extra-study information”. One could defer all conclusions about the compatibility limits, which will be unsatisfying to reviewers and editors.

Note: To help readers avoid “absence of evidence is not evidence of absence” errors, one may consider adding, e.g. if the compatibility limits are [-10, 2]: “The data are consistent with a hypothesized mean SBP increase of 2mmHg with treatment B all the way to a hypothesized 10mmHg decrease, where ‘consistent with’ is taken to mean ‘not rejected at the arbitrary 0.05 level’” (all quoted text is intended to appear in the paper).

# Summary

Since p-values are less informative than compatibility intervals, and unlike p-values compatibility intervals are useful in bringing evidence in favor of small effects (or non-inferiority), p-values may often be dispensed with when frequentist analysis is used. Compatibility intervals must be understood as having only long-term operating characteristics, and wording of their interpretations should be accurate if interpretations are not left to the reader. If the authors or editor insist on inclusion of p-values, interpretations should be accurately worded or left to the reader. Large p-values are more often misinterpreted than small p-values.


This is great! There is another inaccurate interpretation of a confidence interval that sometimes creeps its way into the discourse. Taking the example [-5, 13], I’ve seen it incorrectly interpreted as follows: “If this study could be indefinitely replicated and the same approach used to compute an estimate each time, 95% of the resulting estimates would lie between -5 and 13.”

They get the idea of the infinite replicates right, but—and I’m totally sympathetic to this—they forget that they can’t actually meaningfully use the specific endpoints of their confidence interval in the “infinite replicate” interpretation.

For this reason, I like the “slightly inaccurate but useful succinct interpretation,” although I usually phrase it as follows: “The confidence interval [a, b] contains all the values that cannot be ruled out by your data at the 0.05 level.”


Let’s say we don’t report p-values at all. How would you report an effect of 4 mmHg with a 0.95 confidence interval of [2, 6]?

Short: The observed B-A difference in mean SBP was 4mmHg with a 0.95 confidence interval of [2, 6].

Longer: The observed B-A difference in mean SBP was 4mmHg with a 0.95 confidence interval of [2, 6]. There is evidence for moderate improvement in efficacy with B, with the data consistent with a difference as low as 2. Or to use @SpiekerStats’ approach: an effect as small as 2 cannot be ruled out at the 0.05 level. Or: there is little evidence against an effect as low as 2 or as large as 6.


I doubt any paper would allow such reporting, but I would write it like this:

"Although there may have been a difference in the means between the samples, we did not reject the null hypothesis based on our prespecified criterion. The 95% compatibility interval tells us that values as low as -5 and as high as 13 are compatible with the test model, assuming that every assumption is correct.

Our results should not be taken as evidence for the null because it is possible that we may have lacked the statistical power to reject the test hypothesis, but it is also possible that there really is no meaningful difference between these two treatments.

In order to produce more certain evidence of little to no difference, future investigators may want to utilize equivalence testing to reject meaningful effects, or plan studies based on precision; if the resulting compatibility intervals are closely clustered around the null, we can be more confident that there is no meaningful difference."

*Edited to remove mentions of “statistical significance” and “confidence” has been replaced with “compatibility.”


I don’t want to say “significant” anywhere. It implies a p-value cutoff and is unclear (e.g., doesn’t distinguish statistical from clinical significance). And “we do not have enough data” is relative. The confidence interval tells you how much data you have so I don’t feel we need to add language to that. “Much uncertainty” could also be called relative and I’d rather let the interval speak.

I would use the confidence interval for that, without having to do a different statistical test.


It’s incorrect to interpret the confidence interval as the probability that B-A is between the lower and upper limit of the interval. That’s stats 101.

But isn’t the confidence interval equivalent to the Bayesian credible interval with a flat prior? At least asymptotically? From this perspective, isn’t the above interpretation of the confidence interval sort of correct?

May be stats 101 but almost nobody understands this.

2nd point; yes, sort of. The numbers may be the same but the interpretation isn’t, so I wouldn’t say they are equivalent.

My modest contribution: https://osf.io/725sz/
Incorrect statements abound!


That’s my take. Since frequentists don’t believe that probabilities are associated with model parameters, whether you can find special cases of equivalence doesn’t matter very much. You have to go “all in” with Bayes in order to get the interpretation clinicians seek, e.g. the probability is 0.95 that the unknown treatment effect is between -5 and 13mmHg. Also note that the special case breaks down entirely if the sample size is not fixed in advance or if there is more than one look at the data. Bayesians have no way nor need to adjust for multiplicity of this kind.


For a 95% CI, how about the straightforward interpretation of a prediction interval:

“There is an 83% chance a replication with the same N will obtain a result within this interval”

To my mind, this is the most useful interpretation. Note the capture rate is not 95% because both estimates are subject to sampling error. I think that means that as the sample size for the replication grows towards infinity the expected capture rate grows toward the confidence level.
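
A short sketch of where the 83% figure can come from, under assumptions I am adding for illustration: both estimates are normally distributed around the same true effect, and the replication's standard error scales with the inverse square root of its sample size. The difference between the two estimates then has variance SE²(1 + 1/r), where r is the ratio of replication to original sample size. The function name and the `n_ratio` parameter are hypothetical.

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def capture_probability(n_ratio: float = 1.0, z_crit: float = 1.959964) -> float:
    """Average chance that a replication estimate lands inside an original
    0.95 interval; n_ratio = replication n / original n."""
    # The replicate minus the original estimate has SD = SE * sqrt(1 + 1/n_ratio).
    scale = math.sqrt(1.0 + 1.0 / n_ratio)
    return 2.0 * normal_cdf(z_crit / scale) - 1.0

print(f"same N:          {capture_probability(1.0):.3f}")
print(f"100x larger N:   {capture_probability(100.0):.3f}")
print(f"very large N:    {capture_probability(1e9):.3f}")
```

With equal sample sizes this gives about 0.834, and it approaches 0.95 as the replication grows. This is an average over possible original studies; the chance for any one particular interval depends on how far its estimate happens to be from the true effect.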

Reference is Cumming & Maillardet, 2006, http://psycnet.apa.org/buy/2006-11159-001.

I’m curious if you object to these or not. I’m also curious if you think the same statements can be made about Bayesian credible intervals. I can’t quite wrap my head around it, but I think Bayesian credible intervals are not prediction intervals.


I don’t feel that will be quite satisfying for the majority of collaborations. And I think that evidentiary statements need to be based on the parameter space, not the data space.

For Bayes this is quite clean: draw samples from the posterior predictive distribution. But again, the data space is not really what’s called for IMHO.

I’m sorry, but I’m not sure if I understand the response.

For myself, as a scientist, I am quite happy to use CIs as prediction intervals–they help calibrate my expectations of what to expect if my lab (or another) replicates the procedure. When the CI is very long, I must sadly admit that the replication could provide almost any outcome and that my current work provides almost no guidance in what to expect. When the CI is very short I can make a relatively precise prediction that is helpful for planning the replication and interpreting the outcome. I guess every scientist is different, but I want nothing more than to properly calibrate my expectations about future research outcomes.

Could you expand a bit on the Bayesian credible interval as a prediction interval? I know Bayesians don’t define probability in terms of frequency. So does a credible interval have any meaningful interpretation in terms of expected frequencies of replication outcomes? Do the qualities of a credible interval as a prediction interval depend strongly on the prior?

I think that’s a very good use of the prediction interval. But I don’t see it as useful for making an inference about the unknown effect.

Right, Bayesian methods don’t use the frequency interpretation of probability very often. I wasn’t speaking of using a credible interval for a prediction interval. A credible interval is a posterior distribution interval, which is a statement about an unknown parameter. On the other hand, you can compute the Bayesian posterior predictive distribution to simulate new data or new group means based on the current signal about the unknown parameters, and the prior.
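
To make the posterior predictive idea concrete, here is a minimal simulation sketch. It leans on simplifying assumptions that are mine: a flat prior and a normal likelihood, so the posterior for the true difference is normal with mean equal to the estimate and SD equal to the standard error, and a replication with the same sample size. The estimate and standard error are taken from the SBP example (-4 and about 4.59); the function name is hypothetical.

```python
import random

def posterior_predictive_capture(estimate=-4.0, se=4.59,
                                 n_draws=50_000, seed=7):
    """Share of simulated replicate estimates falling inside the
    0.95 posterior interval of the original study."""
    rng = random.Random(seed)
    z_crit = 1.959964
    lo, hi = estimate - z_crit * se, estimate + z_crit * se
    inside = 0
    for _ in range(n_draws):
        # Step 1: draw a plausible true difference from the posterior
        # (flat prior assumed, so posterior is Normal(estimate, se)).
        delta = rng.gauss(estimate, se)
        # Step 2: draw a new study's estimate given that true difference.
        new_estimate = rng.gauss(delta, se)
        if lo <= new_estimate <= hi:
            inside += 1
    return inside / n_draws

capture = posterior_predictive_capture()
print(f"replicates inside the 0.95 interval: {capture:.3f}")
```

Under these assumptions the share is well below 0.95, which is one way to see that an interval for the parameter is not the same thing as a prediction interval for new data. A different prior would change the result.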

This is a very insightful and helpful thread. I wanted to jump in with a quick question which I often encounter. In the drug trial for BP reduction, the effect for B-A was, say, 3 mmHg with a 0.95 confidence interval of (-1, 6) and a p-value of 0.06. How should we interpret this? Does the interpretation that “we did not find evidence against the null hypothesis of A=B” still hold in this case? What if the point estimate was 6 mmHg, 0.95 CI (-1, 13) and p-value 0.06? Thank you!

For the record, I disagree with a lot of what is given here.
To start, I don’t think this statement is perfectly accurate:
“We were unable to find evidence against the hypothesis that A=B (p=0.4) with the current sample size.”
The statement claims zero evidence against the tested hypothesis, whereas I would say instead the p=0.4 constitutes almost no evidence against that hypothesis (by the Shannon surprisal measure I use, p=0.4 supplies s = log2(1/0.4) = 1.3 bits of information against the hypothesis - barely more information than in one coin toss). The sample size addition is unnecessary if harmless, as p combines (some would say confounds) sample size and estimate size and a larger estimate would have likely produced a smaller p at the same sample size (only “likely” because exceptions can occur when the standard error inflates faster than the estimate at a given sample size).

Conversely, I think it is misleading to say “The study found evidence against the hypothesis that A=B (p=0.02)” because there is no magic point at which P-values start to “find evidence” against hypotheses. Instead one could say “The study found log2(1/.02) = 5.6 bits of information against the hypothesis that A=B.” The ultimate goal is to get away from dichotomies like “negative” and “positive” trials - those are all experimental tests of the hypothesis, and their P-values measure only the amount of information each supplies against that hypothesis.
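
For readers who want to reproduce the bit counts quoted in this post, the S-value is simply the base-2 logarithm of 1/p. A minimal sketch follows; the function name is my own.

```python
import math

def s_value(p: float) -> float:
    """Shannon surprisal (S-value) in bits: -log2(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)

for p in (0.4, 0.05, 0.02):
    print(f"p = {p:<4} -> {s_value(p):.1f} bits")
# p = 0.4  -> 1.3 bits
# p = 0.05 -> 4.3 bits
# p = 0.02 -> 5.6 bits
```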

I might agree with the sentiment behind “As the statistical analysis plan specified a frequentist approach, the study did not provide evidence of similarity of A and B”, but it also seems wrong as stated because

1. it seems to confuse the pre-specification problem with the fact that P-values do not (in isolation, at least) measure support, period, only degree of conflict (incompatibility) between models and data, and
2. a frequentist can assess similarity indirectly by specifying a similarity (equivalence) interval and seeing whether there is more than a given amount of evidence against the true difference being outside that interval. Thus I would restate it as something like “As the statistical analysis plan did not specify an interval of equivalence, we did not assess similarity of A and B.”
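
One common way to carry out the indirect assessment in point 2 is two one-sided tests (TOST) against a pre-specified equivalence interval. The sketch below is an illustration only: it assumes a normal approximation, uses the SBP example's estimate of -4 and a standard error backed out of [-13, 5], and the margins of 5 and 15 mmHg are hypothetical values chosen to show both outcomes, not anything specified in the thread.

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def tost_p(estimate: float, se: float, margin: float) -> float:
    """Largest of the two one-sided p-values for the hypotheses that the
    true difference lies outside (-margin, +margin)."""
    p_lower = 1.0 - normal_cdf((estimate + margin) / se)  # H0: diff <= -margin
    p_upper = normal_cdf((estimate - margin) / se)        # H0: diff >= +margin
    return max(p_lower, p_upper)

estimate = -4.0
se = (5.0 - (-13.0)) / (2.0 * 1.959964)  # assumed normal-theory interval

for margin in (5.0, 15.0):  # hypothetical equivalence margins in mmHg
    print(f"margin ±{margin:g} mmHg: TOST p = {tost_p(estimate, se, margin):.3f}")
```

With the narrow margin there is little evidence against the difference being outside the interval; with the wide margin there is considerably more. As with any p-value, the result is better read as a graded measure than as a pass/fail verdict.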

I also think this statement commits a potentially important if standard mistake of omission:
“Assuming the study’s experimental design and sampling scheme, the probability is 0.4 that another study would yield a test statistic for comparing two means that is more impressive than what we observed in our study, if treatment B had exactly the same true mean as treatment A.” It needs amendment to something like
“Assuming the study’s sampling scheme, experimental design, and analysis protocol, the probability is 0.4 that another study would yield a test statistic for comparing two means that is as or more impressive than what we observed in our study, if treatment B had exactly the same true mean as treatment A and all statistical modeling assumptions used to get p are correct or harmless.” That’s because almost all statistics I see in my areas are derived from regression models (whether for the outcome, treatment, or both). Sorry if the added conditions seem nit-picky, but it is not hard to find examples where their failure has a non-negligible effect.

Next, I think this statement is commonly believed and simply wrong from an information perspective: “we cannot interpret the interval except for its properties in the long run.” No: As noted just below that statement, the 95% interval can be interpreted for this single data analysis as the interval of all parameter values for which p>0.05. Thus the interval shows the parameter values for which the data supply less than log2(1/0.05) = 4.3 bits of information against them, given the model used to compute the interval. This interpretation does rely on the repeated-sampling property that, given the model, the random P across studies is uniform at the correct value; this property ensures that the S-value captures the refutational information in the test statistic (note that posterior predictive P-values are not uniform and thus do not sustain this interpretation).

Finally, if not clear from the above, I disagree that P-values are wisely dispensed with in favor of confidence intervals. Confidence intervals invite the distortion of 0.05-level dichotomization. The problem with P-values is that almost no one computes them for more than the null hypothesis. They should instead be given not only for the targeted “null” but also at least for one more hypothesis, including the protocol-specified alternative used to compute power or sample size.
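
As a sketch of what reporting P-values for more than one hypothesis might look like, the following computes p (and its S-value) for the null and for one alternative, using the SBP example. The normal approximation and the alternative of a 10 mmHg reduction are assumptions for illustration; the thread does not state what alternative a protocol would have used.

```python
import math

def two_sided_p(estimate: float, se: float, hypothesized: float) -> float:
    """Two-sided normal-theory p-value for a hypothesized true difference."""
    z = (estimate - hypothesized) / se
    return math.erfc(abs(z) / math.sqrt(2.0))

estimate = -4.0
se = (5.0 - (-13.0)) / (2.0 * 1.959964)  # assumed normal-theory interval

hypotheses = {
    "null (B-A = 0)": 0.0,
    "hypothetical protocol alternative (B-A = -10)": -10.0,
}
results = {}
for label, value in hypotheses.items():
    p = two_sided_p(estimate, se, value)
    results[label] = p
    print(f"{label}: p = {p:.2f}, S = {-math.log2(p):.1f} bits")
```

Here neither hypothesis is strongly contradicted by the data, which a single p-value for the null would not reveal.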


I believe @Sander raises some important points, especially this,

The common way to mention p is to discuss the assumption of the null hypothesis being true, but few definitions mention that every model assumption used to calculate p must be correct, including assumptions about randomization (assignment, sampling), chance alone, database errors, etc.

and also this point,

I don’t believe abandoning p values is really helpful. Sure they are easy to misinterpret, but that doesn’t mean we abandon them. Perhaps instead, we can encourage the following guidelines:

• thinking of them as continuous measures of compatibility between the data and the model used to compute them. Larger p = higher compatibility with the model, smaller p= less compatibility with the model

• converting them into S values to find how much information is embedded in the test statistic computed from the model, which supplies information against the test hypothesis

• calculating the p value for the alternative hypothesis, and the S value for that too


I also can’t say I agree with the ideas here,

If we are to be honest, the concepts of statistical power, statistical significance, and hypothesis testing are likely not going to disappear in favor of estimation, especially in many experimental fields, so I don’t believe it’s very helpful that we abandon the usage of these phrases.

Better that we emphasize teaching what statistical significance means, and how it differs from clinical significance. The meanings and interpretations of statistical significance and non-significance should be taught rather than abandoned out of fear of misinterpretation. Attempting to avoid such usage may actually backfire in the long run.


The ideas that p-values are redeemable are not convincing to me. They cause far more damage than benefit. So we’ll have to strongly disagree about this. I agree with many of Sander’s other points, but this one, though technically correct, is not as helpful to the researcher as it appears because it is not phrased in terms of the underlying inference/research question:

This comment is not quite on the mark in the original context because I was meaning to address the most common situation in which researchers and journals are misusing efficacy tests and do not have a similarity or non-inferiority test in mind:

The statement Sander makes with which I agree the most is the idea that there is no cutoff for when evidence is present vs. absent. I need to change the language to reflect this.

Over the next days I plan to edit the original post to reflect some, but not all, of the criticisms. This will be an incremental process as we seek to find acceptable wording that is short enough to actually be included in an article’s abstract.


“The ideas that p-values are redeemable are not convincing to me. They cause far more damage than benefit.” No, the misuse of P-values as “significance tests” is what causes the damage, especially when they are applied to only one hypothesis (this misuse was an “innovation” of Karl Pearson and Fisher; before them “significance” referred to a directional Bayesian tail probability, not a data-tail probability).

Neymanian confidence limits show the two points at which p=0.05, which is fine if one can picture how p smoothly varies beyond and between those points. But few users seem able to do that. Plus 95% confidence intervals have the disadvantage of relying on the misleading 0.05 dichotomy whereas at least P-values can be presented with no mention of it. No surprise then that in the med literature I constantly see confidence intervals used only as means to mis-declare presence or absence of an association. This abuse is aggravated by the overconfidence produced by the horrendous adjective “confidence” being used to describe what is merely a summary of a P-value function, and thus only displays compatibility; for that abuse of English, Neyman is as culpable for the current mess as Fisher is for abuse of the terms “significance” and “null”.

The point is that confidence intervals are just another take on the same basic statistical model, and as subject to abuse as P-values. Thus if one can’t “redeem” P-values then one should face up to reality and not “redeem” confidence intervals either. That’s exactly what Trafimow & Marks did (to their credit) when they banned both p-values and confidence intervals from their journal. Unfortunately for such bold moves, the alternatives are just as subject to abuse, and some like Bayesian tests based on null-spiked priors only make the problems worse.

The core problems are, however, created by erroneous and oversimplifying tutorials, books (especially of the “made easy” sort, which seems a code for “made wrong”), teachers, users, reviewers, and editors. Blaming the methods is just evading the real problems, which are the profound limits of human competency and the perverse incentives for analysis and publication. Without tackling these root psychosocial problems, superficial solutions like replacing one statistical summary with another can have only very limited impact (I say that even though I still promote the superficial solution of replacing P-values with their Shannon information transform, the S-value -log2(p), which you, Frank, continue to ignore in your responses).
