- As Sander Greenland has so well stated below, this initial draft was still too “dichotomous” in that it was written assuming we would have different language for “positive” vs. “negative” studies. This implies a threshold for “positive” which is what we’re trying to get away from. So I’m trying to develop more “positive/negative agnostic” language. One initial stab at this is below in the Generic Interpretation section. The examples demonstrating incorrect and confusing statements appearing in the first draft still stand.
- I have edited posts to be clear that a negative estimate for the treatment B minus treatment A blood pressure is beneficial, and used an example point estimate of -4mmHg in some cases.
- I have incorporated Sander’s nomenclature instead of “confidence interval”. To be able to use new terminology that is less confusing, it will be necessary to educate journal editors and reviewers. A section making an initial attempt to do so appears below.
As detailed here, there are many advantages of using a Bayesian approach for stating and interpreting statistical results of an efficacy comparison in a randomized clinical trial. But within the frequentist framework, what is the most honest, accurate, and transparent language to use? We know that p-values are greatly overused and often misinterpreted, and perhaps a majority of statisticians believe that confidence intervals are preferred to p-values, if frequentist methods are used. The primary reason for this belief is that a confidence interval contains all the information contained in a p-value, but also contains the point estimate and a measure of uncertainty in that estimate. This allows a confidence interval to be used, at least indirectly, to bring evidence in favor of small effects. On the other hand, because of the “absence of evidence is not evidence of absence” problem, a p-value conveys essentially no information when it’s large. A p-value may be large because the effect is small or because the uncertainty is large, due to a small sample size. Ronald Fisher’s advice for interpreting a large p-value was “get more data”.
There are two types of non-wrong language that may be used to describe frequentist statistical results: exact and verbose, or approximate, pragmatic, and brief. Both are used below. Note that “statistical significance” or “non-significance” is nowhere to be found in any of the preferred language.
Because language below challenges the status quo in medical articles, we start with proposed text to include in the manuscript submission cover letter.
The manuscript submitted for your review used classical (frequentist) statistical methods. In classical statistics, p-values and significance tests have been misinterpreted and misused in biomedical research to such an extent that the American Statistical Association found it necessary to issue a formal expression of concern with a recommendation to not use arbitrary thresholds for significance and to not attempt to use the p-value by itself to present evidence from data about underlying effects. On a related issue, the problem of medical articles attempting to infer lack of effects from large p-values, the so-called “absence of evidence is not evidence of absence” error, is widespread. The terms “statistical significance” and “statistical insignificance” have come to mean very little, and we believe that if any cutoffs for evidence are to be used, these cutoffs should be left to readers and decision makers. So in our submitted manuscript we avoid the words “significance” and “insignificance” (in the statistical, not the clinical sense), and p-value cutoffs (α levels).
Confidence limits have been suggested as alternatives or at least supplements to p-values, and we agree they add value. But confidence intervals have been badly misinterpreted throughout their history as providing an interval within which the unknown effect lies with a stated probability. The word “confidence” was meant to soften the probability interpretation, but misunderstandings have persisted. A confidence interval is a random interval that enjoys good long-term operating characteristics but makes no statement from the dataset at hand. To avoid such confusion, we use the term compatibility interval within the text. This term is more consonant with what the interval really means, since a 0.95 (for example) confidence interval is the set of all effect values that, if hypothesized to be the true unknown value of a parameter (e.g., a treatment effect), would not be rejected by the statistical test of that “null” hypothesis at the 0.05 (for example) level. Thus the compatibility interval (confidence interval) is the set of all values of the true effect that are compatible with the data in the sense of not rejecting a hypothesis.
Consider the examples from here. Suppose that the primary endpoint is systolic blood pressure (SBP) and that the effect of treatment B in comparison with treatment A is summarized by the B-A difference in mean SBP. Suppose that treatment A had a sample mean SBP of 134 mmHg and treatment B had 130 mmHg so that the observed B-A difference is -4 mmHg. Suppose that the two-sided p-value for this comparison is 0.4, and the 0.95 compatibility interval is [-13, 5].
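As a rough consistency check on these example numbers, the sketch below backs out the standard error implied by the interval and recomputes the two-sided p-value. It assumes a normal approximation (a z-test, where a real analysis would more likely use a t-test), and the function name `implied_se_and_p` is hypothetical and introduced only for illustration.

```python
from statistics import NormalDist

def implied_se_and_p(estimate, lower, upper, level=0.95):
    """Back out the standard error implied by a symmetric interval and
    return (se, two-sided p-value for H0: true difference = 0).
    Assumes a normal approximation, an illustrative simplification."""
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 0.95
    se = (upper - lower) / (2 * z_crit)                  # half-width / z_crit
    z = estimate / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return se, p

# Example values from the text: B-A difference of -4 mmHg, interval [-13, 5]
se, p = implied_se_and_p(-4, -13, 5)
print(f"implied SE = {se:.2f} mmHg, two-sided p = {p:.2f}")
```

Under these assumptions the implied standard error is about 4.6 mmHg and the p-value is a little under 0.4, in line with the rounded value quoted for the example.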
The money was spent.
The presumption of no difference is not yet overcome by data.
Treatment B does not improve SBP when compared to A (p=0.4)
Treatment B was not significantly different from treatment A (p=0.4). (If this statement allowed the reader to conclude anything, why did the study need to randomize more than 2 patients?)
We were unable to find evidence against the hypothesis that A=B (p=0.4) with the current sample size. More data will be needed. As the statistical analysis plan specified a frequentist approach, the study did not provide evidence of similarity of A and B (but see the compatibility interval).
Assuming the study’s experimental design and sampling scheme, the probability is 0.4 that another study would yield a test statistic for comparing two means that is more impressive than what we observed in our study, if treatment B had exactly the same true mean SBP as treatment A. (This statement is best at pointing out the limitations of how p-values can be interpreted.)
The study did not contradict the supposition that treatments A and B yield the same mean SBP (p=0.4).
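The verbose p-value statement above can be made concrete with a small simulation. This is a minimal sketch, assuming a z statistic that is standard normal when A and B have the same true mean, a two-sided reading of “more impressive”, and a standard error of 4.59 mmHg (the value implied by the example interval under a normal approximation). The function name `simulated_p_value` is hypothetical.

```python
import random
from statistics import NormalDist

def simulated_p_value(observed_z, n_studies=100_000, seed=1):
    """Fraction of hypothetical replicate studies, generated with no true
    difference between A and B, whose z statistic is at least as far from
    zero as the observed one."""
    rng = random.Random(seed)
    extreme = sum(abs(rng.gauss(0.0, 1.0)) >= abs(observed_z)
                  for _ in range(n_studies))
    return extreme / n_studies

observed_z = -4 / 4.59  # estimate / assumed standard error
print(simulated_p_value(observed_z))                 # simulation-based
print(2 * (1 - NormalDist().cdf(abs(observed_z))))   # analytic counterpart
```

Both numbers should land close to 0.4. Note that the calculation conditions on A=B throughout, which is why it says nothing directly about the probability that A=B.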
There is a 0.95 chance that the true difference in mean SBP is between -13 and +5.
The observed B-A difference in means was -4mmHg with a 0.95 confidence interval of [-13, 5]. If this study could be indefinitely replicated and the same approach used to compute a confidence interval each time, 0.95 of such varying confidence intervals would contain the unknown true difference in means. Based on the current study, the probability that the true difference is within [-13, 5] is either zero or one, i.e., we cannot interpret the interval except for its properties in the long run.
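The long-run property described above can be illustrated by simulation. The sketch below assumes the true difference and the standard error are known (-4 and 4.59 mmHg, both chosen only to mirror the example) and that each replicate study's estimate is normally distributed. `long_run_coverage` is a hypothetical name.

```python
import random
from statistics import NormalDist

def long_run_coverage(true_diff=-4.0, se=4.59, level=0.95,
                      n_replications=20_000, seed=2):
    """Simulate many replicate studies and return the fraction whose
    interval contains the true difference. Each replicate's estimate is
    drawn from Normal(true_diff, se); both values are illustrative."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    covered = 0
    for _ in range(n_replications):
        estimate = rng.gauss(true_diff, se)
        lower, upper = estimate - z_crit * se, estimate + z_crit * se
        covered += lower <= true_diff <= upper  # True counts as 1
    return covered / n_replications

print(long_run_coverage())  # close to 0.95 across replications
```

Each simulated interval either contains the true difference or it does not. The 0.95 describes the procedure over many replications, not any one interval.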
Other good language, paraphrased from Richard Morey:
If the true SBP mean difference < -13, we’d almost never (0.025 of the time) expect a difference as large as we observed. Thus differences < -13 are contradicted by the data. If the true difference > 5, we’d almost never (0.025 of the time) expect a difference as small as we observed. Thus differences > 5 are contradicted by the data.
The data are consistent with a true difference of means between -13 and 5. (This statement is accurate in the sense that a 0.95 compatibility interval is defined as the set of all values of the true mean difference that if null-hypothesized would not be rejected at the 0.05 level.)
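The definition of a compatibility interval as the set of non-rejected hypothesized values can be sketched by brute-force test inversion. The code assumes a z-test with a standard error of 4.59 mmHg (implied by the example interval under a normal approximation) and scans an arbitrary grid from -30 to 30 mmHg in steps of 0.01. The function names are hypothetical.

```python
from statistics import NormalDist

def two_sided_p(estimate, se, null_value):
    """Two-sided p-value for H0: true difference = null_value (z-test)."""
    z = (estimate - null_value) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def compatibility_interval_by_inversion(estimate, se, alpha=0.05):
    """Keep every hypothesized difference on the grid that is not rejected
    at level alpha; return the smallest and largest survivors."""
    grid = [i / 100 for i in range(-3000, 3001)]  # assumed search range
    kept = [h for h in grid if two_sided_p(estimate, se, h) >= alpha]
    return min(kept), max(kept)

estimate, se = -4.0, 4.59  # example estimate and assumed standard error
low, high = compatibility_interval_by_inversion(estimate, se)
print(round(low, 1), round(high, 1))

# Tail probability behind the paraphrased statement: if the true difference
# were -13, how often would the estimate be -4 or larger?
tail = 1 - NormalDist(mu=-13, sigma=se).cdf(estimate)
print(round(tail, 3))
```

The surviving values span roughly -13 to 5, and the tail probability at the lower limit is about 0.025, matching the two statements above.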
Compatibility intervals do not depend on single null hypotheses, and are interpretable no matter how small or large the p-value is. In a more “positive” study the compatibility interval may be shifted toward more negative (beneficial) B-A differences. But the above interpretations stand. What wording should be used pertaining to the p-value? Some examples are below.
Treatment B is better than treatment A (p=0.02). (That is true for the sample data, and may be true in general, but p-values don’t prove anything)
There was a statistically significant lowering of SBP by treatment B (p=0.02).
The study found evidence against the hypothesis that A=B (p=0.02). (It is not exactly correct to say that the study provided evidence in favor of positive efficacy, but that would be a slight stretch of what frequentist tests do.)
Assuming the study’s experimental design and sampling scheme, the probability is 0.02 that another study would yield a test statistic for comparing two means that is more impressive than what we observed in our study, if treatment B had exactly the same true mean SBP as treatment A.
The safest way to state results is to remove positive/negative judgments and leave the decision to the reader or someone else who possesses a greater totality of evidence or who wants to insert their own utility/loss/cost function into their decisions regarding medical treatments. So we start with that in the next subsection. But that will not provide medical journals the headlines they so vigorously seek.
Treatment B was observed in our sample of n subjects to have a 4mmHg lower mean SBP than treatment A with a 0.95 2-sided compatibility interval of [-13, 5]. If absolutely necessary to state a p-value: … compatibility interval of [-13, 5], p=0.4 without declaring significance or insignificance. The compatibility interval indicates a wide range of plausible true treatment effects.
Treatment B was observed in our sample of n subjects to have a 4mmHg lower mean SBP than treatment A with a 0.95 2-sided compatibility interval of [-13, 5], indicating a wide range of plausible true treatment effects. The degree of evidence against the null hypothesis that the treatments are interchangeable is p=0.4. The smaller the p-value the greater the evidence against the null.
Note: Had the p-value been “small” and the compatibility interval been [-10, -1], the phrase “indicating a wide range of plausible true treatment effects” could remain. Had the compatibility interval been [-8, -6] one could say “indicating a narrow range for plausible true treatment effects in the absence of extra-study information”. One could defer all conclusions about the compatibility limits, which will be unsatisfying to reviewers and editors.
Note: To help readers avoid “absence of evidence is not evidence of absence” errors, one may consider adding, e.g. if the compatibility limits (CL) are [-10, 2], "The data are consistent with a hypothesized mean SBP increase of 2mmHg with treatment B all the way to a hypothesized 10mmHg decrease, where “consistent with” is taken to mean “not rejected at the arbitrary 0.05 level”" (all quoted text is intended to appear in the paper).
Since p-values are less informative than compatibility intervals, and unlike p-values compatibility intervals are useful in bringing evidence in favor of small effects (or non-inferiority), p-values may often be dispensed with when frequentist analysis is used. Compatibility intervals must be understood as having only long-term operating characteristics, and wording of their interpretations should be accurate if interpretations are not left to the reader. If the authors or editor insist on inclusion of p-values, interpretations should be accurately worded or left to the reader. Large p-values are more often misinterpreted than small p-values.
- Confidence Intervals without Your Collaborator’s Tears by Michael Höhle
- Five statistical fax pas by Andy Field
- Against inferential statistics: How and why current statistics teaching gets it wrong by Patrick White and Stephen Gorard