The highly revered statistical consultants to the New England Journal of Medicine, in conjunction with senior medical leaders, have revised the journal's guidelines for statistical reporting. The new guidelines may be seen here. A major emphasis of the revision is a move away from bright-line criteria for deeming a treatment effective or ineffective, consistent with the American Statistical Association's initiative. This is most welcome. The NEJM's new emphasis on confidence intervals and its openness to Bayesian methods are also welcome, as is the continued emphasis on pre-specification.
A particularly laudable statement is that “the notion that a treatment is effective for a particular outcome if P<0.05 and ineffective if that threshold is not reached is a reductionist view of medicine that does not always reflect reality.” It remains to be seen, however, whether medical writers at NEJM will continue to override statisticians by insisting that abstracts be re-worded in an attempt to salvage value from a clinical trial when P>0.05. NEJM has routinely made the absence-of-evidence-is-not-evidence-of-absence error, using wording such as “treatment X did not benefit patients with …” in these situations, as if it were a shame to admit that we still don't know whether the treatment is beneficial. The increased emphasis on confidence intervals will hopefully help, even if the revered statistical advisors are overruled by medical writers.
The guidelines, as in the past, place much emphasis on multiplicity adjustments. Multiplicity adjustments are notoriously ad hoc: recipes for them do not follow from statistical theory, and the adjustments themselves are arbitrary. More importantly, multiplicity adjustments do not recognize that many studies attempt to provide marginal information, i.e., to answer a series of separate questions that may be interpreted separately. As eloquently described by Cook and Farewell, a very reasonable approach is to have a strong pre-specified priority ordering for hypotheses, which also specifies a reporting order and preserves context. It should be fine to report the 11th endpoint in a study in a paper in which the results of the 10 previous endpoints are also summarized. I would have liked to see NEJM recognize the validity of the Cook and Farewell approach.
The guidelines do not recognize that the type I error is far from the chance that a positive interpretation of a study is wrong. It is merely the chance of interpreting a finding as “positive” if the treatment in fact has exactly zero effect and cannot harm patients. The statement that “Clinicians and regulatory agencies must make decisions about which treatment to use or to allow to be marketed, and P values interpreted by reliably calculated thresholds subjected to appropriate adjustments have a role in those decisions” is not backed up by theory or evidence. The P value is a probability about the data under the assumption that the treatment is ineffective, not a probability about the true treatment effect.
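A minimal simulation sketch makes the distinction concrete. All quantities below (the prior probability that a tested treatment truly works, the effect size, and the per-arm sample size) are illustrative assumptions, not figures from the guidelines: the chance that a “positive” finding is wrong depends heavily on that prior, while the type I error stays near α.

```python
# Sketch: type I error (alpha) vs. the chance that a "positive" finding is wrong.
# The prior probability of a real effect, the effect size, and n are all
# illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_trials, n_per_arm, effect = 100_000, 64, 0.5
prior_effective = 0.1  # assume only 10% of tested treatments truly work

truly_effective = rng.random(n_trials) < prior_effective
deltas = np.where(truly_effective, effect, 0.0)

# Simulate two-arm trials and run a two-sample t-test on each
x = rng.normal(0.0, 1.0, (n_trials, n_per_arm))
y = rng.normal(deltas[:, None], 1.0, (n_trials, n_per_arm))
_, p = stats.ttest_ind(y, x, axis=1)

positive = p < 0.05
print("P(positive | no effect) =", positive[~truly_effective].mean())  # ~ 0.05
print("P(no effect | positive) =", (~truly_effective)[positive].mean())  # far larger
```

With these assumed inputs roughly a third of “positive” trials involve a treatment with zero effect, even though the type I error is exactly 0.05.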
Delving deeper into the revised guidelines, I have a few other concerns:
- The excellent advice on handling missing data did not include the more unified full Bayesian modeling approach.
- The guideline states that “confidence intervals should be adjusted to match any adjustment made to significance levels”. This is problematic: the precision of estimating A should not always depend on whether B was estimated (see the first sketch after this list).
- The guidelines fail to state that non-inferiority margins must be approved by patients. Silly, overly large straw-man non-inferiority margins appear routinely in the journal.
- Reporting P values “only until the last comparison for which the P value was statistically significant” is silly. Researchers should not be penalized for trying to answer more questions, as long as the priority ordering and context are rigidly maintained.
- The guidelines condone the use of subgroup analysis. Such non-covariate-adjusted analyses, which invite the use of improper subgroups (e.g., “older patients”), are highly problematic and statistically non-competitive. The guidelines imply that forest plots should be based on actual subgroups and not derived from a unified statistical model. They go so far as to say, "A list of P values for treatment by subgroup interactions is subject to the problems of multiplicity and has limited value for inference. Therefore, in most cases, no P values for interaction should be provided in the forest plots." Where did this notion come from? The interaction test is the only safe element of a subgroup assessment: it is the only thing that exposes the difficulty of the task (see the interaction sketch after this list). Why would we worry about the validity of multiplicity adjustments here and not in overall study analyses?
- The statement that “odds ratios should be avoided, as they may overestimate the relative risks in many settings and may be misinterpreted” is terribly misguided, failing to recognize that relative risks, unlike odds ratios, do not transport outside the clinical trial sample (see the arithmetic sketch after this list).
- The guidance related to cautionary interpretation of observational studies dwells on multiplicity adjustment instead of bias, unmeasured confounders, sensitivity analysis, and the use of skeptical priors in Bayesian modeling.
- “When the analysis depends on the confounders being balanced by exposure group, differences between groups should be summarized with point estimates and 95% confidence limits when appropriate” fails to appreciate that statistical inference regarding imbalances is not the appropriate criterion, and that in observational studies one should adjust for potential imbalances, not just statistically impressive ones. The guidelines need to clarify that for randomized studies, pre-specified covariate adjustment should be favored, and that the list of pre-specified covariates has nothing to do with balance.
- “Complex models and their diagnostics can often be best described in a supplementary appendix.” I agree about the diagnostics, but a complex model is frequently the best model and should be featured front and center in the paper.
- Dichotomania is rampant in NEJM (and in all other medical journals). The statistical guidelines need to take a strong stand against losing the information in continuous and ordinal variables, and against the arbitrariness and increased sample sizes caused by using thresholds instead of analyzing continuous relationships (see the power-loss sketch after this list).
- The statistical guidelines lost an opportunity to criticize the use of unadjusted analyses as primary analyses, and to discuss the cost, time, and ethical implications of avoiding analysis of covariance in randomized trials; covariate adjustment achieves the same power as an unadjusted analysis with a smaller sample size (see Chapter 13 here, and the sample-size sketch after this list).
- The guidelines lost an opportunity to steer researchers away from problematic change-from-baseline measures (see Section 14.4 here, and the variance sketch after this list).
- The guidelines lost an opportunity to alert authors to problems caused by stratifying Table 1 by treatment.
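On adjusted confidence intervals: the following minimal sketch (the standard error value is an arbitrary assumption) shows how a Bonferroni-adjusted interval for one endpoint widens solely because other endpoints were examined, even though the data for that endpoint are unchanged.

```python
# Sketch: a Bonferroni-adjusted confidence interval for endpoint A widens as
# more endpoints share the alpha, even though the data for A are unchanged.
from scipy import stats

se = 1.0  # assumed standard error of the estimate for endpoint A
for k in (1, 2, 5, 10):  # number of comparisons sharing alpha = 0.05
    z = stats.norm.ppf(1 - 0.05 / (2 * k))  # adjusted critical value
    print(f"{k:2d} comparisons: half-width = {z * se:.2f}")
# 1 comparison: 1.96; 10 comparisons: 2.81 -- same data, wider interval
```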
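On subgroup analysis: a minimal sketch of what a unified-model assessment looks like, using hypothetical data and variable names (treat, age). The treatment-by-covariate interaction term, not a collection of per-subgroup estimates, directly addresses whether the treatment effect varies.

```python
# Sketch: deriving subgroup evidence from one unified model rather than
# separate per-subgroup analyses. Data and variable names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 400
df = pd.DataFrame({
    "treat": rng.integers(0, 2, n),
    "age": rng.normal(60, 10, n),
})
df["y"] = 1.0 * df["treat"] + 0.05 * df["age"] + rng.normal(0, 2, n)  # no true interaction

# One model for everyone; the treat:age coefficient directly tests whether
# the treatment effect varies with age, instead of eyeballing "older vs. younger".
fit = smf.ols("y ~ treat * age", data=df).fit()
print(fit.summary().tables[1])  # the treat:age row is the interaction test
```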
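On odds ratios versus relative risks: a small arithmetic sketch (the trial and target-population risks are assumed for illustration) of why a relative risk estimated at one baseline risk may not transport to another population, while an odds ratio always maps to a valid risk.

```python
# Sketch: applying a trial's relative risk to a sicker population can imply
# an impossible risk; applying its odds ratio cannot. Risks are assumed.
def risk_from_or(p0, or_):
    odds = p0 / (1 - p0) * or_
    return odds / (1 + odds)

p0_trial, p1_trial = 0.20, 0.40  # control and treated risks in the trial
rr = p1_trial / p0_trial                                          # 2.0
or_ = (p1_trial / (1 - p1_trial)) / (p0_trial / (1 - p0_trial))   # 2.67

p0_new = 0.60  # assumed baseline risk in a sicker target population
print("RR-implied risk:", rr * p0_new)                # 1.20 -- impossible
print("OR-implied risk:", risk_from_or(p0_new, or_))  # 0.80 -- valid
```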
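On dichotomania: a minimal simulation sketch (effect size, threshold, and sample sizes are arbitrary assumptions) comparing the power of analyzing the continuous outcome directly with the power after thresholding that same outcome.

```python
# Sketch: information lost by thresholding a continuous outcome.
# Effect size, threshold, and n are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sim, n, delta, cut = 2000, 100, 0.4, 0.0
power_cont = power_dich = 0

for _ in range(n_sim):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(delta, 1.0, n)
    power_cont += stats.ttest_ind(a, b).pvalue < 0.05
    table = [[(a > cut).sum(), (a <= cut).sum()],
             [(b > cut).sum(), (b <= cut).sum()]]
    power_dich += stats.chi2_contingency(table)[1] < 0.05

print("power, continuous analysis :", power_cont / n_sim)  # ~ 0.8
print("power, dichotomized outcome:", power_dich / n_sim)  # ~ 0.6
```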
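On covariate adjustment: under the standard approximation that adjusting for a baseline covariate with correlation ρ multiplies the residual variance by (1 − ρ²), the required sample size shrinks by the same factor. The ρ and n values below are assumptions for illustration.

```python
# Sketch: approximate sample-size saving from covariate adjustment (ANCOVA)
# in a randomized trial. Residual variance after adjusting for a covariate
# with correlation rho is (1 - rho**2) times the unadjusted variance, so the
# same power needs proportionally fewer patients. rho and n are assumed.
rho = 0.6
n_unadjusted = 1000
n_adjusted = n_unadjusted * (1 - rho**2)
print(f"unadjusted n = {n_unadjusted}, ANCOVA n = {n_adjusted:.0f}")  # 640
```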
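On change from baseline: assuming equal baseline and follow-up variances σ² and correlation ρ, a change score has variance 2σ²(1 − ρ), while the ANCOVA residual variance is σ²(1 − ρ²). The sketch below tabulates both: ANCOVA never loses, and the change score is even noisier than ignoring baseline altogether when ρ < 0.5.

```python
# Sketch: variance of a change score vs. ANCOVA residual variance, assuming
# equal baseline/follow-up variance sigma**2 = 1 and correlation rho.
sigma2 = 1.0
for rho in (0.3, 0.5, 0.7, 0.9):
    var_change = 2 * sigma2 * (1 - rho)   # follow-up minus baseline
    var_ancova = sigma2 * (1 - rho**2)    # residual after adjusting for baseline
    print(f"rho={rho}: change-score var={var_change:.2f}, ANCOVA var={var_ancova:.2f}")
```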
@albest3 alerted me to the statistical guidelines for urology journals by Assel et al. (see his note below). These guidelines are, in my view, far better than the NEJM statistical guidelines: they are more comprehensive and more actionable, at least within the domain of frequentist statistics, and they better foster good statistical practice.
Links to Other Discussions
- Discussion by Deborah Mayo