Updates to NEJM Statistical Guidelines for Authors

The highly revered statistical consultants to the New England Journal of Medicine, in conjunction with senior medical leaders, have revised the guidelines for statistical reporting in the journal. The new guidelines may be seen here. A major emphasis of the revision is moving away from bright-line criteria for deeming a treatment to be effective or ineffective, consistent with the American Statistical Association’s initiative. This is most welcome. The NEJM’s new emphasis on confidence intervals and its openness to Bayesian methods are also welcome, as is the continued emphasis on pre-specification.

A particularly laudable statement is “the notion that a treatment is effective for a particular outcome if P<0.05 and ineffective if that threshold is not reached is a reductionist view of medicine that does not always reflect reality.” It remains to be seen, however, whether medical writers at NEJM will continue to override statisticians by insisting that abstracts be reworded in an attempt to salvage value from a clinical trial when P>0.05. NEJM has routinely made the “absence of evidence is not evidence of absence” error, using wording such as “treatment X did not benefit patients with …” in these situations, as if it were shameful to admit that we still don’t know whether the treatment is beneficial. The increased emphasis on confidence intervals will hopefully help, even if the revered statistical advisors are overruled by medical writers.

The guidelines, as in the past, place much emphasis on multiplicity adjustments. Multiplicity adjustments are notoriously ad hoc, because recipes for these adjustments do not follow from statistical theory and the adjustments are arbitrary. More importantly, multiplicity adjustments do not recognize that many studies attempt to provide marginal information, i.e., to answer a series of separate questions that may be interpreted separately. As eloquently described by Cook and Farewell, a very reasonable approach is to have a strong pre-specified priority ordering for hypotheses, which also specifies a reporting order and leads to context preservation. It should be fine to report the 11th endpoint in a study in a paper in which the results of the 10 previous endpoints are also summarized. I would have liked to see NEJM recognize the validity of the Cook and Farewell approach.

The guidelines do not recognize that type I error is far from the chance that a positive interpretation of a study is wrong. It is merely the chance of interpreting a finding as “positive” if indeed the treatment has exactly zero effect and cannot harm patients. The statement that “Clinicians and regulatory agencies must make decisions about which treatment to use or to allow to be marketed, and P values interpreted by reliably calculated thresholds subjected to appropriate adjustments have a role in those decisions” is not backed up by theory or evidence. The P value is a probability about data if the treatment is ineffective, not a probability about the true treatment effect.
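To make the distinction concrete, here is a minimal sketch of the usual back-of-envelope calculation. It assumes a deliberately oversimplified two-point world (the treatment either has exactly zero effect or has the effect the study was powered for). The function name and the prior probabilities are hypothetical choices for illustration, not anything taken from the guidelines.

```python
def prob_null_given_positive(alpha, power, prior_prob_effect):
    """P(treatment has zero effect | study declared 'positive'), by Bayes' rule.

    Assumes only two possibilities: exactly zero effect, or the effect size
    the study was powered to detect. This is an illustrative simplification.
    """
    false_pos = alpha * (1.0 - prior_prob_effect)   # null true, test rejects
    true_pos = power * prior_prob_effect            # effect real, test rejects
    return false_pos / (false_pos + true_pos)


# Same type I error (0.05) and power (0.80), two hypothetical priors.
for prior in (0.5, 0.1):
    risk = prob_null_given_positive(alpha=0.05, power=0.80, prior_prob_effect=prior)
    print(f"prior P(effect) = {prior:.1f}: P(no effect | positive) = {risk:.3f}")
```

With the type I error held at 0.05 in both cases, the chance that a positive result is wrong still moves with the prior and the power, which is why it cannot be read off the type I error alone.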

Delving deeper into the revised guidelines, I have a few other concerns:

  • The excellent advice on handling missing data did not include the more unified full Bayesian modeling approach.
  • The guideline states that “confidence intervals should be adjusted to match any adjustment made to significance levels”. This is problematic. The precision of estimating A should not always depend on whether B was estimated.
  • The guidelines fail to state that non-inferiority margins must be approved by patients. Silly straw-man large non-inferiority margins are appearing in the journal routinely.
  • Reporting of P-values “only until the last comparison for which the P value was statistically significant” is silly. Researchers should not be penalized for trying to answer more questions, as long as the priority ordering and context are rigidly maintained.
  • The guidelines condone the use of subgroup analysis. Such non-covariate-adjusted analyses, which invite the use of improper subgroups (e.g., “older patients”), are highly problematic and statistically non-competitive. The guidelines imply that forest plots should be based on actual subgroups and not derived from a unified statistical model. The guidelines go so far as to say “A list of P values for treatment by subgroup interactions is subject to the problems of multiplicity and has limited value for inference. Therefore, in most cases, no P values for interaction should be provided in the forest plots.” Where did this notion come from? The interaction test is the only thing that is safe in a subgroup assessment, the only thing that exposes the difficulty of the task. Why would we worry about validity of multiplicity adjustments here and not in overall study analyses?
  • The statement that “odds ratios should be avoided, as they may overestimate the relative risks in many settings and may be misinterpreted” is terribly misguided, failing to recognize the failure of relative risks to be transportable outside the clinical trial sample.
  • The guidance related to cautionary interpretation of observational studies dwells on multiplicity adjustment instead of bias, unmeasured confounders, sensitivity analysis, and the use of skeptical priors in Bayesian modeling.
  • “When the analysis depends on the confounders being balanced by exposure group, differences between groups should be summarized with point estimates and 95% confidence limits when appropriate” fails to appreciate that statistical inference regarding imbalances is not the appropriate criterion, and that in observational studies one should adjust for potential imbalances, not just statistically impressive ones. The guidelines need to clarify that for randomized studies, pre-specified covariate adjustment should be favored, and the list of pre-specified covariates has nothing to do with balance.
  • “Complex models and their diagnostics can often be best described in a supplementary appendix.” I agree about the diagnostics part, but complex models are frequently best and should be featured front and center in the paper.
  • Dichotomania is rampant in NEJM (and all other medical journals). The statistical guidelines need to take a strong stand against losing information in continuous and ordinal variables and against the arbitrariness and increased sample sizes necessitated by the use of thresholds instead of analyzing continuous relationships.
  • The statistical guidelines lost an opportunity to criticize the use of unadjusted analysis as the primary analysis, and to discuss the cost, time, and ethical implications of avoiding analysis of covariance in randomized trials, which requires a smaller sample size than an unadjusted analysis to achieve the same power (see Chapter 13 here).
  • The guidelines lost an opportunity to steer researchers away from problematic change from baseline measures (see Section 14.4 here).
  • The guidelines lost an opportunity to alert authors to problems caused by stratifying Table 1 by treatment.
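
The point above about analysis of covariance can be illustrated with a small simulation. This is only a sketch under assumptions of my own: a single prognostic baseline covariate, normal errors, a normal critical value of 1.96 standing in for the t quantile, and hypothetical values for the effect size, covariate strength, and sample size.

```python
import random
from math import sqrt
from statistics import fmean


def sample_var(v):
    """Unbiased sample variance."""
    m = fmean(v)
    return sum((vi - m) ** 2 for vi in v) / (len(v) - 1)


def residualize(v, x):
    """Residuals from a simple linear regression of v on x (with intercept)."""
    mx, mv = fmean(x), fmean(v)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (vi - mv) for xi, vi in zip(x, v)) / sxx
    return [vi - mv - slope * (xi - mx) for xi, vi in zip(x, v)]


def z_unadjusted(y, t):
    """Difference in means divided by its standard error (ignores covariate)."""
    y1 = [yi for yi, ti in zip(y, t) if ti == 1]
    y0 = [yi for yi, ti in zip(y, t) if ti == 0]
    se = sqrt(sample_var(y1) / len(y1) + sample_var(y0) / len(y0))
    return (fmean(y1) - fmean(y0)) / se


def z_adjusted(y, t, x):
    """Treatment coefficient / SE from the linear model y ~ 1 + t + x.

    Uses the Frisch-Waugh-Lovell result: partial x out of both y and t,
    then regress the y-residuals on the t-residuals.
    """
    ry, rt = residualize(y, x), residualize(t, x)
    stt = sum(r * r for r in rt)
    b = sum(a * c for a, c in zip(ry, rt)) / stt
    rss = sum((a - b * c) ** 2 for a, c in zip(ry, rt))
    se = sqrt(rss / (len(y) - 3) / stt)   # 3 parameters: intercept, t, x
    return b / se


def simulate_power(n_per_arm=50, effect=0.5, covariate_sd=1.5,
                   n_sims=1000, seed=1):
    """Fraction of simulated trials with |z| > 1.96, unadjusted vs adjusted."""
    rng = random.Random(seed)
    t = [1] * n_per_arm + [0] * n_per_arm
    hits_unadj = hits_adj = 0
    for _ in range(n_sims):
        x = [rng.gauss(0, covariate_sd) for _ in t]          # baseline covariate
        y = [effect * ti + xi + rng.gauss(0, 1) for ti, xi in zip(t, x)]
        hits_unadj += abs(z_unadjusted(y, t)) > 1.96
        hits_adj += abs(z_adjusted(y, t, x)) > 1.96
    return hits_unadj / n_sims, hits_adj / n_sims


power_unadj, power_adj = simulate_power()
print(f"unadjusted power: {power_unadj:.2f}, covariate-adjusted power: {power_adj:.2f}")
```

When the covariate explains a good share of the outcome variance, as assumed here, the adjusted analysis has noticeably higher power at the same sample size, which is the other side of needing fewer patients for the same power.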

@albest3 alerted me to the statistical guidelines for urology journals (see his note below) by Assel et al. These guidelines in my view are far better than the NEJM statistical guidelines. They are more comprehensive and more actionable, at least within the domain of frequentist statistics. They better foster good statistical practice.

Links to Other Discussions


That’s got to be the word of the day. It feels like they have been saying these things forever. It’s so perplexing to me, like one of those mad films where everything is repeated anew…


Thanks, Frank, for digging in & offering these perspectives!


I always learn something from your posts on statistical issues.

Question regarding multiplicity adjustments – in what scenario would you insist that they be done from a frequentist POV?

If I understand this (and some of your prior writings) on the topic, it seems like they would be mandatory for post hoc hypotheses after data collection, but not necessary from a priori specified hypotheses.

Would it also be accurate to assume that in the circumstances where multiplicity adjustments are made to hypothesis tests, the corresponding confidence/compatibility intervals should also be adjusted?

Thanks for this helpful commentary!

It seems confusing how the guidelines create something of a dichotomy between when to use and when not to use p-values (i.e., use them when adjusted for multiplicity/false discovery rate but not otherwise; in hierarchical testing, report them while significant but not thereafter).

While I understand the much-needed movement away from p-values as an arbitrary cutoff for statistical significance, I’m not so sure that creating “rules” for when to use and not use them, as the guidelines do, will solve the problem. Is there not merit to increased attention on how they are taught, used, and interpreted in conjunction with CIs, during both study design and analysis? To me, it seems like the effort of identifying situations where one should (or should not) report p-values detracts from the overall need for a better understanding of what a p-value is and how it should be considered in medical decision making.

Worth emphasising: this recent paper distinguishes between observational data and clinical trials: There is still a place for significance testing in clinical trials

“the more modest proposal of dropping the concept of ‘statistical significance’ when conducting statistical tests could make things worse” etc.


P values in clinical trials can always be justified (and calculated) by permutation tests based upon the design. Wasn’t that Fisher’s key justification for the randomization procedure?

The same cannot be said for observational studies. I’m not sure how big of an error this is, however. At least in simulations, the t-test results generally match the permutation test results.
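
For anyone who wants to try this, here is a minimal sketch of a design-based exact permutation test for a two-arm trial with simple randomization. The outcome values are made up for illustration, and the function name is my own. Comparing against a t-test would need something like `scipy.stats.ttest_ind`, which I have left out to keep the example standard-library only.

```python
from itertools import combinations
from statistics import fmean


def exact_permutation_pvalue(treated, control):
    """Two-sided exact permutation p-value for a difference in means.

    Enumerates every way of assigning len(treated) of the pooled outcomes
    to the treated arm, i.e. every allocation the randomization could
    have produced, and counts those at least as extreme as the observed one.
    """
    pooled = list(treated) + list(control)
    n_t = len(treated)
    n_c = len(pooled) - n_t
    total = sum(pooled)
    observed = abs(fmean(treated) - fmean(control))
    extreme = count = 0
    for idx in combinations(range(len(pooled)), n_t):
        sum_t = sum(pooled[i] for i in idx)
        diff = abs(sum_t / n_t - (total - sum_t) / n_c)
        extreme += diff >= observed - 1e-12   # tolerance for float rounding
        count += 1
    return extreme / count


# Hypothetical outcomes, 6 patients per arm (924 possible allocations).
treated = [5.1, 6.3, 7.0, 6.8, 5.9, 7.4]
control = [4.8, 5.0, 6.1, 5.2, 4.4, 5.6]
print(f"exact permutation p-value: {exact_permutation_pvalue(treated, control):.4f}")
```

Full enumeration is only practical for small trials. For larger ones, a random sample of re-allocations gives an approximate permutation p-value.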

Addendum: I re-read the first paper I posted by Berger, Lunneborg, Ernst, and Levene who argue that RCTs should be examined using permutation tests. I find their reasons for permutation techniques (as well as approximate permutation methods) in this context unassailable and logically correct.

If credibility for the parametric test derives from assurances that its p-value will likely be close to the corresponding exact one, then this is tantamount to an admission that the exact test is the gold standard (or, perhaps, the platinum standard). Approximate tests cannot be any more exact than the exact tests they are trying to approximate and, as approximations to the exact tests, are correct only to the extent that they agree with the exact test. A “heads I win, tails you lose” situation then arises, because if the parametric and exact tests lead to essentially the same inference, then this is as much an argument in favor of the exact test as it is for the parametric test, and there is no benefit to using the parametric test. If they do not agree, then the exact test needs to be used.


It seems to me that these guidelines get into all sorts of problems and inconsistencies that are fundamental issues related to the frequentist approach.

Correcting for multiple testing has never made any sense to me and adjusting confidence intervals for the number of tests seems equally arbitrary and meaningless.

It has been said many times before, but if one wishes to frame a question in the framework of null hypothesis testing, the primary question of interest is “what is the probability that the null is true given the data”. @f2harrell I know you are no fan of thinking in this way because “apart from with homeopathy the null hypothesis is never true”. But I think it is a useful way of framing a question.

Approximate Bayes approaches applied to a specified prior at least enable an approximate probabilistic answer to that question, and in my experience can be understood by the non-specialist statistician (at least it makes sense). Wakefield’s Bayes False Discovery Probability is easy to calculate from standard frequentist stats. Such an approach helps with multiple tests and subgroup analyses (by definition, the prior for these hypotheses has to be smaller than that for the primary endpoint, for which there ought to be equipoise). This approach also helps one to understand why a p of (say) 0.02 is less likely to replicate if produced by a small study than if produced by a large study.
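
As a rough illustration of how little is needed, here is a sketch of the Bayes False Discovery Probability computed from an estimate and its standard error. It assumes the estimate is approximately normal and that, under the alternative, the true effect has a normal prior centred on zero. The numbers (a log hazard ratio, its standard error, the prior SD, and the two prior probabilities of the null) are hypothetical.

```python
from math import exp, sqrt


def approx_bayes_factor(estimate, se, prior_sd):
    """Approximate Bayes factor for H0 (no effect) versus H1.

    Assumes estimate ~ N(theta, se^2) and, under H1, theta ~ N(0, prior_sd^2).
    Values below 1 favour H1; values above 1 favour H0.
    """
    v, w = se ** 2, prior_sd ** 2
    z = estimate / se
    return sqrt((v + w) / v) * exp(-0.5 * z ** 2 * w / (v + w))


def bfdp(estimate, se, prior_sd, prior_prob_null):
    """Bayes False Discovery Probability: P(H0 | data) under the assumptions above."""
    prior_odds = prior_prob_null / (1.0 - prior_prob_null)
    post_odds = approx_bayes_factor(estimate, se, prior_sd) * prior_odds
    return post_odds / (1.0 + post_odds)


# Hypothetical result: log hazard ratio -0.35, SE 0.15 (two-sided p about 0.02).
est, se, prior_sd = -0.35, 0.15, 0.35
for label, pi0 in (("primary endpoint (equipoise)", 0.5),
                   ("subgroup hypothesis", 0.9)):
    print(f"{label}: prior P(null) = {pi0}, BFDP = {bfdp(est, se, prior_sd, pi0):.2f}")
```

The same estimate and standard error give a much higher probability of a false discovery when the prior probability of a real effect is low, which is how this approach handles subgroup and secondary hypotheses without an ad hoc multiplicity correction.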


I would say that if doing frequentist analysis and the questions are not marginal, then multiplicity adjustment is called for, e.g. when making statements such as “there exists an endpoint affected by this treatment” or “there exists a subgroup for which this treatment is effective”.

I’m less sure that multiplicity adjustments should apply equally to compatibility/confidence intervals, especially when users are not falling into the trap of just caring whether or not the interval includes the null.


I would love to get @Stephen’s take on this. He’s an expert on this history.


Stephen Senn has a great talk titled ‘70 Years Old and Still Here: The Randomized Clinical Trial and its Critics’. This talk should have much more than 99 views! https://www.youtube.com/watch?v=uTrVhIwy8e8


I’ll definitely watch this later. I always learn something from Stephen Senn, even when I suspect there is room for educated disagreement.

Suffice it to say, the misunderstanding of randomization from a decision theory POV, along with a categorical interpretation of “evidence hierarchies”, has IMO done more harm than good for the typical consumer of the research literature, to put it mildly.

I’d hope that so-called meta-science can start looking at things from the point of view of statistical decision theory, and we can have more constructive debates about what decisions are supported by evidence.

Addendum: Everyone should watch that Senn video on randomization! Still, I think he picked on weak critics of randomization.

But perhaps that deserves a separate topic.


Have you any references for tougher critics of randomization? Please share as I’d like to read their arguments.

It is more of a normative and theoretical vs. applied and practical issue. A Bayesian actor does not need randomization from a strict decision theory POV, if the experiment is for his/her own learning.

That doesn’t mean it is wrong to randomize, only that the experiment will be less efficient than a Bayesian optimal one. Frequentists will counter that the “optimal” experiment may not be derivable from our current knowledge.

And there are a number of practical Bayesians who defend the use of randomization as a measure to reduce the effect of the prior on the posterior – a type of robustness measure, in case the prior was in error somehow.

Senn has written “Block what you can, randomize what you cannot.” He also is not a fan of minimization, to put it mildly.

Randomization has more justification when the experiment is designed to persuade a skeptical decision maker. (Although there are some strong arguments that it isn’t needed even in this scenario). At least in human experimentation, it is a credible way to control bias.

I’ve studied this issue because, at least in my field, neither I nor my colleagues will be able to run studies the size of even a modest pharmaceutical study. A larger, observational study is much more likely.

Randomization in small samples (i.e., fewer than 200 per arm) leads to excess heterogeneity, making extraction of information from the few repeatable, small-sample experiments more difficult, and making those experiments less likely to be included in any research syntheses.

In terms of “evidence hierarchies” there is a common overvaluation of randomization, and undervaluation of observational evidence. There have been very interesting, Bayesian decision theory derived clinical trials that have been conducted without randomization.

There is also a design known as minimization, with algorithms derived by Taves, Pocock, etc. in the 70’s. But less than 2% of clinical trials use minimization.

This ultimately comes down to the economic value of the information, which is seldom formally accounted for in a research program at an aggregate/policy level.

That is an informal write up ATM. I think this is best in a separate thread that I might post over the weekend.

I would also be interested in your take on https://www.ncbi.nlm.nih.gov/pubmed/30537407 “Guidelines for reporting of statistics for clinical research in urology”
… which seems to me to be much more readable and practical.


Wow. Those guidelines are fantastic. I added a link to them in the Vanderbilt Checklist for Authors. These guidelines are much more actionable than the NEJM guidelines in my view, and do much more to foster good statistical practice (frequentist practice, at least). Here is the full article, which thankfully is open access.


Dear Frank, could you comment on 3.4?

Investigators often interpret confidence intervals in terms of hypotheses. For instance, investigators might claim that there is a statistically significant difference between groups because the 95% confidence interval for the odds ratio excludes 1. Such claims are problematic because confidence intervals are concerned with estimation, not inference.

I don’t get that, assuming my understanding is right that the construction of confidence intervals is the inverse of hypothesis tests (a (1-a)-CI is defined as the interval of hypotheses not being rejected at p<a for the given data).

I do understand the second argument that, due to using inexact inversions, the limits of the CI may not match the significance level of a test exactly. In my opinion, the given example (CI starts at 1.03 but p = 0.07) is not really dramatic, as it is a borderline case anyway. Here, test and CI were calculated based on different methods, so it’s not unexpected to get different results. The main concern here should be why different methods are employed at all. I think that it would be obviously strange if one were to report a p-value for a mean difference based on a t-test but calculate the CI based on the z-distribution (particularly for smaller samples).


Great question Jochen, and one that I hope those who have read the literature on this better than I will respond to. As you said you can define a confidence interval as the set of all parameter values that if null hypothesized would not be rejected at say the 0.05 level. But since we are moving away from a p < 0.05 world, much has been written about why one should avoid looking to see if the confidence interval includes the null. What exactly we are to do with confidence intervals is left a bit unspecified, especially since they do not have a ready interpretation for the experiment at hand but have only long-run operating characteristics over an infinite sequence of identical experiments.

In addition to avoiding confidence intervals as restated hypothesis tests, there are specific confidence interval fallacies, and I think Sander Greenland wrote a definitive paper on this, besides this one. This may be of interest, and especially this.

A major limitation of confidence intervals is that the limits are not controllable by the analyst. Most researchers will like to know things like the following. What is the probability that the treatment effect is between 0 and 7? A Bayesian posterior probability of [0,7] is easily calculated, but there is no way to get a “confidence level” for [0,7]. Instead one specifies the coverage probability and gets different limits as a solution.
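
As a minimal sketch of the [0,7] calculation, assume a normal likelihood for the treatment effect estimate and a normal prior, so the posterior is also normal. The estimate, its standard error, and the prior parameters below are hypothetical, and a real analysis would usually involve a fuller model.

```python
from statistics import NormalDist


def normal_posterior(estimate, se, prior_mean=0.0, prior_sd=10.0):
    """Posterior for the treatment effect under a normal-normal model.

    Assumes estimate ~ N(theta, se^2) and prior theta ~ N(prior_mean, prior_sd^2).
    Returns the posterior as a NormalDist (precision-weighted combination).
    """
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = 1.0 / se ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * estimate)
    return NormalDist(post_mean, post_var ** 0.5)


# Hypothetical trial result: estimated effect 4.0 with standard error 2.5.
post = normal_posterior(estimate=4.0, se=2.5)
prob_0_to_7 = post.cdf(7.0) - post.cdf(0.0)
print(f"posterior mean {post.mean:.2f}, SD {post.stdev:.2f}")
print(f"P(0 < effect < 7 | data) = {prob_0_to_7:.3f}")
```

Here the analyst chooses the interval and reads off its probability, whereas with a confidence interval one chooses the coverage and has to accept whatever limits result.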


I also found this blog post by Richard Morey to be a very thought-provoking supplement to the article and slides you linked to.


I can’t argue with his logic, but I would say that emphasizing the CI minimizes the lazy approach I see too often in my field: scan the abstract or paper for reports of hypothesis test results, ignore the actual estimate, and then draw grand conclusions (i.e., the absence of evidence fallacy).
