Critical Appraisal Checklist for Statistical Methods

In medical and clinical education, we frequently employ “checklists” to help with critical appraisal of studies. Examples of such checklists are provided by numerous organizations, including BMJ, CASP, CEBM, JBI, and many others.

Here is an example of one from BMJ for RCTs:

One of the limitations of these checklists is the absence of any critical appraisal of statistical methods. I have searched in vain to find any “checklist” that would help students or clinicians with appraising statistical methods. In fact, as @ADAlthousePhD has suggested, this may be an impossible task without further education.

Is anyone aware of any tools or checklists that non-statisticians could employ to critically appraise statistical methods in a biomedical study? If a “checklist” is not the best approach, what alternatives would you suggest?


I think these exist and hope that others provide them here. At the Department of Biostatistics at Vanderbilt University we have tackled the easier problem of creating a checklist of statistical mistakes to avoid. It is here. Additions to that list are welcomed.


I’m surprised to see “Confidence intervals for key effects should always be included”.

A CI gives you a range of hypothetical population values that the data do not differ significantly from, right? For what kind of analysis is that useful information?

Or should I read the advice as: “Please include confidence intervals. They are numerically close enough to Bayesian credible intervals and may be interpreted as such”?



Nowadays I would word it this way: include the entire Bayesian posterior distribution, or at least Bayesian credible intervals. If these are not feasible, include confidence intervals (what Sander Greenland calls “compatibility intervals” - a term I really like). Even though compatibility intervals are indirect measures of uncertainty, they are still very useful and are definitely better than p-values. Many like standard errors, without reference to repeated sampling, as measures of precision, and confidence intervals are related to SEs for symmetric cases.
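For the symmetric case mentioned above, a small sketch (invented numbers; assumes NumPy and SciPy are available) shows why confidence and credible intervals can be read interchangeably in simple settings: for a normal mean with known SD and a flat prior, the 95% compatibility interval and the 95% posterior credible interval coincide.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sigma = 10.0                           # assume the SD is known, for simplicity
x = rng.normal(loc=50.0, scale=sigma, size=100)
xbar, n = x.mean(), x.size
se = sigma / np.sqrt(n)                # standard error of the mean

# 95% confidence ("compatibility") interval
z = stats.norm.ppf(0.975)
ci = (xbar - z * se, xbar + z * se)

# Under a flat prior and known sigma, the posterior for the mean is
# Normal(xbar, se^2); its 95% credible interval:
cred = stats.norm.ppf([0.025, 0.975], loc=xbar, scale=se)

print(ci, cred)  # numerically identical in this special case
```

With skewed likelihoods, informative priors, or small samples the two intervals diverge, which is exactly when the posterior (or at least the credible interval) carries more information.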

These issues are discussed in detail here.


Thank you Frank. Also for all the efforts to make this place thrive :bouquet:


Thanks Frank! This is definitely a great starting point.

I may try to use the references to create a simple checklist and post it here for feedback.


I’m sorry, Raj, I’m just a little confused by your question, because it seems that many such checklists do exist (EQUATOR network has them for basically every type of study), but perhaps the checklists do not cover quite what you’re hoping.

As I re-read your post, I think you are chasing something that will be very hard to produce - if you want a checklist that goes much deeper than high-level principles (items like those on the list you provided or those on Frank’s list, which is terrific and now bookmarked) it’s going to necessarily be much longer than a checklist.

Yup. Looking for something deeper, and agree it’s something hard to do - that’s why I’m hoping others can help!

There are three possible approaches:

  • Provide a list of errors to avoid (see above)
  • Attempt to provide a comprehensive list of best statistical practices (difficult if containing some level of detail, and will need perpetual updating, which web resources can assist with)
  • Provide a set of guiding principles that should be adhered to and can be fulfilled by a variety of statistical methods. Start with this.

Reporting guidelines have not accomplished any of this to date.


Is this what you are looking for?


Please note that the Equator Network guidelines (including SAMPL) are meant as reporting guidelines, not as methods for critical appraisal. That is, they (1) should be used by authors preparing manuscripts for submission so they can effectively communicate what they did, and (2) should be demanded by journal editors so they can better understand and review the manuscripts that are submitted. Various Equator Network guidelines exist for different study designs. They are highly recommended for the above two purposes. But they are not designed to guide critical appraisal of the literature. For that, the JAMA Users’ Guide and similar checklists from BMJ or CEBM should be referenced.

– Alexander.


As a complete outsider to statistics, I have made the appeal [on Twitter] that further education is needed to appraise statistical methods. Even statisticians [and physicians] can misinterpret statistical results. So I enthusiastically follow this thread. I don’t plan to become a statistician, but rather a very informed consumer of statistics and data science. Such education can be applied to many fields if pursued systematically and rigorously.


Raj, I found this forward-looking post. I see @ADAlthousePhD’s point also. I debated some of the points below with him on Twitter and I still do not know if we are in agreement.

I think there has to be a checklist of basic components for all: clinician-scientists, statisticians, and trialists. One that all can engage with and discuss; one which, for example, assures that the validity and reproducibility of the entry measurement (e.g., criteria), as well as any tools used for adjustment, have been checked in relation to both the population under test and the population to which the results will be generalized.

This is particularly true in critical care. I think few on Twitter understood why I was so concerned. My admonitions were not views which were widely held before the pandemic, or during the heady days of massive multicenter RCT application to consensus critical care syndromes like sepsis.

Steps for a massive RCT:

  1. Create a criteria set for a consensus syndrome (e.g., by Delphi).
  2. Apply an RCT to the syndrome using the set.
  3. Write a protocol based on the results of the RCT, generalized to all who meet the criteria.
  4. Promulgate the protocol as the standard of care for all meeting the criteria.

This paper shows the weakness of oversimplifying.

Protocol Failure Detection: The Conflation of Acute Respiratory Distress Syndrome, SARS-CoV-2 Pneumonia and Respiratory Dysfunction - PubMed

In this paper we show that the pandemic exposed a profound flaw in standard critical care RCT design and, more importantly, in trialists’ interpretation and generalization of the implications of their results in these large, high-quality randomized controlled trials. The cognitive error identified in this paper likely contributed to dangerous early intubation as a standard of care for COVID-19 early in the pandemic.

The cognitive error was to perceive, believe, and vigorously promulgate that the results of an RCT applied to a syndrome were applicable to a new disease because it “met the criteria” of that syndrome. This error is easy to see in retrospect, but why did such an obvious mistake happen?

Trialists were overconfident in the basics of RCT design and in the value of consensus “science”. They thought an RCT could be applied to a consensus syndrome as if the syndrome were a disease like diabetic ketoacidosis. There was a narrow perception of the meaning of the term “heterogeneity of treatment effects”. They did not understand that the choice of criteria for the syndrome under test may amplify treatment effect heterogeneity. Trialists were overconfident in the power of randomization and adjustments to overcome these issues.

These are the warnings I have been providing for the past decade. Trialists should read that paper at least twice and then figure out how to design a checklist which prevents this type of cognitive catastrophe from occurring again. The potential adverse consequences of this for the public are beyond discussion, but the point is that it is time to abandon political expediency and raise this issue front and center in the elite medical literature.


Hi Lawrence

Your paper is great. It asks whether doctors should treat patients with a novel disease that is manifesting as a previously established clinical syndrome the same way they have always treated the syndrome. For example, COVID is a novel disease that can cause ARDS (a long-recognized clinical syndrome). When the pandemic began, was it reasonable for doctors to extrapolate ventilation strategies that had been established through RCTs involving patients with non-COVID-related ARDS to patients with COVID-related ARDS?

Historically, part of the reason that we might assume that treatment efficacy is generalizable across many types of patients is that we’ve been taught that qualitative interactions between patients and treatments tend to be rare. If a treatment shows efficacy in an RCT, we consider it unlikely that there could be a sizeable subset of patients who might be harmed by the treatment (other than perhaps through rare immunologic or genetically-based idiosyncratic reactions). We’ve been much more willing to accept the possibility of quantitative interactions (i.e., the idea that the treatment might simply benefit some groups of patients more than others). If one patient winds up with an ejection fraction of 20% after PCI for STEMI, while another winds up with a fairly normal ejection fraction, we don’t usually consider that the PCI might have harmed the first patient. Rather, we’re more likely to conclude that the PCI just didn’t help the first patient as much (possibly because the territory affected by his occlusion was larger or his door-to-balloon time was too long).

But syndromes are not discrete clinical events like myocardial infarctions. Multiple diverse disease processes can all produce the same syndrome (e.g., ARDS can be caused by pancreatitis or trauma or bowel perforation or bacterial sepsis…). Should we more seriously entertain the possibility of qualitative interactions when we’re considering the efficacy of treatments directed at syndromes?

You describe concern, near the start of the pandemic, that previously-established ARDS ventilation strategies were not working as well as expected. Clinicians’ impressions were that fatality rates among patients with COVID-related ARDS seemed higher than fatality rates among historical patients with other aetiologies for their ARDS (?) In this situation, it was reasonable to ask whether the established ventilation protocols were not just not helping as much as expected, but whether they were actually making outcomes worse than they might have been if an alternate strategy had been used. Complicating any assessment of cause and effect would have been the fact that many factors aside from ventilation strategy can potentially impact the prognosis of critically ill patients.

You propose that doctors need a way to identify, in real-time, whether an established treatment approach might not be having the expected effect in the context of a novel disease. Since ARDS has a high mortality rate and patient deaths are not unexpected, signals of lack of efficacy (or even harm) for an old approach in a new disease context might be missed unless group-level data can be obtained rapidly.

It sounds like one solution you’re proposing is to proactively collect granular data on the clinical trajectories of patients treated with established approaches to clinical syndromes. If we could develop libraries of such data, then later, when treating patients presenting with a novel disease (e.g., COVID-related ARDS), we could check to see whether the new patients’ trajectories are similar to the trajectories we saw in previous patient cohorts. If the trajectories of the new patients appeared less favourable, then maybe this could serve as a signal for doctors to step back and re-evaluate their approach. And maybe this type of lack of efficacy/safety signal could create the clinical equipoise needed to justify rapid recruitment to a large RCT examining different treatment approaches.

Ultimately, you seem to be asking whether the potential for qualitative interactions between patients and treatments might be more important to consider when doctors are treating clinical syndromes like ARDS, rather than diseases with a uniform etiology (like CAD). Please correct me if the above summary is mangled.


Thanks for reviewing this. Your review is correct. In fact this may be the most pressing issue medical statisticians presently face in 2022. We have to teach these points and fix this problem.

The recognition that heterogeneity of treatment effect is amplified by the application of RCTs to syndromes is pivotal, but this is not being taught.

As you point out, HTE has two dimensions: the number of patients with clinically significant variation, and the magnitude of the treatment effect variation. So, as you say, the probability that the variation falls below the x-axis into the negative (the harm level) increases when the RCT is applied to a syndrome rather than a disease.
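The arithmetic behind this concern can be sketched with a toy simulation (all numbers invented for illustration): treat the syndrome as a mixture of two underlying diseases, one helped and one harmed by the treatment (a qualitative interaction). The syndrome-wide RCT estimates the average effect across the mixture, which can look beneficial even while a sizeable subgroup is harmed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical syndrome: a 70/30 mixture of two underlying diseases.
# The treatment helps disease A (+5) but harms disease B (-3); these
# effect sizes and mixing proportions are made up for illustration.
n = 100_000
disease = rng.choice(["A", "B"], size=n, p=[0.7, 0.3])
effect = np.where(disease == "A", 5.0, -3.0)   # per-patient true effect

avg_effect = effect.mean()      # what a syndrome-wide RCT estimates
harmed = (effect < 0).mean()    # share of patients actually harmed

print(f"average effect: {avg_effect:.2f}")   # positive: looks beneficial
print(f"fraction harmed: {harmed:.2f}")      # ~30%: hidden subgroup harm
```

The averaged estimate hides the harmed subgroup entirely; only knowledge of the underlying disease mixture (which the syndrome criteria obscure) reveals it.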

This issue has been largely ignored, I think for the reasons you suggest. The introduction of COVID cases exposed this problem, but I see little evidence that thought leaders are inclined to study this or even discuss it publicly, much less teach it to the young. The fact that thought leaders believed the RCT was applicable to a disease which was not included in the trial shows how little they understand HTE as it relates to syndromes. These types of statements scare those dependent on grants because they are counter to the standard approach, but COVID taught us that silence has consequences for the public health.

We will be increasing pressure on those who are resisting the discussion of this. This simply has to be debated in a public forum.

One mathematical challenge is to quantify the relevant heterogeneity of a syndrome before placing it under test. We need this to be addressed by statisticians, because trialists cannot grasp these issues, apparently due to anchoring. Further, leaders who have taught that staying the course with a protocol is an essential part of quality simply cannot contemplate this complexity.

This paper, written before the pandemic, predicted this problem.



Thanks for posting!

It’s been three years since my original post, but I still think and come back to this topic often.

More recently, I’ve sort of separated this into 2 ways to think of critical appraisal:

  1. statistical choices
  2. non-statistical choices

#1 is largely outside my expertise. This is stuff like trial protocols, preregistration, models, statistical analysis, etc. It’s everything internal to the paper and the population studied within.

#2 is all the non-statistical factors important to externally generalize results, i.e., patient selection, comorbidities, disease definition, disease severity, choice of measurements, timing of intervention relative to disease state, etc. It requires nuance and an appreciation of clinical factors to interpret. This also requires, as does any attempt to generalize evidence, making assumptions beyond those embedded in #1.

These two categories often cross over. You can have a study that perfectly meets everyone’s statistical appraisal but overlooks other limitations to external validity, or a study with limitations in the statistical plan that still provides some relevant data to generalize to clinical practice.

I have a running list of things to remind myself to check through, but not a comprehensive and formal checklist yet!


Hello Raj

If you have any portion of a checklist to post, I think many would appreciate it. Much more needs to be formalized on the front end. We see such formal analysis with the statistics, while the measurement defining the input is often completely ignored. However, if trials expected a review of the reproducibility of the measurement in the methods, or at least a discussion of the limitations induced by the input measurements, then we would see progress. As long as authors can simply say “We used the Apnea-Hypopnea Index, or the Berlin criteria, or Sepsis-3, as defined by the XX consensus conference” in the methods section, little or no progress can be expected in sleep apnea science, ARDS science, and sepsis science, respectively.


Critical appraisal is about identification of safeguards against bias (aka internal validity) in published research. These safeguards are items built into the design or analysis of the study that prevent one of the problems below:
-Selection bias
-Information bias
-Analytical errors
-Design-related deficiencies

Appraising statistical methods, then, is about their relevance and appropriateness to the study at hand and the correctness of their application. These appraisals thus work the other way around: they need the appraiser’s expertise for the appraisal to happen correctly, but they do not “help students or clinicians with appraising statistical methods”, and the latter is not the aim of these checklists. They come in two flavors: risk-of-bias checklists and quality assessment checklists. These should not be confused with reporting checklists, which have nothing to do with critical appraisal.


I’d like to post a link to an older @Sander paper on “quality scoring” for meta-analysis, which is closely related to the desire for “checklists”.

I’ve found it helpful to approach this from first principles. Given a report from a study, do I think it is sufficiently wrong that I’d be willing to bet against the model estimate in a future study? If so, why?

Putting this in a regression context – we might think of any estimate as a compression of the model that produced it. To be a principled critic of a study requires:

  1. The assertion of a variable that is, in principle, measurable and not (adequately) accounted for in the model, or
  2. The objection to the inclusion of a variable that may bias the estimate (or sample variance) in some way.
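Point 1 can be made concrete with a toy regression sketch (all coefficients invented): leaving a measurable confounder out of the model biases the estimate of interest, and adjusting for it recovers the truth.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustration of point 1: an omitted confounder biases the estimate.
# The true effect of x on y is 2.0; u confounds the relationship.
n = 50_000
u = rng.normal(size=n)                        # confounder
x = 0.8 * u + rng.normal(size=n)              # exposure influenced by u
y = 2.0 * x + 3.0 * u + rng.normal(size=n)    # outcome

# Model omitting u: the slope on x absorbs part of u's effect.
X1 = np.column_stack([np.ones(n), x])
b_omit = np.linalg.lstsq(X1, y, rcond=None)[0][1]

# Model adjusting for u recovers the true effect of x.
X2 = np.column_stack([np.ones(n), x, u])
b_adj = np.linalg.lstsq(X2, y, rcond=None)[0][1]

print(b_omit, b_adj)   # b_omit biased upward; b_adj close to 2.0
```

A principled objection of type 1 amounts to naming such a `u` and arguing it is both measurable and consequential; point 2 is the mirror image, e.g., objecting to adjustment for a variable on the causal pathway.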

From the reference

“It appears that ‘quality’ (whatever leads to more valid results) is of fairly high dimension and possibly non‐additive and nonlinear, and that quality dimensions are highly application‐specific and hard to measure from published information.”

Given the complexity of quality, it’s hard to imagine that a checklist is not essential if the goal is to render a very high quality study. However, there appears to be little downside to the researcher or statistician for rendering a low quality study. There is no formal quality review of clinical trials as there is in aviation and medicine, where lack of quality has consequences. In fact, the researcher simply lists limitations at the back as if none of them were preventable. Imagine an engineer doing that: “One of the limitations of our bridge was that we did not have a mixer for the concrete, so we mixed it by hand.” … “Rebuilding of the bridge with a mixer is indicated.”