Statistical Misinterpretations and Clinical Practice Guidelines: any examples?

In this important paper on the interpretation of statistics in medicine, the authors write:

Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant. A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof. Instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists. This high cognitive demand has led to an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously so—and yet these misinterpretations dominate much of the scientific literature.

I was curious if anyone had examples of guidelines that:

  1. commit one or more of the errors described in the paper
  2. have become entrenched in institutional practice due to acceptance by regulatory authorities.

The closest threads that have approached the topic:

Addendum: A quick search brought up this article by @Stephen (inspired by @pmbrown just below), which is a good example of what I’m looking for:


Maybe FDA’s two-trials rule and Senn’s comments on that, although I don’t recall where this appears in FDA material. Edit: probably also the critical response to the recent EMA guideline on subgroup analyses: Commentary on: Subgroup analysis and interpretation for phase 3 confirmatory trials: White Paper of the EFSPI/PSI working group on subgroup analysis, by Dane, Spencer, Rosenkranz, Lipkovich, and Parke


Several of these examples seem to have more to do with reification errors related to (but perhaps not quite the same as?) those discussed by @Sander [1], rather than with mere misinterpretations of statistical technicalities. That is, the problem is not merely technical, but “cognitive” or (I would rather say) “epistemological.” For example, apart from the utter lack of scholarship on FDA’s part, which seems to be @Stephen’s main complaint in the linked post, he is also troubled by how artifacts emerging from a statistical analysis are presumed ipso facto to amount to real things (e.g., responders) that exist in the real world. A lot of @llynn’s concerns fall into the same category: artificial thresholds become reified (e.g., SIRS, OSA) as if they were per se the subject of investigation, rather than mere phenomena of the Ding an sich. Likewise, the whole surrogate-endpoints business is at least complicated by our predilection toward reifying frequently used terms such as ‘progression-free survival’.

It seems to me that such cognitive/epistemological errors are far more insidious than technical errors in p-value interpretation, and much more capable of insinuating themselves into CPGs.

  1. Greenland S. Invited Commentary: The Need for Cognitive Science in Methodology. Am J Epidemiol. Published online 2017:1-7. doi:10.1093/aje/kwx259
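A toy simulation (my own illustration, not from any of the cited papers) makes the cost of such threshold-thinking concrete: when an outcome depends on a continuous severity measure, replacing that measure with an above/below-threshold “diagnosis” discards a large share of the association, yet it is the reified category that then gets studied.

```python
import random

def corr(a, b):
    """Pearson correlation, computed from first principles."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

random.seed(1)
n = 20_000
severity = [random.gauss(0, 1) for _ in range(n)]       # continuous signal
outcome = [s + random.gauss(0, 1) for s in severity]    # outcome tracks the continuum
diagnosis = [1.0 if s > 0 else 0.0 for s in severity]   # median-split "disease: yes/no"

r_cont = corr(severity, outcome)   # theory: 1/sqrt(2), about 0.71
r_dich = corr(diagnosis, outcome)  # theory: about 0.56
print(round(r_cont, 2), round(r_dich, 2))
```

The exact numbers depend on the cut point and the distributions, but a median split of a normal predictor is known to cost roughly a third of the explained variation; the resulting category (SIRS-positive, OSA by an AHI cutoff) is then treated as if it were the underlying thing itself.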

“It seems to me that such cognitive/epistemological errors are far more insidious than technical errors in p-value interpretation, and much more capable of insinuating themselves into CPGs.”

Good point. This brings up two more questions:

  1. What does the clinician do when someone in authority makes policy decisions based upon things they “know” that really aren’t so?

  2. In the U.S., federal regulators have made it clear that they intend to shift financial risk to providers of health services through policies like value-based healthcare. How can health service providers make defensible decisions when communities of clinicians and/or regulators are misled by these epistemological errors?


Have you been able to contact the authors of the paper to ask them for examples?
As a non-clinician reading clinical guidelines, I’m quite astounded by how they can be put together. I’m not convinced that there is rigorous evaluation of the methodology and interpretation of the studies that inform guidelines.
In 2020, the European Society of Cardiology made some changes to the guidelines for assessing patients in the Emergency Room with possible acute coronary syndrome. Those changes have come under some heavy criticism (Giannitsis E, et al. Critical appraisal of the 2020 ESC guideline recommendations on diagnosis and risk assessment in patients with suspected non-ST-segment elevation acute coronary syndrome. Clin Res Cardiol. 2021). I don’t think it is exactly what you are after [misinterpretations], but it does imply guideline changes were based on inadequate evidence.


“It seems to me that such cognitive/epistemological errors are far more insidious than technical errors in p-value interpretation, and much more capable of insinuating themselves into CPGs.”

This @davidcnorrismd post changed my understanding of “Threshold Science” which I have studied for over a decade.

I have long considered Threshold Science (the use of guessed threshold sets as gold standards or endpoints for RCTs) a type of “pathological science,” after Langmuir (wherein the “pathology” of the science is induced by an early undetected technical error in methodology). Yet it never quite fit. Kuhn’s teachings also failed, as the behavior he describes applies to both good science and pathological science.

I see now that Threshold Science comprises a different type of pathological science, induced by a fundamental cognitive error rather than a technical error.
The distinction between Langmuir’s Type I pathological science (fundamental methodological error) and Type II (fundamental cognitive error) is very important. The trusting, the dutiful, and particularly the best trained in the discipline will not be able to detect the error of Type II pathological science, whereas they may be able to detect the error of Type I.

Type II pathological science is propagated by text and mentorship. Indeed, since all in the discipline are unknowingly faithful to the pathologic teaching, its practice is mandated to receive funding and to publish in the top journals.

An entire scientific discipline (e.g., sleep apnea science) may be based on Type II pathological science. Its practitioners may recognize that the science is not working, but they will be completely confused, as they cannot understand why.

Here is another good example, mentioned in another data methods thread:

We find that in all cases the provision of HHC increased the probability of readmission of the treated patients. This casts doubt on the appropriateness of the 30-day readmission rate as an indicator of hospital performance and a criterion for hospital reimbursement, as it is currently used for Medicare patients.

Alecos Papadopoulos & Roland B. Stark (2021) Does Home Health Care Increase the Probability of 30-Day Hospital Readmissions? Interpreting Coefficient Sign Reversals, or Their Absence, in Binary Logistic Regression Analysis, The American Statistician, 75:2, 173-184, DOI: 10.1080/00031305.2019.1704873 link

It is a shame this paper is paywalled, but here is a brief blog post by one of the authors:
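The sign-reversal mechanism is easy to reproduce in a toy simulation (my own sketch with made-up numbers, not the authors’ model): if sicker patients are both more likely to receive home health care and more likely to be readmitted, a treatment that helps within every severity stratum can still look harmful in the crude comparison.

```python
import random

random.seed(7)
n = 100_000
counts = {}  # (sick, treated) -> [readmissions, patients]

for _ in range(n):
    sick = random.random() < 0.5
    # Confounding by indication: the sick are far more likely to get HHC.
    treated = random.random() < (0.8 if sick else 0.2)
    # Within EVERY stratum, treatment lowers readmission risk.
    p = (0.40 if treated else 0.50) if sick else (0.05 if treated else 0.10)
    readmit = random.random() < p
    k = (sick, treated)
    counts.setdefault(k, [0, 0])
    counts[k][0] += readmit
    counts[k][1] += 1

def rate(cells):
    events = sum(counts[k][0] for k in cells)
    total = sum(counts[k][1] for k in cells)
    return events / total

crude_treated = rate([(True, True), (False, True)])    # ~0.33
crude_control = rate([(True, False), (False, False)])  # ~0.18
print(f"crude: treated {crude_treated:.2f} vs control {crude_control:.2f}")
print(f"sick stratum:    {rate([(True, True)]):.2f} vs {rate([(True, False)]):.2f}")
print(f"healthy stratum: {rate([(False, True)]):.2f} vs {rate([(False, False)]):.2f}")
```

Conditioning on severity reverses the sign, which is why judging hospitals on crude 30-day readmission rates penalizes exactly the providers treating the sickest patients.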


Another good example found on Dr. Harrell’s blog – not sure how I missed it,


I think I’ve found something worth studying more closely:

This resource is widely used for hospital and health insurance quality assessment. It will be educational to look at the guidelines and their sources, and examine them from a decision theoretic perspective.

An early article discussing its relevance is here:

Off topic: I think I finally found the Senn commentary on the FDA two-trials rule alluded to by @pmbrown above. I should organize these a bit better.


It’s also covered in his book, “12.2.8 The two-trials rule” (2nd edition). I guess that’s what was in my mind.
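For anyone without the book to hand, the core arithmetic behind the two-trials-rule critique can be sketched in a few lines (my own back-of-envelope version of the argument, not Senn’s text): requiring two independent trials, each significant one-sided at 0.025, gives an overall type I error of 0.025² = 1/1600, yet a single pooled analysis held to that same stringent level would have more power.

```python
import math
from statistics import NormalDist

nd = NormalDist()
alpha_each = 0.025                 # one-sided, i.e. two-sided 0.05 per trial
alpha_overall = alpha_each ** 2    # both trials must succeed: 1 in 1600

# Suppose the true effect gives each trial 80% power on its own.
z_crit = nd.inv_cdf(1 - alpha_each)          # ~1.96
drift = z_crit + nd.inv_cdf(0.80)            # per-trial expected z, ~2.80
power_two_trials = (1 - nd.cdf(z_crit - drift)) ** 2   # 0.8**2 = 0.64

# One pooled analysis of the same data, tested at the SAME overall alpha:
# the pooled z statistic is (z1 + z2)/sqrt(2), so the drift grows by sqrt(2).
z_crit_pooled = nd.inv_cdf(1 - alpha_overall)          # ~3.23
power_pooled = 1 - nd.cdf(z_crit_pooled - drift * math.sqrt(2))

print(f"overall alpha: {alpha_overall:.6f} (1 in {1 / alpha_overall:.0f})")
print(f"power, two-trials rule: {power_two_trials:.2f}")
print(f"power, pooled at same alpha: {power_pooled:.2f}")
```

So the rule is both far more stringent than the nominal 5% suggests and less efficient than a combined analysis controlling the same error rate, which is, as I understand it, the gist of the complaint.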


This was posted by Frank on Twitter:

Additional motivation to identify and treat sepsis cases lies in the fact that sepsis serves as a system-level quality measure, with hospitals judged both by the federal Department of Health and Human Services and the CDC on their sepsis rates. Complicating efforts to reduce sepsis is how difficult it can be to diagnose—both accurately and quickly.

Given what I’ve learned about sepsis from @llynn, this failure of the prediction model isn’t surprising.


No one should be surprised by this. It is, however, tragic on quite a massive scale.

Quote from the article:
“He added that Epic isn’t wrong in their analysis. We differ in our definition of the onset and timing of sepsis. In our view, their definition of sepsis based on billing codes alone is imprecise and not the one that is clinically meaningful to a health system or to patients.”

This is the point I have repeatedly made in this forum. The analyses (the statistics) are perceived as correct but the definition (the measurement) is wrong.

That makes little sense in the real world (for example, the world of study design by a statistician or bridge design by an engineer). In both those cases the measurement is part of the math. It is a continuous function.

If, when designing a bridge, the engineer does not consider the structural math of the concrete to be used, she is a fool. When the bridge collapses, it is no mitigation that her math was correct, because her math should have been informed by the structural math of the specified concrete mix. It is a continuous function.

Analysis of measurement is part of the design. To an engineer, a bridge is a continuous function. Its strength depends on the weakest math of the function.

So too a study design.


What in the world do billing codes have to do with the prediction of a clinical syndrome? That should be defined purely by anatomical or physiological (and maybe psychosocial) criteria.


It’s even worse than that makes it seem. I don’t think the original analysis had a proper time zero with the use of only true baselines as predictors.


Clinical research applying statistics to billing codes as a surrogate for physiologic data was popular a few years ago. They call it clinical research using “administrative data.” On medtwitter I called it “#BillingCodeScience” (BCS), and I decried its capricious derivation and use. In fact, I believe my Twitter-based critique of this pathological science brought it down.

To make matters worse, billing coding for sepsis was very poor. Undeterred, to compensate, the “measurement” for sepsis was based on the combination of two billing codes (one for infection and a second for organ failure). This was called the “Angus Method.” Never mind that the two codes might not have been related.

As I predicted on Twitter, billing code science failed and was recently abandoned, but BCS left some residue in its wake. It appears that this algorithm may have been part of it.

BCS was, however, very widely used by the top brass with the seminal BCS research published in JAMA for example.

I have been trying to seek help from this group, pointing out that mainstream statisticians should not dutifully apply math to capricious, nonreproducible measurements.

“Billing code science” (the use of billing code combinations as surrogates for physiologic signals) shows that the phrase “It is a continuous function” should be taught at every level.


“When the desirable is not available, the available becomes desirable.”

— Nigerian proverb


I can’t remember where I found it (probably one of Frank’s reference lists), but this paper on risk prediction as preferable to diagnosis (classification) is really where clinical care needs to go, especially with regulators insisting that health systems receiving payments engage in financial risk-taking for the care provided.

That, in turn, implies that quality assurance and improvement would be involved in the development and implementation of risk prediction (and reduction) models, instead of checklists based on arbitrary dichotomies.

Addendum: A related article on risk-adjusted payments for primary care. These are the types of models doctors, nurses, and other professionals on the front lines need. A key point: educating these professionals about when a model is not applicable will be critical if outcomes are to be improved.


“Only two studies (n = 1298) measured referrals to domestic violence support services following clinical identification. We detected no evidence of an effect on referrals (OR 2.24, 95% CI 0.64 to 7.86, low quality evidence).”

I think this may also be an example of a wrong interpretation of confidence intervals. Interestingly, this is a Cochrane meta-analysis.
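The reported interval itself shows how little the data can rule out. A quick reconstruction (my own arithmetic, assuming the usual Wald interval on the log-odds scale; the numbers are from the quote above):

```python
import math

or_hat, lo, hi = 2.24, 0.64, 7.86   # from the review quoted above

# Recover the standard error of log(OR) from the interval:
# the 95% limits are exp(log(OR) +/- 1.96 * SE).
se = (math.log(hi) - math.log(lo)) / (2 * 1.96)

print(f"SE(log OR) = {se:.2f}")
print(f"interval spans a {hi / lo:.0f}-fold range of odds ratios")
```

An interval running from a one-third reduction to an almost eight-fold increase in referrals is entirely compatible with a large benefit, and the point estimate sits well above 1. “We detected no evidence of an effect” reads to most clinicians as “there is no effect,” which is exactly the absence-of-evidence/evidence-of-absence error the paper warns about.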


“Administrative data,” now euphemistically called “real-world data,” may supplement drug approval claims:

“Real-world data (RWD) forms the basis for real-world evidence (RWE) and can be extracted from a broad range of sources such as patient registries, health care databases, claims databases, patient networks, social media, and patient-generated data from wearables” Trial designs using real-world data: The changing landscape of the regulatory approval process

Billing codes were shown to be poor surrogates for physiologic data in sepsis, and therefore BCS was abandoned. Consider how meaningless the statistics were in sepsis BCS. All those resources wasted.

However, sepsis is difficult to define objectively even with physiologic data. Obviously administrative data has utility, but the role of the statistician, IMHO, will be to define its place as a reproducible measurement, relevant to the hypothesis under test, before dutifully applying statistical processing.