Statistical Misinterpretations and Clinical Practice Guidelines: any examples?

R_cubed · June 23, 2021, 2:47pm

What in the world do billing codes have to do with the prediction of a clinical syndrome? That should be defined purely on anatomical or physiological (and maybe psychosocial) criteria.

f2harrell · June 23, 2021, 3:10pm

It’s even much worse than that makes it seem. I don’t think the original analysis had a proper time zero with the use of only true baselines as predictors.

llynn · June 23, 2021, 5:16pm

Clinical research that applied statistics to billing codes as a surrogate for physiologic data was popular a few years ago. They call it clinical research using “administrative data”. On medtwitter I called it “#BillingCodeScience” (BCS) and I decried its capricious derivation and use. In fact, I believe my Twitter-based critique of this pathological science brought it down.

To make matters worse, billing coding for sepsis was very poor. Undeterred, to compensate, the “measurement” for sepsis was based on the combination of two billing codes (one for infection and a second for organ failure). This was called the “Angus Method”. Never mind that the two codes might not have been related.

As I predicted on Twitter, billing code science failed and was recently abandoned, but BCS left some residue in its wake. This algorithm appears to be part of it.

BCS was, however, very widely used by the top brass with the seminal BCS research published in JAMA for example.

I have been trying to seek help from this group, pointing out that mainstream statisticians should not dutifully apply math to capricious, nonreproducible measurements.

“Billing code science” (the use of billing code combinations as surrogates for physiologic signals) shows that the phrase “It is a single function” should be taught at every level.

davidcnorrismd · June 23, 2021, 11:40pm

“When the desirable is not available, the available becomes desirable.”

— Nigerian proverb

R_cubed · June 24, 2021, 5:07am

I can’t remember where I found it (probably one of Frank’s reference lists), but this paper on risk prediction as preferable to diagnosis (classification) is really where clinical care needs to go, especially with regulators insisting that health systems receiving payments take on financial risk for the care provided.

That, in turn, implies that quality assurance and improvement would be involved in the development and implementation of risk prediction (and reduction) models, instead of checklists based on arbitrary dichotomies.

Addendum: A related article on risk adjusted payments for primary care. These are the types of models doctors, nurses, and other professionals on the front lines need. A critical point – educating these people on when the model is not applicable will be critical if outcomes are to be improved.

eliast · June 24, 2021, 10:58am

"Only two studies (n = 1298) measured referrals to domestic violence support services following clinical identification. We detected no evidence of an effect on referrals (OR 2.24, 95% CI 0.64 to 7.86, low quality evidence)."

I think this may also be an example of a wrong interpretation of confidence intervals. Interestingly, this is a Cochrane meta-analysis.
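To make the point concrete, here is a minimal sketch that works backwards from the quoted odds ratio and interval. It assumes the interval is a symmetric Wald interval on the log-odds scale, which the quote does not state; the function name `wald_summary` is made up for this illustration.

```python
import math

def wald_summary(or_hat, lo, hi, z_crit=1.96):
    """Recover the log-scale standard error, z statistic and two-sided
    p-value from an odds ratio and its 95% interval.
    Assumption: the interval is a Wald interval on the log-odds scale."""
    se = (math.log(hi) - math.log(lo)) / (2 * z_crit)
    z = math.log(or_hat) / se
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p

# Values taken from the quoted sentence above
se, z, p = wald_summary(2.24, 0.64, 7.86)
print(f"SE(log OR) = {se:.2f}, z = {z:.2f}, two-sided p = {p:.2f}")
```

Under that assumption the data are compatible with anything from a modest reduction in referrals (OR 0.64) to a nearly eight-fold increase (OR 7.86). That is a very imprecise estimate, which is not the same thing as "no evidence of an effect".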

pmbrown · June 24, 2021, 3:11pm

“administrative data” now euphemistically called “real world data” may supplement drug approval claims

“Real-world data (RWD) forms the basis for real-world evidence (RWE) and can be extracted from a broad range of sources such as patient registries, health care databases, claims databases, patient networks, social media, and patient-generated data from wearables” Trial designs using real-world data: The changing landscape of the regulatory approval process

llynn · June 24, 2021, 4:21pm

Billing codes were shown to be poor surrogates for physiologic data in sepsis and therefore BCS was abandoned. Consider how meaningless the statistics were in sepsis BCS. All those resources wasted.

However, sepsis is difficult to define objectively even with physiologic data. Obviously administrative data has utility, but the role of the statistician, IMHO, will be to define its place as a reproducible measurement, relevant to the hypothesis under test, before dutifully applying statistical processing.

R_cubed · June 24, 2021, 5:06pm

I’m somewhat surprised that anyone took this proposal seriously. It seems like a category error to me; like asking about the chemical properties of oil or metals by studying futures prices.

I bet they used “machine learning” to derive it, however.

pmbrown · June 25, 2021, 4:16am

Not completely abandoned - I just did such an analysis. I'm no longer involved, but I left a bunch of completed estimates with people who will try to publish them. The reproducibility issue is noted here: "What is a fake measurement tool and how are they used in RCT" (#29 by pmbrown). Not sure what can be done.

llynn · June 27, 2021, 10:48pm

There was formal teaching to use BCS in sepsis so no one can blame anyone that performed this type of research. BCS for sepsis emerged during the heady era of “big data”.

It probably works in some better defined fields. Admonitions, and the article below, precipitated the decline of sepsis billing code science.

QUOTE FROM THE CONCLUSION
“…in contrast to claims-based analyses, neither the incidence of sepsis nor the combined outcome of death or discharge to hospice changed significantly between 2009-2014. The findings also suggest that EHR-based clinical data provide more objective estimates than claims-based data for sepsis surveillance.”

The last sentence states well that which I predicted on twitter.

Of course, in retrospect, BCS for sepsis should have been tested against EHR clinical data before it was promulgated as a sepsis research method by experts.

When research methods are found to be flawed, that should be promulgated. It should not require a PubMed search to learn that a well-established tool is no longer considered valid.

llynn · June 28, 2021, 1:01am

Here is another example of an apparent adverse guideline effect induced by misinterpretation or incorrect application of statistics.

The initial practice guidelines for COVID-19 were based on the established guidelines for treatment of Acute Respiratory Distress Syndrome (ARDS) (as due to influenza).

This statistically misinterpreted study shows why there was confidence in this approach.

Note the raw mortality is not significantly different but encouragingly the “SOFA score-adjusted mortality of H1N1 patients was significantly higher than that of COVID-19 patients, with a rate ratio of 2.009 (95% CI, 1.563-2.583; P < .001).”

In March 2020, not to worry!

We now know this was misleading and that modification of the guidelines for COVID-19 was required, but this was delayed. Many were probably compelled by the SOFA-adjusted P value here.

The problem is that SOFA is a “fake” (guessed) one-size-fits-all measurement tool for RCT for reasons I have noted in this forum.

Looking deeper, why would adjusting for SOFA be so misleading?

SOFA is a summation score derived from ordinal thresholds of 6 signals.

https://www.mdcalc.com/sequential-organ-failure-assessment-sofa-score

Of these 6 signals, the one most likely to be perturbed in COVID-19 is the PaO2/FIO2. But the PaO2/FIO2 is a volatile signal, affected by reversible things like low PEEP or mucus plugging. In many cases PaO2/FIO2 is readily correctable. Therefore the timing of the PaO2/FIO2 (early or late in the care) greatly influences the mortality prediction. Here we see that influenza (H1N1) cases presented with lower PaO2/FIO2.
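A minimal sketch of the timing problem, for readers who want to see it in numbers. The cut points and the two PaO2/FIO2 readings are assumptions chosen for illustration (consult the SOFA table linked above for the actual respiratory definition, which also depends on respiratory support), and `ordinal_points` is a made-up helper name.

```python
from bisect import bisect_right

# Illustrative cut points for a PaO2/FIO2-style ratio (assumed for this
# sketch; not a substitute for the published SOFA table).
CUTS = (100, 200, 300, 400)

def ordinal_points(ratio, cuts=CUTS):
    """Map a continuous ratio to ordinal points: one point for each
    cut point that the ratio falls strictly below."""
    return len(cuts) - bisect_right(cuts, ratio)

# Hypothetical readings for the SAME patient, an hour apart:
before_correction = 95   # e.g. low PEEP or a mucus plug
after_correction = 210   # after the reversible problem is fixed

print(ordinal_points(before_correction))  # points if sampled early
print(ordinal_points(after_correction))   # points if sampled later
```

With these assumed numbers the component swings by two points depending only on when the value was sampled, so an adjustment based on the summed score inherits that arbitrariness.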

The “take home point” of the editor published in March 2020 in the prestigious journal Chest was misleading.

"Interpretation:

Compared with H1N1, patients with COVID-19-induced ARDS had lower severity of illness scores at presentation and lower SOFA score adjusted mortality."

The article also suggests (without data presented) that corticosteroids were not beneficial and suggested they may be harmful. (Later corticosteroids became the standard of care.)

This is how adjustment with a guessed (1996) traditional score (which has reached standard use by PIs & statisticians) can be misleading.

Yet SOFA was originally guessed for use with sepsis, and we have discussed its limitations elsewhere in this forum. Clearly adjustments like this can produce misleading results and adversely affect public policy and medical guidelines.

This is a form of “Threshold Science”. Threshold science is not the use of thresholds in science, as this is a compromise often made. Rather, “Threshold Science” is a specific subset of science. It is a strange science which emerged in the late 1970s and 80s and uses guessed threshold sets like SOFA as gold standards (criteria), independent variables, adjustments, or outputs in RCT.

Here is an example of the harm threshold science can render.

If anyone would like to learn more about this unique pathologic science please message me.

I have been researching threshold science for over a decade and have a robust archive to share. This would be a powerful overarching and nascent topic for publication.

pmbrown · June 28, 2021, 10:40am

I find this intriguing and wish I had more detail:

Orphazyme Arimoclomol Needs Better Evidence In NPC

"The FDA questioned the clinical severity scale that served as the primary endpoint for the pivotal trial of Orphazyme A/S’s Miplyffa (arimoclomol) in a complete response letter announced 18 June.

The arimoclomol NDA for treatment of Niemann-Pick disease type C (NPC), a rare lysosomal storage disease, rests on a single Phase II/III using the 5-domain NPC Clinical Severity Scale (NPCCSS) to determine disease progression for a primary endpoint.

In the CRL, the FDA requested “additional qualitative and quantitative evidence to further substantiate the validity and interpretation” of the scale, “in particular, the swallow domain,” the company reported. The NPCCSS evaluates the five domains considered most clinically relevant: ambulation, swallow, cognition, speech and fine motor skills. "

https://pink.pharmaintelligence.informa.com/PS144544/Keeping-Track-US-FDA-Issues-Alzheimers-Breakthrough-Designations-Arimoclomol-CRL-Rinvoq-Delays

pmbrown · June 28, 2021, 10:50am

but in EHR they flag using icd codes which are also susceptible to ‘politics’ and bias?

R_cubed · June 28, 2021, 11:00am

This is a recent validation report for the scale mentioned. Looks rather small in terms of sample size to me.

llynn · June 28, 2021, 12:29pm

Yes.

BCS was used to show that sepsis guidelines had reduced mortality as percent mortality had fallen over time. However the number of sepsis billing codes had greatly increased.

I pointed out that obviously the fall in percent mortality could be an artifact of increased billing coding due to the sepsis awareness campaign. This was confirmed in the article I cited.

This is the classic mistake of using historical controls and defining outcome as a ratio since the denominator may not be the same.
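The denominator effect is easy to reproduce with made-up numbers. In this sketch every count is hypothetical: mortality among the originally coded (severe) cases is held fixed, and the awareness campaign only adds milder cases to the coded pool.

```python
def percent_mortality(deaths, cases):
    """Deaths as a percentage of coded cases."""
    return 100.0 * deaths / cases

# Hypothetical period 1: only severe cases receive a sepsis code.
severe_cases, severe_deaths = 1000, 300
before = percent_mortality(severe_deaths, severe_cases)

# Hypothetical period 2: same severe cases with the same deaths, plus
# milder cases that are now also coded as sepsis.
mild_cases, mild_deaths = 500, 50
after = percent_mortality(severe_deaths + mild_deaths,
                          severe_cases + mild_cases)

print(f"before: {before:.1f}%  after: {after:.1f}%")
```

Percent mortality falls from 30% to about 23% even though no patient group does any better and the absolute number of deaths went up, which is why a falling ratio over historical periods says little about the guidelines by itself.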

The issue of bias is correct, as sepsis science is one of the last remaining of the “threshold sciences”. Indeed, the 1990s guessed thresholds (two SIRS thresholds + evidence of infection) used to diagnose sepsis during the prior study period have now been abandoned for use in RCT, as they were found to have low specificity for sepsis.

Regrettably, the new criteria are based on SOFA (another 1990s guessed threshold set). No one knows the actual AUC of SOFA for sepsis, as it is the gold standard by decree. That's the way threshold science works…by decree. Then you apply the stats.

pmbrown · June 28, 2021, 2:59pm

I assume “threshold science” is a burgeoning activity, with the “5-domain NPC Clinical Severity Scale (NPCCSS)” I mention above a possible example. With SOFA it must be straightforward to run data simulations that illustrate your concerns, backed up with the reportedly low sensitivity/specificity?

pmbrown · June 28, 2021, 3:09pm

seems analogous to the threshold science @llynn is concerned about: “The disease severity ranged from ≤ 4 (n = 7); 5–19 (n = 36); and ≥ 20 (n = 6) on the 5-domain NPCCSS.”

I've dealt with similar scales now and again, e.g. in cognition (a huge amount of multiplicity with subscales etc., many forest plots). We need to push the clinical people towards Frank Harrell’s blog posts. I checked the literature now and again, and no one was using PO (proportional odds) modelling for the scales I was dealing with.

R_cubed · June 28, 2021, 3:11pm

Rehabilitation science is littered with these improper types of analyses. I only had vague intuitions of it when I was in school years ago.

llynn · June 28, 2021, 5:23pm

This is difficult because delta 2 SOFA is now the gold standard. The simulation would require comparison with some other standard, perhaps expert review of cases. It would require a large multicenter data set and time zero selection, and I doubt it would be funded.

It is very hard to define critical illnesses and multisystem disease. It would be ideal if we had a single measurement tool for that purpose. This is why threshold science exists.

In this article you see that SOFA is considered to have very broad utility in many different types of diseases and for a wide range of purposes in many types of RCT. SOFA can be used as a gold standard (diagnostic criteria), independent variable, primary or secondary endpoint, and/or adjustment.

It all sounds good until you read the pages of limitations and confounders about which, they acknowledge, there is little understanding, and it unravels when you see the adverse effect in a study like the one I cited for COVID-19.
