Threshold Science, OSA, and the Statistician's Role

I have discussed the need for statisticians to assure that the measurements to which they apply statistics are reproducible and based on evidence rather than opinion.

This is what happens when they do not do that and simply dutifully do the math using the measurements they are given.

Now everyone believes that Obstructive Sleep Apnea (OSA) is morbid. But OSA is quantified by a set of thresholds guessed in the 1970s and '80s. This is the "gold standard" Apnea-Hypopnea Index (AHI).

Essentially this is a measurement based on the addition of threshold-defined events of various types and durations: a bucket of apples and oranges which have one thing in common…they are all at least 10 seconds long.

Here’s the calculation for an OSA RCT.

Count the number of 10-second airflow attenuations per hour of sleep: 5–15 is mild, 15–30 moderate, and greater than 30 severe.

Sounds fairly easy to guess, and it was. It works clinically because it is highly sensitive for OSA, but it doesn't work as a gold standard because its correlation with the actual severity of the pathophysiologic perturbations that occur during OSA is poor. For example, a 10-second partial breath-hold counts the same as a 2-minute breath-hold with an oxygen desaturation into the 50s.
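The arithmetic can be sketched in a few lines. This is a toy illustration of the AHI calculation as described above (function names and data are invented for the example), not clinical software:

```python
# Toy sketch of the AHI: count threshold events per hour of sleep, then map
# the index to the conventional guessed severity bands discussed in the post.

def ahi(event_durations_sec, hours_of_sleep, min_duration=10):
    """Count events lasting at least `min_duration` seconds per hour of sleep."""
    events = [d for d in event_durations_sec if d >= min_duration]
    return len(events) / hours_of_sleep

def severity(index):
    """Map an AHI value to the conventional severity bands."""
    if index < 5:
        return "normal"
    if index < 15:
        return "mild"
    if index < 30:
        return "moderate"
    return "severe"

# A 10-second partial obstruction and a 120-second event with a deep
# desaturation contribute identically to the count -- the flaw described above.
durations = [10, 120, 11, 45, 10, 90]
index = ahi(durations, hours_of_sleep=1.0)
print(index, severity(index))  # -> 6.0 mild
```

Note that the 120-second event and the 10-second events are indistinguishable once summed, which is exactly the apples-and-oranges problem.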

Yet 35+ years of RCTs. 35+ years of compliant statisticians dutifully applying the math. Running the LR. Adjusting for who knows what. All a waste.

You see, you can't guess a set of threshold criteria for measuring a disease. You can't guess a duration like "10 seconds." That is the number of fingers on the guesser's hands, but that's not a clue.

Worse, it's hard to quantify complex pathophysiology by addition. That worked in ancient Egypt for coins (if they are of the same type) but not for human pathophysiology.

I discovered that the gold standard method of counting 10-second events had been guessed when I was studying OSA in 1997. After a retrospective study I funded at OSU failed, I reviewed the raw data and could see that the AHI (which I had thought was a solid gold standard) was capricious, so I decided to seek its origin. The internet was not available, so a medical librarian and I tracked the source down to the Western Journal of Medicine, 1976. Amazing: the gold standard was guessed without data.

I thought the discovery that the gold standard for OSA was guessed was a revelation. No one cared. The RCTs flowed.

I said to the scientists, "You will not be able to quantify the disease by using a guess. Eventually you will be unable to prove your own disease (OSA) is morbid." No one cared. Yet, indeed, that is what has now happened.

Now what I predicted has proven true. Yet statisticians could have prevented this waste by simply asking for the origin of the gold standard (AHI) and for evidence of the reproducibility/variability of the AHI. As the AHRQ article shows, the AHI is poorly reproducible and varies widely depending on the definition of hypopnea.

OSA science was, perhaps, the original threshold science. "Threshold Science" is a pathological science which emerged in the 1970s and uses guessed threshold sets as gold standards in RCTs.

OSA science is the prototypic threshold science. As Langmuir described, pathological science looks real. The stats are sound, but there is a flaw in the scientific method which occurred in the past, and no one knows it. (Here, the flaw was the guessed origin of the gold standard.) Often there are reinforcing, temporary positive results due to threshold interactions. Yet the science ultimately fails due to intractable nonreproducibility. It may take decades. With OSA it has taken nearly four decades.

Of course, OSA IS morbid. All pulmonary docs know that. In addition, CPAP is the best treatment. Why can't we prove that?
The reason is that OSA was conflated with the guessed AHI. The AHI is a highly variable, guessed measurement which is NOT morbid.

Now what role did the statisticians play? Well, they applied LR and adjustments to a guessed set of thresholds. That does not make statistical sense. It's hard to blame them; it's not their job to investigate gold standards, but maybe that's the lesson here. Perhaps, going forward, some due diligence relevant to the gold standard should be a general requirement before agreeing to do the math. Then you are not questioning the investigator or the science, just following the rules.

In medicine we now have checklists to assure we do not miss an important care component.

Perhaps statisticians should have a checklist.

Gold standard is based on evidence: check
Gold standard is reproducible: check


During a consultation in biomedical research, one of the very first questions I ask is how the response variable was determined. The second, if a threshold was involved, is how the threshold was validated. After the researcher refers me to the fundamental paper, I check for a validation of said threshold. To this day I've never seen a single one validated. Thanks for writing so clearly about this problem and the need for statisticians to stand up. @Stephen wrote an article encouraging statisticians to be intimately involved in understanding and deriving measurements. Some statisticians criticized him for this, saying it was not in our domain!


I do not like the buzzword "validated": validated is when some papal authority or KOL (key opinion leader) says so. I ask, "Why are you using this threshold?", quoting FH's "Don't dichotomize too early."

And the answer is: "Oh, please, try all thresholds and tell me which is the best."

I have become a bit cynical on this subject.


I have learned from @f2harrell the hazards of thresholds. Yet thresholds, determined for example using the AUC, are helpful in clinical medicine. In the end there is often a seemingly dichotomous decision to make: infuse the drug or don't infuse the drug; surgery or no surgery. But as pointed out, it's not quite a dichotomy, as there are also other options, such as delaying the decision, retesting after time, etc.

“Threshold Science” is a little different.

In Threshold Science, the gold standard for the disease itself (for example OSA), or for the primary or secondary endpoint (for example SOFA), is defined by one or more thresholds which were not derived from a trial but rather guessed by experts.

In an RCT investigating a treatment we have two measurements separated by the treatment.

– A first measurement before treatment, to determine the presence of the disease under test, and
– A second measurement after treatment, to determine a prespecified response to the treatment.

If either the first or second measurement is defined by a guessed set of thresholds this is 20th century “Threshold Science”.

Threshold Science was very common in the 1980s, especially in the field of cardiology, where it is no longer practiced. It has mostly faded elsewhere, but it still has a wide following in some disciplines and is supported (probably as a function of naivete) by some prominent statisticians.

If you investigate the history of threshold science, it's a strange journey. We know that a disease may be reasonably defined by a threshold derived from data (e.g., using the AUC). We also know that some diseases must be defined by expert opinion. The problem comes when expert opinion is used to replace the AUC, and a hard real-number threshold is derived from expert opinion rather than from the AUC. That's where it turns into pseudoscience, because the guessed numbers are now available for math such as addition and for next-level cutoffs (such as thirty 10-second attenuations/hr). Guessing a set of real numbers and then performing math on those numbers embellishes the guess and is a form of pathological science (Langmuir), like chiropractic and homeopathic medicine.
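The contrast drawn here, a cutpoint derived from data versus one written down by experts, can be sketched with a toy example. Youden's J is used only as one common ROC-based criterion; the scores, labels, and function name are invented for illustration:

```python
# Toy sketch: deriving a cutpoint from labeled data by maximizing Youden's J
# (sensitivity + specificity - 1) over the observed scores, instead of guessing.

def youden_threshold(scores, labels):
    """Return (cutpoint, J) maximizing sensitivity + specificity - 1."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        sens = sum(s >= t for s in pos) / len(pos)  # true positive rate at t
        spec = sum(s < t for s in neg) / len(neg)   # true negative rate at t
        j = sens + spec - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# Synthetic severity scores: diseased (1) tend to score higher than healthy (0).
scores = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
t, j = youden_threshold(scores, labels)
print(t, j)  # -> 7 1.0
```

The point is not that Youden's J is the right criterion, but that a data-derived cutpoint has an audit trail, whereas a guessed one has only the guesser's authority.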

Next time, ask yourself, "Is this Threshold Science?" If it is, you are wasting your time doing the stats between the measurements, and you could help the PI avoid wasting her precious time by teaching her about this fading, but still very real, pitfall.

I assume you mean thresholds determined using ROC curves. Since ROC curves are tailored to retrospective case-control data and not cohort data (they condition on the outcome and sample from X), they are not consistent with decision making. And a threshold would only be helpful for dichotomous decisions where the utility function is identical for every physician/patient pair.


The paper linked in your post is pretty painful to read. I just skimmed the first part and looked at the tables summarizing RCTs that report CPAP’s effect on important clinical outcomes.

I think any clinician who sees patients with OSA would agree that it's a bad thing: patients feel terrible and their spouses (sleeping in another house) feel terrible. Once their OSA is diagnosed and treated, many are able to lose weight, gain enough energy to start exercising, and get off some of their blood pressure medicines. And we all know the strong association between sleep apnea and atrial fibrillation (another complicated, poorly understood condition for which treatment decisions hinge on a scoring system that's chronically under fire).

Seeing all these improvements in the office repeatedly is so compelling that it's hard to believe we're not just seeing the tip of the iceberg in terms of CPAP's benefits. Surely, treating this condition must also have more profound benefits? My bias is also to believe that this is the case, so why has it been so hard to prove this with RCTs? Are doctors deluding themselves en masse, or does the problem lie with the design of the RCTs that have been used to study this condition over the years?

As you point out, if previous OSA RCTs enrolled patients based on their meeting a threshold score (i.e., AHI) that doesn't reliably distinguish patients with prognostically poor OSA from those with better prognoses, then this is unfortunate. This would mean that we haven't yet given CPAP a fair chance to show its full value. But even if we could somehow develop a more valid way to distinguish prognostically bad OSA from "not so bad" OSA, thus permitting us to enrich RCTs with prognostically poor subsets of OSA patients, wouldn't other factors still preclude conduct of a definitive RCT in this area?

When I stop and think about how best to study OSA, it’s clear that this isn’t easy. Blinding the patient to treatment isn’t possible. Randomizing some to sham CPAP (e.g., settings too low to control their apneas) wouldn’t be enough- spouses would remain sleepless and patients treated with sham would continue to fall asleep repeatedly through the day (though perhaps blinding isn’t as important if the focus is on hard outcomes like death, MI, and stroke). And, given how large and long trials would need to be in order to accrue a sufficient number of events of interest, coupled with driving prohibition for patients with untreated severe OSA (at least in my country), would it even be ethically possible not to treat severe OSA for the duration of a well-designed RCT?

The problems above seem insoluble. So maybe it’s enough to try to demonstrate that people feel and function better when their OSA is controlled (though even these outcome measures probably rely on thresholds…) ?


Yes, thanks for the clarification. In retrospective sepsis evaluation, the AUC is used to define the best combination of markers for generating the threshold. The actual threshold is selected using the ROC curve, usually with titration to very high sensitivity, given the risk of delay and the low risk of antibiotics. We know the severe limitations of this, as there are sepsis phenotypes which will vary with the sample X, so as you point out, all of this is just an attempt to improve on the old thresholds. One approach we take to avoid thresholds and the sample X problem is to show instability indications as images which expand and contract, changing colors as the lab values and vitals worsen or recover, but in clinical medicine at some point an alert has to be sent to someone.
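The "titration to very high sensitivity" step described above can be sketched as scanning candidate cutpoints, keeping those that meet a sensitivity floor, and taking the most specific of them. The data and function name here are hypothetical, for illustration only:

```python
# Toy sketch: among candidate cutpoints, keep only those meeting a sensitivity
# floor (e.g., to avoid missing sepsis), then take the most specific one.

def high_sensitivity_threshold(scores, labels, min_sens=0.95):
    """Return (specificity, cutpoint) for the most specific cutpoint
    whose sensitivity is at least `min_sens`, or None if none qualifies."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best = None  # (specificity, threshold)
    for t in sorted(set(scores)):
        sens = sum(s >= t for s in pos) / len(pos)
        if sens < min_sens:
            continue  # cutpoint misses too many true cases
        spec = sum(s < t for s in neg) / len(neg)
        if best is None or spec > best[0]:
            best = (spec, t)
    return best

# Synthetic marker scores: septic (1) tend to score higher than non-septic (0).
scores = [1, 2, 3, 4, 5, 5, 6, 7, 8, 9]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
spec, t = high_sensitivity_threshold(scores, labels, min_sens=1.0)
print(t, spec)  # -> 5 0.8
```

The asymmetric utility (delay is costly, antibiotics are cheap) is what justifies anchoring on sensitivity first and accepting whatever specificity remains.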

Of course I agree with the limitations of ROC curves and thresholds, but I do not argue against them wholesale.
It would seem that thresholds should be derived from the study of relevant retrospective data, tested prospectively, and then promulgated with the appropriate teaching about their sample-X-specific and utility-specific limitations. If that is not correct, how should they be derived?

However, the point of the post is that "Threshold Science," which uses guessed thresholds as gold standards (against which everything else is tested in some disciplines, like sleep medicine), is a problem which must be engaged if progress is to be made.

Here is a quote from Dr. Punjabi in 2016.

“If the AHI remains the “holy grail” in sleep and respiratory medicine, the science will certainly not advance, which in turn will retard the clinical and public health response to this disease.”

They don’t seem to understand that they are practicing "Threshold Science" and that this is why the AHI doesn’t work as a research metric; they just know it doesn’t work. They need guidance.


I agree. I was trained to read PSGs around 1985. CPAP was a wooden box. Patients with OSA were so severe, yet 99.9% were undiagnosed, and for those who were profound (as with refractory right heart failure and SpO2 cycles in the 50s), a primary treatment was tracheostomy.

CPAP should have merited the Nobel Prize. Many patients had refractory bilateral leg edema with chronic cellulitis (called "lymphedema" because the cardiac echo was negative). Others had severe right heart failure. Many could not stay awake for more than a couple of hours.

The discovery of OSA was ahead of the science of monitoring. We counted apneas with a thumb counter as we paged through inch-high stacks of 30-second epochs of paper data. Counting 10+ second events was what we could do. This counting worked clinically because there was one treatment for all, regardless of severity. Unfortunately, the counting was turned into a count-based "Threshold Science," which was common in the '80s, and this counting as an RCT gold standard and severity metric was codified and persists in 2021.

I know we are not fooled en masse, but in some ways I feel uncomfortable when I hear the general statement "In medicine we should treat without trial-based evidence." That's not the way we always practice medicine. Maybe that's because sometimes we don't know how to do the studies to prove that what we see with our eyes is the truth.

However, the problem here is clear. 1980s “Threshold Science” does not work.


Regarding CPAP for OSA, my default position, as with all studies, is to stay faithful to the raw data. The threshold definition of OSA is, to my knowledge, unfounded, and reducing severe OSA to moderate OSA would be an accomplishment.
