Sensitivity, specificity, and ROC curves are not needed for good medical decision making

I remember about 10-20 yrs ago we got an ECG machine that incorporated the ACI-TIPI rule for ACS. In addition to three factors recorded by the ECG machine, you entered four clinical factors, and got a “predictive value”, rather than a sensitivity or specificity. The ECG machine actually printed the estimated probability of ACS on the ECG itself, and it was always alarmingly high. Where we worked this was not helpful because, as at most inner-city public hospital/trauma centers, the prior prob of ACS was pretty low, unlike at the hospitals in the Northeast where the rule was derived and validated.

There is some appeal in the sn/sp approach, even when it is not well understood: doctors tend to think in terms of discrete diseases, which they traditionally spend the entire second year of medical school listing & memorizing (eg 80% of RMSF patients have a headache). Sensitivity is a quantitative description of a disease, though possibly of little use in discriminating between dz and non-dz. But together, sens & spec to my mind are the “weight” or “yield” of the test, useful for comparing with other tests (eg choosing between CT vs cath vs hospital admission, my main task in the ER). Similar tests may involve different degrees of cost or risk, or outcome utilities may vary: all part of intuitive decision making. This “yield” is much more transportable across the medical enterprise (Northeast vs South, office practice, ERs, specialty practice, ICU, urgent care centers, retail clinics, public hospitals, private clinics, etc) than the model-generated posterior probability. The base rate shift (the “inverted probability” problem) and the concepts of spectrum and verification bias are a couple of fundamental modifying principles that every doc really ought to be taught. AUROC is pretty irrelevant too: useful decision thresholds occur over a limited range of the curve, depending on the clinical circumstance.


@Michael.a.rosenberg I appreciate your comments about the overuse of modelling mortality in hospitalized patients, but one area where I can see value is in using these measures as an assessment of illness severity to guide decision making. Especially in the ICU, where the risk of mortality is so high (by design), I could imagine that any measure indicating a higher risk of mortality could be a stronger indication for aggressive versus supportive treatment.
I think your point about using RCT data to inform big decisions is an important one. But while the original study has more restrictive inclusion criteria by design, I still wonder if there is value in modelling risk in similar patients to help clinicians better treat those who may not have met the original inclusion criteria.

@srpitts This ECG example is really interesting to me. This might be a good illustration of the issue I originally commented on and which you also bring up, specifically how to use this information. While the specific probability estimates you were seeing seem like they were not well calibrated, an understanding of the predictive value of the ECG algorithm still seems valuable to me; however, it may be more valuable in the context of choosing which test to order next than for the specific patient. This might be thought of as similar to the weighting or yield of the test, as you mention. I also have had the experience of memorizing which test should be used in which context over the course of diagnosing some pathology, so I wonder if a better way to describe this is possible using a predictive approach. For example, I have tried to quantify this in studies using measures like the Gini coefficient.
You also touch on another key concept I have noticed, that doctors tend to think in terms of discrete diseases. While we know this is what patients and clinicians expect, my experience so far has been that a discrete disease rarely exists - even just by considering comorbidities. So this brings me back to my original question - how do we best convey the most useful information to clinicians, in the context of complicated clinical presentations, while conserving the maximum amount of information possible?

Models have to be useful, and for that they have to be goal-oriented. The objective, therefore, is to understand what is not known about a patient and then design a model that tells you what you don’t know. But if you want your model to be useful, development has to be coupled to decision making. This does not happen on many occasions, and it is something that statisticians/mathematicians are not able to grasp without clinical feedback, but my experience is that clinicians themselves sometimes do not understand the problem either. Take, for example, the MASCC score for predicting complications in febrile neutropenia. Febrile neutropenia is a condition with two types of patients: those who appear stable, whose clinical course is unknown, and the very critical, who require urgent treatment or die. The question is not whether you have to intensify therapy for critical patients, which is obvious. The problem is how to individualise the intensity of treatment in non-critical patients. The MASCC score assesses them all equally, although the approach to decision-making is not similar at all, which generates contradictions. You may end up reclassifying critically ill patients as low-risk. When all patients are evaluated in a mixed cohort, the sensitivity of the model is between 60% and 70%, but when only the apparently stable ones are evaluated, which are the ones that really need the prognostic information, the model has a sensitivity of less than 30% (it misses 70% of complications in stable patients who might be sent home for ambulatory treatment). Therefore, yes, I agree that sensitivity, specificity and area under the curve are not the only requirements for evaluating prognostic models, but I also believe that the issue is not only mathematical, but clinical.
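
Since the numbers matter here, a tiny sketch (in Python, with made-up counts rather than actual MASCC data) of how pooling the obviously critical patients with the apparently stable ones can make overall sensitivity look acceptable while the stable-only sensitivity stays poor:

```python
# Illustrative only: invented counts, not MASCC data, showing how a mixed
# cohort can mask poor sensitivity in the subgroup that actually needs prediction.

def sensitivity(tp, fn):
    return tp / (tp + fn)

# Hypothetical obviously-critical subgroup: 100 complications, 90 flagged high-risk
crit_tp, crit_fn = 90, 10
# Hypothetical apparently-stable subgroup: 100 complications, only 28 flagged high-risk
stable_tp, stable_fn = 28, 72

print(f"critical-only sensitivity: {sensitivity(crit_tp, crit_fn):.0%}")      # 90%
print(f"stable-only sensitivity:   {sensitivity(stable_tp, stable_fn):.0%}")  # 28%
print(f"mixed-cohort sensitivity:  {sensitivity(crit_tp + stable_tp, crit_fn + stable_fn):.0%}")  # 59%
```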

What you wrote raises a lot of great issues. A general comment is that in diagnostic research we too seldom involve patients in setting utilities for various actions. I recommend a three-phase approach:

  1. Invest the time and money in eliciting utilities (my preference is the time tradeoff variety) for various actions and outcomes, from non-diseased individuals who may someday be at risk for the conditions under study
  2. Combine these utilities with validated risk estimates to be able to solve for actions that maximize expected utility (see the sketch after this list)
  3. Apply what is learned to real-time decision making. If this isn’t feasible, develop an approximation to the ideal solution that can be used routinely.
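
A minimal sketch of step 2, assuming a binary treat/wait decision; the utilities and the risk estimate below are hypothetical placeholders, not elicited values:

```python
# Minimal sketch: combine elicited utilities with a validated risk estimate to
# pick the action with the highest expected utility. All numbers are placeholders.

utilities = {
    # utility of (action, outcome) on a 0-1 scale, e.g. from time-tradeoff elicitation
    ("treat", "disease"):    0.80,
    ("treat", "no_disease"): 0.95,   # treated unnecessarily (side effects, cost)
    ("wait",  "disease"):    0.40,   # disease progresses untreated
    ("wait",  "no_disease"): 1.00,
}

def expected_utility(action, p_disease):
    return (p_disease * utilities[(action, "disease")]
            + (1 - p_disease) * utilities[(action, "no_disease")])

p = 0.22  # validated model's risk estimate for this patient (placeholder)
best = max(["treat", "wait"], key=lambda a: expected_utility(a, p))
print({a: round(expected_utility(a, p), 3) for a in ["treat", "wait"]}, "->", best)
```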

The last step isn’t obvious but may motivate someone to find a solution for a particular clinical problem and patient population.

There is another subtlety to what you wrote: the correct implication that the initial diagnostic workup may be too high-dimensional to be able to have the probabilities and utilities you need for optimum decisions, and that you need heuristics to narrow things down. This leads to ideas such as ranking diagnoses from most likely to least likely to be true. The problem with that heuristic is that if the probability that the most likely diagnosis holds is less than, say, 0.9, the ranking process will hide uncertainties that are so massive that it will cause downstream errors. So it’s hard to avoid formalizing risk estimation.

Michael, you are not alone in making this leap of logic. For reasons detailed here, this logic doesn’t work. This goes back to experimental design in agricultural experiments, which taught us how to design RCTs. In neither agricultural nor medical situations does the worth of the treatment estimate come from representativeness of the objects (whether plots of land or patients). I think you are mixing two ideas. The RCT is designed to provide relative therapeutic effectiveness estimates that are applicable way beyond the study’s inclusion criteria; we then combine the relative estimate with absolute risk estimates to get absolute risk reduction estimates for the various entertained treatment options. The risk estimates may come from the RCT itself if the inclusion criteria were fairly broad, or slightly more commonly, from a diverse observational cohort.
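
As a rough illustration of that two-step logic (relative effect from the RCT, absolute risk from a model or cohort), here is a small sketch; the odds ratio and baseline risks are invented for the example:

```python
# Rough sketch: carry an RCT's relative effect (here an odds ratio) over to
# patients with different baseline risks to get absolute risk reductions.
# The odds ratio and baseline risks are placeholders, not from any real trial.

def risk_on_treatment(baseline_risk, odds_ratio):
    """Apply a treatment odds ratio to a patient's baseline (untreated) risk."""
    odds = baseline_risk / (1 - baseline_risk)
    treated_odds = odds * odds_ratio
    return treated_odds / (1 + treated_odds)

or_treatment = 0.70  # hypothetical relative effect estimated by the RCT
for baseline in (0.05, 0.20, 0.40):  # risk estimates from a model or diverse cohort
    arr = baseline - risk_on_treatment(baseline, or_treatment)
    print(f"baseline risk {baseline:.0%} -> absolute risk reduction {arr:.1%}")
```

The same relative effect translates into a much larger absolute benefit for the higher-risk patient, which is the point of combining the two sources.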

I believe that backwards-time, backwards-information-flow probabilities only give the illusion of usefulness, and you can accomplish all of that using only predictive-mode forward thinking. [Note: Had you admitted that sens and spec varied significantly by patient type I would have been more impressed with this approach.] Sens and spec get medical students thinking indirectly and inefficiently IMHO. But you are quite right that the weight or yield of a test is a useful concept. The weights can come solely from a predictive mode rather than a retrospective one, though. To elaborate, many experts in medical decision making rightfully believe that diagnostic likelihood ratios (LR) are more helpful than sens/spec. The LR+ is the factor that moves a patient from an “uncertain” pre-test risk to a post-positive-test risk (making the huge assumption that the test is binary). The LR- is the factor that moves from the same initial state of uncertainty to the risk were the test negative. If instead of moving from a rather poorly defined initial state to a +/- state one were to think of predicted risk and odds ratios, things are better defined, and a beautiful thing happens: the odds ratio for a binary test in a logistic model is the ratio of LR+ to LR- (LR+ divided by LR-). So the odds ratio may be the weight you seek.

One more thing about LRs. The initial point of uncertainty, usually thought of as background disease prevalence, is actually not very well defined. Instead of using a notion of prevalence, a logistic risk model uses a covariate-specific starting point that is easy to comprehend. An odds ratio moves you from a particular starting point (e.g., age=0 or no risk factors present) to the odds of disease given a degree of positivity of a test.
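
A quick numeric check of that LR/odds-ratio relationship, under the usual simplifying assumption that the test is binary with fixed sens and spec (all values below are arbitrary):

```python
# Small numeric check of the LR / odds-ratio relationship for a binary test,
# under the (usually false) assumption that sens and spec are constants.

sens, spec, prevalence = 0.85, 0.90, 0.10   # arbitrary illustrative values

lr_pos = sens / (1 - spec)          # factor applied to pre-test odds after a positive test
lr_neg = (1 - sens) / spec          # factor applied after a negative test

pre_odds = prevalence / (1 - prevalence)
post_odds_pos = pre_odds * lr_pos
post_odds_neg = pre_odds * lr_neg

# Under these assumptions, the +:- test odds ratio (what a logistic model with
# this single binary predictor estimates) is the ratio of the two post-test
# odds, i.e. LR+ / LR-.
print(post_odds_pos / post_odds_neg)   # 51.0
print(lr_pos / lr_neg)                 # 51.0
```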

One error we’re making in medical education is the implication that disease prevalence (for use in computing the base rate shift you speak of) is well-defined and is an unconditional probability. Neither is true. Prevalence gets in the way of understanding as much as it helps. More useful would be an anchor such as “The probability of dz Y for a minimally symptomatic 30 year old male is …”. This is the baseline-covariate way of thinking.

I have trouble seeing why retrospective quantities are relevant to this discussion. It seems to me that what the clinical situation is crying out for is:

  1. Have expert clinicians and biostatisticians fully collaborate to develop a predictive-mode, prospective-cohort-based outcome model that incorporates the clinical course up until the current time
  2. Make sure the model contains any clinically-sensitive interactions with treatment
  3. Combine with an RCT that estimated relative efficacy if possible
  4. Combine all this with patient utilities
  5. Don’t use labels like “low risk”. “Low” is in the eyes of the beholder, and ignores patient utilities.
  6. When we can’t elicit patient utilities ahead of time (using normal volunteers, typically) or in real-time from the actual patient or family involved, get utilities (ahead of time) from a panel of medical experts.

I appreciate your comments, which got me thinking a bit more about a topic I really love, because ER docs are the ultimate probability workers, seeing a steady stream of almost undifferentiated patients (septic shock, then vaginal bleed, then persistent cough, then an opioid user with a new pain). Sometimes we are so ignorant that the “prior probability” of a given disease is exactly 0.5, but mostly we have a vague idea (“we never see dz X here”). This is indeed an anchor, but it is not based on a covariate vector, or if so, then a very primitive one. A simple rule that most students find completely intuitive is the mnemonic SNOUT (a sensitive test, when negative, rules out), important when your main job is “not missing a serious case”, like pulmonary embolus. And the converse is SPIN (a specific test, when positive, rules in), also useful in the ER when you can “rule in” a trivial self-limited illness like shingles. Even LR+ and LR- may not be worth the additional trouble of looking up or calculating the ratio, and the diagnostic odds ratio is definitely a bridge too far, because it is even harder to interpret, and is only a measure of discrimination rather than calibration. I don’t think clinicians are fooled into taking “spin” and “snout” too seriously. The rule is truly a “heuristic” to be used when you don’t have time to look something up or consult a validated model, i.e. not a formal procedure. The good news is that ER docs as a group more and more consult formal models online, on the wonderful website MDCalc.
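
For what it’s worth, the arithmetic behind SNOUT fits in a few lines; the sens/spec and pre-test probabilities below are made up, but they show why the mnemonic works at low-to-moderate pre-test probability and starts to strain when the pre-test probability is high:

```python
# The arithmetic behind "SNOUT": a negative result on a highly sensitive test
# drives the post-test probability down, provided the pre-test probability was
# not too high. Sens/spec and pre-test probabilities are made-up values.

def post_test_prob_negative(pretest, sens, spec):
    lr_neg = (1 - sens) / spec
    odds = pretest / (1 - pretest) * lr_neg
    return odds / (1 + odds)

sens, spec = 0.96, 0.40   # e.g. a very sensitive, not very specific rule-out test
for pretest in (0.05, 0.20, 0.50):
    post = post_test_prob_negative(pretest, sens, spec)
    print(f"pre-test {pretest:.0%} -> post-negative-test {post:.1%}")
    # -> 0.5%, 2.4%, 9.1%
```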


Couldn’t agree more with @albertoca about the need for any quantitative approach to be directly tied to a specific question in clinical context. This was my point about prediction of mortality in hospitalized patients; we just don’t think that broadly as clinicians when evaluating a patient, even the critically ill ones. Management is almost always focused on a single decision or handful of decisions aimed at accomplishing near-term goals, usually based on an understanding of the biological process felt to be relevant at that time. I won’t disagree that there is some level of latent ‘gut feeling’ about how sick a patient is, which sometimes guides how aggressive we might need to be in terms of interventions, but this entity is poorly understood on an objective level (i.e., Malcolm Gladwell’s Blink phenomenon), and thus attempts by outside analysts and statisticians to quantify it tend to be overly simplistic, and, as @srpitts notes, often to a comical degree.

Regarding RCTs, I don’t disagree that the principal reason for conducting an RCT is to use randomness as a surrogate for counterfactual exposure to establish the existence and degree of causality. However, like the simplistic SNOUT/SPIN interpretation of sensitivity and specificity for tests run in the ER, there is utility in a simplistic (binary) interpretation of the results of an RCT, especially when combined with clinical experience and an understanding of physiology and pathophysiology to guide a specific decision for a specific patient. This isn’t to say that this application is the global optimum for evidence-based medicine, but if there’s a better way to use data to guide efficient decision-making, it certainly hasn’t made its presence known to the greater medical community. If you don’t believe me, next time you’re in the doctor’s office, ask them to explain how they use sensitivity and specificity in their decisions. Chances are they can’t do better than SNOUT/SPIN; yet, I imagine you still listen to what they might tell you regarding your health…

My challenge to data analysts out there who think they know how we should be using data better in our day-to-day clinical decision-making is to come join us, spend some time in the clinic and see how practical your methods are in application. I’m by no means making the argument that the process can’t or shouldn’t be improved; in fact, I’m working myself from the other side to find ways to bring data into our regular clinical decisions (hence my presence on this forum). But it turns out that humans tend to be pretty good at processing a lot of information in making decisions, and most attempts to use complicated models for that process fall flat on their faces in real-world application. The QWERTY keyboard is not the most efficient method for typing words into a processor, but the barrier to improvement is high. The stakes in clinical decision-making, where lives are at risk and there is a zero error tolerance, are even higher.


Should this be a conditional P(disease X | … ) ?

Yes, sorry - am fixing now.

For the example from BBR, just considering gender

P(ECG + | CAD +, M) != P(ECG + | CAD +, F)

is this what’s meant by sensitivity “varying over patient types”?

And if

P(ECG + | CAD +, M) = P(ECG + | CAD +, F)

then sensitivity would be considered “constant” over patient types?

Because also

P(CAD + | ECG +, M) != P(CAD + | ECG +, F)

So both sensitivity and the posterior probability are not constant over gender. Is your argument that the posterior is amenable to conditioning on gender, whereas sensitivity, by definition, conditions only on disease status (and that although sensitivity technically could condition on other things, no one does that)?


Yes, well written. Researchers and practicing clinicians have been wrongly taught to assume that sens and spec are constants, i.e., are properties only of the test. Sens and spec would have been useful had they actually been unifying/simplifying constants.

On the other hand, we are taught from the get-go that probabilities of disease naturally vary with age, sex, symptoms, etc. Binary and ordinal logistic models handle this automatically, with no new developments or complexities needed. Regression models also allow the possibility of unifying constants. When a diagnostic test does not interact with age, sex, symptoms, etc., the regression coefficients for the test apply to all patient types. When the test is binary (which almost never happens but is routinely pretended), this is the +:- test odds ratio.

To summarize some of the problems with using the backwards-time probabilities sens and spec:

  • They are functions of patient characteristics and not constants. This is especially true for sensitivity when more advanced disease is easier to detect but the clinician dumbed down the problem by pretending the disease is binary.
  • Sens and spec are modified by workup/verification bias, e.g., when males are more likely than females to not get a final diagnosis. Applying complex corrections to sens and spec for workup bias and then using Bayes’ rule to get P(disease | X) makes the complex corrections cancel out.

On the last point, starting with the goal of estimating disease probability rather than estimating sens and spec saves time and lowers complexity, while being more intuitive. A win-win-win.
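
As a concrete (if simplified) sketch of the forward approach being advocated here, one can fit the logistic model directly on data; this uses simulated data, assumes statsmodels is available, and is purely illustrative rather than anyone’s published model:

```python
# Sketch of the "forward" approach: model P(disease | test, covariates) directly
# with binary logistic regression, instead of estimating sens/spec and inverting.
# Simulated data; assumes statsmodels and pandas are installed.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
age = rng.uniform(30, 80, n)
sex = rng.integers(0, 2, n)            # 1 = male (arbitrary coding)
test = rng.integers(0, 2, n)           # binary test result, for illustration only
logit = -6 + 0.05 * age + 0.5 * sex + 1.2 * test
disease = rng.binomial(1, 1 / (1 + np.exp(-logit)))

df = pd.DataFrame(dict(disease=disease, age=age, sex=sex, test=test))
fit = smf.logit("disease ~ age + sex + test", data=df).fit(disp=False)

print(np.exp(fit.params))              # odds ratios; 'test' is the +:- test odds ratio
new_patient = pd.DataFrame(dict(age=[45], sex=[0], test=[1]))
print(fit.predict(new_patient))        # covariate-specific post-test probability
```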


I absolutely agree that diagnostic test research should be focused on estimating an individual patient’s probability of disease, conditioned on patient characteristics and tests. Where I still see the challenge is in adequately translating this information to clinicians.
In my own training, I have noticed that I and many of my colleagues have just been learning what test to order, i.e. differentiating a good test from a bad test, or a screening test from a diagnostic test. Although simplified, I suspect that this is one objective we should keep in mind for these studies: validating discrimination and calibration and then, as @f2harrell recommended, using a likelihood ratio test or AIC to compare the full regression models.
If calibration is presented as I showed above, with probability on the y-axis and the measure/score on the x-axis, then this also allows clinicians to get an estimate of the probability of the outcome for different patients. Again, I suspect they will (as I do) remember discrete thresholds and how they will manage a patient differently at these thresholds, but at least these decisions are based on the probabilities generated from a full model, not a single point like sens/spec, and it is then the clinician taking into account expected utility in their setting.
I recognize that there is a trade-off between entering too many characteristics into the model (i.e. every clinical measure or physical exam finding), making it too specific and dependent on all measures being available to be accurate, versus entering fewer measures so it is more generalizable but thereby sacrificing accuracy. My approach to navigating this trade-off so far has been to use the information likely to be available to the clinician at the time of applying the test (which is minimal for me since I focus on EMS/prehospital care). Then if we focus on the slope of the line (i.e. the odds ratio) as discussed above, instead of just the absolute estimates, we are left with a good assessment of how much weight the measure should carry in our decision making.
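
A minimal sketch of the kind of plot described above (score on the x-axis, model-based probability on the y-axis, with binned observed proportions overlaid), using simulated data and assuming pandas/statsmodels/matplotlib are available:

```python
# Minimal sketch: plot the model-based probability of the outcome against a
# score, with decile-binned observed proportions overlaid. Simulated data only.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
score = rng.normal(0, 1, 1500)                         # some prehospital score
p_true = 1 / (1 + np.exp(-(-1.0 + 1.3 * score)))
outcome = rng.binomial(1, p_true)
df = pd.DataFrame(dict(score=score, outcome=outcome))

fit = smf.logit("outcome ~ score", data=df).fit(disp=False)
grid = pd.DataFrame(dict(score=np.linspace(score.min(), score.max(), 100)))
plt.plot(grid.score, fit.predict(grid), label="model probability")

binned = df.groupby(pd.qcut(df.score, 10), observed=True).mean()  # decile bins
plt.scatter(binned.score, binned.outcome, label="observed proportion")
plt.xlabel("score"); plt.ylabel("P(outcome)"); plt.legend(); plt.show()
```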


Well, you start off with nothing. Then you get more information, like gender (analogous to a first probability revision), which you build into your initial mental picture = P(Dz | findings in the first 5 minutes). The important “findings” worthy of more formal study and quantitative analysis are 1) decision rules, and 2) single findings that have potentially giant dx odds ratios (eg a febrile pt from Congo in 2019), or are risky (eg central venous pressure), or expensive (imaging, lab). These are the tests that sometimes get published with LRs, on the assumption that you use them wisely, and often in tables when there is a lot of heterogeneity, eg by sex. Not unlike the HR in an RCT done in a population not like your patient. I’ll grant you that the idea of tabulating compendia of LRs as if they were constant properties of a given test/dz couple has faltered in the commercial sense: I think that the book “Decision Making in Imaging” is probably out of print, and certainly out of date, but mainly because of the march of progress, i.e. the moving target of technology assessment in the modern era.


It’s important to distinguish between sensitivity/specificity and predictive value. Sens/spec have a role, but it is in screening, where the objective is to winnow out the people with the problem from a clinical population. However, if you are dealing with decisions about individual patients, then they are not a great deal of help.

For example, nuchal translucency scanning in pregnancy was only 72% sensitive and 96% specific. The positive predictive value was just 5.3%. But here’s the thing: the negative predictive value was 99.9%. Because it’s non-invasive, it can winnow down your original population by about 95%. The remaining 5% need to be reassured that most positive NT scans are false positives, and some of them will certainly go forward for more invasive testing. However, the value of the test lies in its ability to give a high degree of reassurance to those 95% of couples with negative scans.
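
The arithmetic behind those figures is simple to reproduce; the screening-population prevalence of roughly 1 in 320 below is my assumption, chosen because it approximately recovers the quoted PPV and NPV:

```python
# Arithmetic behind the figures above. The assumed prevalence of about 1 in 320
# is illustrative; the point is how strongly PPV and NPV depend on it.

sens, spec, prev = 0.72, 0.96, 1 / 320

ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
frac_negative = spec * (1 - prev) + (1 - sens) * prev   # fraction winnowed out

print(f"PPV {ppv:.1%}, NPV {npv:.2%}, screen-negative fraction {frac_negative:.0%}")
# -> PPV 5.3%, NPV 99.91%, about 96% screen negative
```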

So perhaps sens/spec are being criticised for doing poorly a job that they are not meant to do at all.

It’s also worth noting that statisticians operate on what Gerd Gigerenzer calls “The God Model” where all relevant information is available simultaneously and there are computational resources to fit it into a prediction model. In real life, information is generally available sequentially. A patient presenting with chest pain will present with information about age and sex right away, and getting an ECG done is quick. If the ECG shows an MI, then treatment must be initiated immediately, without waiting to see what the cardiac enzymes look like. So clinical decision making tends to operate on a frugal tree model, where decisions are made by examining the predictive factors one at a time.
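
A toy sketch of that sequential, one-cue-at-a-time structure; the cues, their order, and the actions are entirely hypothetical and only illustrate the shape of a frugal-tree decision:

```python
# Toy illustration of the sequential ("frugal tree") structure described above:
# cues are examined one at a time, each either forcing a decision or passing
# the patient on to the next cue. Cues, order, and actions are hypothetical.

def chest_pain_triage(ecg_shows_stemi: bool, troponin_elevated: bool,
                      high_risk_features: bool) -> str:
    if ecg_shows_stemi:            # first cue available; decide immediately
        return "activate cath lab"
    if troponin_elevated:          # next cue, available somewhat later
        return "admit, treat as NSTEMI"
    if high_risk_features:         # e.g. ongoing pain, known CAD
        return "observe and repeat testing"
    return "consider discharge with follow-up"

print(chest_pain_triage(False, True, False))   # -> "admit, treat as NSTEMI"
```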

In this respect, what many statisticians need to learn is how clinical decisions are made. Otherwise we will build endless ‘hobby’ models that will never really impact on clinical practice. I say this rather gloomily, reflecting that after 50 years of functions to estimate cardiovascular risk, a very large proportion of people are being managed ‘sub-optimally’ (a medical term indicating the sort of train wreck that you shouldn’t discuss aloud in front of the patient or their grieving relatives).


Well said Ronan. On the possible use of sens and spec for screening, I have doubts even there. I think that pre- and post-test probabilities of disease are simpler concepts and more directly actionable.

Frank, you bring up some good points as to why sens and spec are not useful. Backwards-time backwards-information-flow probabilities/ transposed conditionals resonate most with me.

In the literature, sens and spec are described as being properties of the test, independent of prevalence - this is the advantage I usually see claimed for sens and spec over PPV and NPV. This does not make much sense to me. It seems to me that if the test is developed and validated using an appropriate population, then the predictive statistics don’t need to be independent of prevalence. Could someone explain why this is or is not really an advantage?

I think a ROC-like curve with an AUC statistic would be more useful if the two axes were PPV and NPV. Why don’t I ever see this used?

To the last question, it would no longer be an ROC curve :grinning:

The independence of prevalence is a hope (and is sometimes true) because it allows a simple application of Bayes’ formula to be used, if sens and spec are constant. The fact that sens and spec depend on covariates means that this whole process is an approximation and may be an inadequate one.

This entire paragraph is fantastic, and worth re-reading multiple times. Thank you Frank.


Agree. In my experience these concepts are not of much practical value, and most docs have to look up the terms to interpret studies. Which we do rather poorly.


The main criticism I’m hearing is regarding how we teach the ideas around diagnostic evaluations. Ideally, we would have a probabilistic statement regarding the presence or absence of disease given a test result. The shortcoming of this approach is that we lack predictive models for various combinations of tests and patient factors. In general, a physician’s experience is one where a test is deemed excellent, good, or poor for meeting some diagnostic criterion. We usually base our assumptions about a test on how its sensitivity/specificity were reported in a select cohort or case-control type evaluation.

Again, I think understanding direct probabilities is important, and perhaps emphasizing the points you make is relevant to how we report diagnostic values in the literature. But from a practical sense, we are a long way from changing how we clinically interpret test results. The huge spectrum of patient presentations, values, and treatment decisions won’t fit into neat probabilistic tools. I would like to see better integration of personalized risk for both patients and clinicians to guide treatment decisions. But I don’t believe this issue will change how we decide to respond to a troponin level vs. non-specific testing like ESRs or d-dimers.
