Sensitivity, specificity, and ROC curves are not needed for good medical decision making

I absolutely agree that diagnostic test research should be focussed on estimating the individual patient prevalence of disease, conditioned on patient characteristics and tests. Where I still see the challenge is adequately translating this information to clinicians.
In my own training, I have noticed that many of my colleagues and I have mostly been learning which test to order, i.e. differentiating a good test from a bad test, or a screening test from a diagnostic test. Although simplified, I suspect that this is one objective we should keep in mind for these studies: validate discrimination and calibration, and then, as @f2harrell recommended, use a likelihood ratio test or AIC to compare the full regression models.
If calibration is presented as I showed above, with probability on the y-axis and the measure/score on the x-axis, then this also allows clinicians to get an estimate of the probability of the outcome for different patients. Again I suspect they will (as I do) remember discrete thresholds and how they will manage a patient differently at these thresholds, but at least these decisions are based on the probabilities generated from a full model, not a single point like sens/spec, and it is then the clinician taking into account expected utility in their setting.
I recognize that there is a trade-off between entering too many characteristics into the model (i.e. every clinical measure or physical exam finding), therefore making it too specific and dependent on all measures to be accurate, versus entering fewer measures so it is more generalizable and therefore sacrificing accuracy. My approach to navigating this trade-off so far has been to use the information likely to be available to the clinician at the time of applying the test (which is minimal for me since I focus on EMS/prehospital). Then if we are focussing on the slope of the line (i.e. odds ratio) as discussed above instead of just the absolute estimates, we are left with a good assessment of how much weight the measure should provide to our decision making.


Well you start off with nothing. Then you get more information, like gender, (analogous to first probability revision) which you build in to your initial mental picture = P(Dz|Findings in first 5 minutes). The important “findings” worthy of more formal study and quantitative analysis are 1) decision rules, and 2) single findings that have potentially giant dx odds ratios (eg febrile pt is from Congo in 2019), or are risky (eg central venous pressure), or expensive (imaging, lab). These are the tests that sometimes get published with LRs, making the assumption that you use them wisely, and often in tables - when there is a lot of heterogeneity, eg by sex. Not unlike the HR in an RCT done in a population not like your patient. I’ll grant you that the idea of tabulating compendia of LRs as if they were constant properties of a given test/dz couple has faltered in the commercial sense: I think that the book “Decision making in imaging” is probably out of print, and certainly out of date, but mainly because of the march of progress, i.e. the moving target of technology assessment in the modern era.


It’s important to distinguish between sensitivity/specificity and predictive value. Sens/spec have a role, but in screening, where the objective is to winnow out the people with the problem from a clinical population. However, if you are dealing with decisions about individual patients, then they are not a great deal of help.

For example, nuchal translucency scanning in pregnancy was only 72% sensitive and 96% specific. The positive predictive value was just 5·3%. But here’s the thing: the negative predictive value was 99·9%. Because it’s non-invasive, it can winnow down your original population by about 95%. The remaining 5% need to be reassured that most positive NT scans are false positives, and some of them will certainly go forward for more invasive testing. However, the value of the test lies in its ability to give a high degree of reassurance to those 95% of couples with negative scans.
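
To make the arithmetic behind those figures concrete, here is a minimal Python sketch that turns sensitivity, specificity, and prevalence into predictive values. The sensitivity and specificity are the ones quoted above; the prevalence of 0.3% is an assumption chosen only because it gives values close to the quoted ones, and the function name is illustrative.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a binary test, using Bayes' rule on cell proportions."""
    tp = sensitivity * prevalence              # true positives, as a share of everyone tested
    fn = (1 - sensitivity) * prevalence        # false negatives
    fp = (1 - specificity) * (1 - prevalence)  # false positives
    tn = specificity * (1 - prevalence)        # true negatives
    return tp / (tp + fp), tn / (tn + fn)


# Sensitivity and specificity come from the post; the prevalence is an ASSUMED value.
sens, spec, prev = 0.72, 0.96, 0.003
ppv, npv = predictive_values(sens, spec, prev)

# Share of the screened population that tests negative and can be reassured.
fraction_negative = (1 - sens) * prev + spec * (1 - prev)

print(f"PPV = {ppv:.3f}, NPV = {npv:.4f}, test-negative share = {fraction_negative:.3f}")
```

Changing `prev` shows how strongly both predictive values depend on the population being screened.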

So perhaps sens/spec are getting criticised for doing a job poorly that they are not meant to do at all.

It’s also worth noting that statisticians operate on what Gerd Gigerenzer calls “The God Model” where all relevant information is available simultaneously and there are computational resources to fit it into a prediction model. In real life, information is generally available sequentially. A patient presenting with chest pain will present with information about age and sex right away, and getting an ECG done is quick. If the ECG shows an MI, then treatment must be initiated immediately, without waiting to see what the cardiac enzymes look like. So clinical decision making tends to operate on a frugal tree model, where decisions are made by examining the predictive factors one at a time.

In this respect, what many statisticians need to learn is how clinical decisions are made. Otherwise we will build endless ‘hobby’ models that will never really impact on clinical practice. I say this rather gloomily, reflecting that after 50 years of functions to estimate cardiovascular risk, a very large proportion of people are being managed ‘sub-optimally’ (a medical term indicating the sort of train wreck that you shouldn’t discuss aloud in front of the patient or their grieving relatives).


Well said Ronan. On the possible use of sens and spec for screening, I have doubts even there. I think that pre- and post-test probabilities of disease are simpler concepts and more directly actionable.

Frank, you bring up some good points as to why sens and spec are not useful. Backwards-time backwards-information-flow probabilities/ transposed conditionals resonate most with me.

In the literature, sens and spec are described as being properties of the test, independent of prevalence; this is the advantage I usually see cited for sens and spec over PPV and NPV. This does not make much sense to me. It seems to me that if the test is developed and validated using an appropriate population, then the predictive statistics don’t need to be independent of prevalence. Could someone explain why this is or is not really an advantage?

I think a ROC-like curve with an AUC statistic would be more useful if the two axes were PPV and NPV. Why don’t I ever see this used?

To the last question, it would no longer be an ROC curve :grinning:

The independence of prevalence is a hope (and is sometimes true) because it allows a simple application of Bayes’ formula to be used, if sens and spec are constant. The fact that sens and spec depend on covariates means that this whole process is an approximation and may be an inadequate one.

This entire paragraph is fantastic, and worth re-reading multiple times. Thank you Frank.


Agree. In my experience these concepts are not of much practical value, and most docs have to look up the terms to interpret studies. Which we do rather poorly.


The main criticism I’m hearing is regarding how we teach the ideas around diagnostic evaluations. Ideally, we would have a probabilistic statement regarding the presence or absence of disease given a test result. The shortcoming of this approach is that we lack predictive models for various combinations of tests and patient factors. In general, a physician’s experience is one where a test is deemed excellent, good, or poor for meeting some diagnostic criteria. We usually base our assumptions about a test on how sensitivity/specificity were reported for it in a select cohort or case/control type evaluation.

Again, I think understanding direct probabilities is important, and perhaps emphasizing the points you make is relevant to how we report diagnostic values in the literature. But from a practical sense, we are a long way from changing how we clinically interpret test results. The huge spectrum of patient presentations, values, and treatment decisions won’t fit into neat probabilistic tools. I would like to see better integration of personalized risk for both patients and clinicians to guide treatment decisions. But I don’t believe this issue will change how we decide to respond to a troponin level vs. non-specific testing like ESRs or d-dimers.


Extremely well said. This should push us to argue for more prospective cohort studies that can provide the needed data for multivariable diagnostic modeling. Patient registries can assist greatly in this endeavor.


There is an important distinction between probabilistic thinking and probabilistic tools. I agree, we are a long way from using tools to model each disease/outcome and would argue that is not practical or helpful. But I actually think we do teach probabilistic thinking, at least in my medical education thus far. For example, we are encouraged to come up with a DDx list of a minimum of 3 diagnoses in decreasing probabilistic order, based on the information available at that time. This is where I think the diagnostic test studies could change and better inform this process. If we were to construct simple models based on the key information available, and provide clinicians with the probabilities given that information, it would help them understand which diagnoses are more and less likely. Further, it would give them an idea of HOW likely, which helps determine if further testing is required to confirm the diagnosis. These do not need to be (and should not be, in my opinion) specific prognostic tools, meant to be calculated at the patient’s bedside, but should instead be reflections of what the probability is given the test result, therefore giving the clinicians an idea of what the test is telling them about their patient.
To my eye, this approach to diagnostic studies actually involves fewer assumptions than reporting sens/spec, but the trade-off is that it requires more interpretation by readers. It is because of this limitation that I am trying to find ways to improve translation.


To check my understanding: suppose we use a model and find p(disease | age, sex). We then do some test; now we want p(dz | age, sex, test), but suppose that model doesn’t exist. Hence we can use Bayes rule with the model’s p(disease | age, sex) as a prior and the sens/spec of the test to make the multiplier. In this case, sen/spec of the test were necessary (?)
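
The update described here can be sketched in a few lines of Python. This is only an illustration under the strong assumption that the test's sensitivity and specificity are the same at every age and for both sexes; the function name and all numbers are hypothetical.

```python
def update_with_test(prior, sensitivity, specificity, test_positive):
    """Posterior P(dz | age, sex, test) from a model-based prior P(dz | age, sex).

    ASSUMPTION: sensitivity and specificity do not vary with age or sex.
    """
    if test_positive:
        like_dz, like_no_dz = sensitivity, 1 - specificity
    else:
        like_dz, like_no_dz = 1 - sensitivity, specificity
    numerator = like_dz * prior
    return numerator / (numerator + like_no_dz * (1 - prior))


prior = 0.10  # hypothetical output of a model for p(disease | age, sex)
post_pos = update_with_test(prior, 0.90, 0.80, test_positive=True)
post_neg = update_with_test(prior, 0.90, 0.80, test_positive=False)
print(f"after a positive test: {post_pos:.3f}, after a negative test: {post_neg:.3f}")
```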

Sensitivity and specificity are really functions of patient characteristics such as age and sex, but no one seems to take that into account. So we really don’t have what we need for Bayes’ rule either. Ideally you’d like to have an odds ratio for the test (the ratio of LR+ to LR-) that is conditional on age and sex. You could make a crude logistic model if you knew the likelihood ratios. We need more prospective cohorts or registries to derive logistic models the good old-fashioned way. That would also allow us to handle continuous test outputs, multiple test outputs, interaction between test effect and age or sex, and many other goodies.


So if anything we needed p(test | age, sex)

To make p( dz | age, sex, test) = p(test | age, sex, dz) p(dz | age, sex) / p(test | age, sex) ?

So p(test | dz) or sensitivity is useless here then and in general when the prior conditions on anything


That’s mainly correct, and now note the silliness of the backwards-information-flow three left turns to make a right turn approach. By the time you get sens and spec right, you have to recognize that they are not unifying constants but that the entire process needed to get P(dz | X) is more complicated than directly estimating P(dz | X) in the first place. So it all boils down to data availability, and lacking the right data, are sens and spec constant enough so that simple approximations will suffice for a given situation. That will depend, in addition to other things, on the disease spectrum and the types of predictors available.

Note that starting with P(dz | X) opens us up to many beautiful features: Treating disease as ordinal instead of binary, having multiple test outputs, allowing test outputs to interact with other variables, etc.


I would add a caveat - we are always learning to improve our ability to interpret tests. Novel tests, test improvements, or even the failure of memory all contribute to the need for re-education.

Having a comparison list of tests with LR is much easier to read than a list with sens/spec, and lends itself to better conceptualization of magnitude of directional change, as well as multiplicative effects of multiple different tests.
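
As a rough illustration of the multiplicative effect, the sketch below converts a pre-test probability to odds, multiplies by each test's LR, and converts back. It assumes the tests are conditionally independent given disease status, which often does not hold in practice; the function name and numbers are hypothetical.

```python
import math


def post_test_probability(pre_test_probability, likelihood_ratios):
    """Apply a sequence of likelihood ratios to a pre-test probability.

    ASSUMPTION: the tests are conditionally independent given disease status,
    so their likelihood ratios can simply be multiplied.
    """
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * math.prod(likelihood_ratios)
    return post_odds / (1 + post_odds)


# Hypothetical example: pre-test probability 20%, one finding with LR 4.0
# and a second finding with LR 0.5.
result = post_test_probability(0.20, [4.0, 0.5])
print(f"post-test probability = {result:.3f}")
```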


What about “updating” as new information is found? Say a doctor sees a patient for the first time and is worried about risk of MI, so they use a model to get P(dz | gender, cholesterol, smoking status). The patient returns next year as an uncle just had an MI. To now incorporate this new information, isn’t using Bayes rule necessary*? I kind of gather from your comments that the answer is no, there’s a way to just directly estimate the posterior P(dz | gender, cholesterol, smoking status, relative with MI) without Bayes rule, but I haven’t been able to work it out. Or are you suggesting to just build P(dz | gender, cholesterol, smoking status, relative with MI)?

*But if there had been a study to get the things necessary for Bayes rule to do this update, they probably could have just made a logistic regression in the first place…

Why not mine the EHR to figure out what combinations are most common and build models for those?

Yes, or in the absence of such data make various assumptions and use Bayes’ rule to update.

The following comments are an archive of comments entered under an older blog platform and related to the article “In Machine Learning Predictions for Health Care the Confusion Matrix is a Matrix of Confusion” by Drew Levy. Most of the comments were written in 2018.

Bob Carpenter: Thanks for the post. I’m a big believer in calibration and we’ve been pushing that as a baseline for Stan with our model comparison packages. Basically, we want the sharpest predictions for an empirically calibrated model (following Gneiting and Raftery’s terminology in their JASA calibration paper — I don’t know who invented the terminology in the first place).

With full Bayes, we get calibration for free if our model is well specified; too bad they almost never are, so we have to calibrate empirically.

I believe the fascination of the area under the receiver operating characteristic curve (AUC) for machine learning is because they can use radically uncalibrated techniques like naive Bayes (uncalibrated for real data because of false independence assumptions) to measure systems and not have them look totally idiotic.

As an example, I recently (in the last two or three years) had a discussion with a machine learning professor who was predicting failure of industrial components and couldn’t figure out how to do so when (a) some components failed every year, and (b) their logistic regression wasn’t predicting much higher than 10% chance of failure for any component. I never managed to convince them that their results might just be calibrated—they seemed to be getting the prevalence of component failures roughly right. Naturally, someone pointed them to AUC, so they went with that to produce a result other than 100% precision, 0% recall.

As Frank points out, precision (positive predictive performance in epidemiology language) and recall (sensitivity) are deficient because they don’t count the true negatives. In statistical language, they are not proper scoring rules.

The reason we care about sensitivity and specificity in diagnostic testing, where we eventually have to choose an operating point, or at least choose decisions on a case by case basis is that combined with prevalence, they determine the entire confusion matrix. They also let us tune decisions when the quadrants of the confusion matrix (like false positive and true positive) have different utilities (dollar values, expected life expectancy and quality changes, etc.).
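
A small sketch of that point: given sensitivity, specificity, and prevalence, all four cells of the confusion matrix follow, and attaching a utility to each cell gives an expected utility for the operating point. Every number and name below is hypothetical and chosen only for illustration.

```python
def confusion_matrix(sensitivity, specificity, prevalence, n=1000):
    """Expected cell counts for n subjects at a fixed operating point."""
    diseased = prevalence * n
    healthy = n - diseased
    return {
        "TP": sensitivity * diseased,
        "FN": (1 - sensitivity) * diseased,
        "FP": (1 - specificity) * healthy,
        "TN": specificity * healthy,
    }


def expected_utility(cells, utilities):
    """Average utility per subject, weighting each cell by its expected count."""
    n = sum(cells.values())
    return sum(cells[name] * utilities[name] for name in cells) / n


# Hypothetical operating point and utilities on a 0-to-1 scale.
cells = confusion_matrix(sensitivity=0.90, specificity=0.80, prevalence=0.10)
utilities = {"TP": 0.9, "FN": 0.0, "FP": 0.8, "TN": 1.0}
eu = expected_utility(cells, utilities)
print(cells)
print(f"expected utility per subject = {eu:.3f}")
```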

In real applications, we often don’t have the luxury of varying thresholds. Instead, we have to work to build guidelines with good operating characteristics. Actions might be patient or item-specific, but they’re usually not. Usually it’s “give the patient a mammogram, if anything looks suspicious, escalate to MRI, and then if it still looks suspicious, go with a puncture test, and if that’s clean come back next year” (even though the first two tests have high sensitivity, low specificity and the last is the reverse). Frank can correct me here, but the guidelines I’m familiar with only tend to indicate when to start the process. For example, women over age 50 have higher population prevalence of breast cancer than those over 40, which causes a test with a fixed sensitivity and specificity to become more precise (higher positive predictive accuracy, in epidemiology terms). My wife’s been resisting the follow-ups “just to be sure” (against medical advice!) from sketchy mammograms now that she knows a bit more statistics. We’ve sadly had a friend go through the not unusual [positive, positive, negative] result sequence for mammograms for two years and then die of breast cancer the following year.

In my applied work in natural language, the decisions of our customers almost always meant selecting an unbalanced operating point. For example, our customers for spell checking and automatic question answering wanted high positive predictive performance (they didn’t want false positives), whereas our defense department customers doing intelligence analysis wanted high sensitivity (they were OK with a rate of up to 10 to 1 false positives to positives or sometimes higher depending on importance, if the sensitivity was good).

P.S. The popularity of the Brier score (quadratic loss in a simplex) also puzzles me, though it’s at least a proper scoring rule for classification. Also, it’s not new—it came out of statistics in 1950, about the same time as stochastic gradient descent, another technique that ML people often think was a new invention for big data (“big” is relative to the size of your computer!).

The problem with log loss is that it’s very flat through most of its operating region, so despite also being a proper scoring rule, is not a very sensitive measure. So maybe Brier scoring is better. Do you guys have any insight into that? We’re looking for general recommendations for Stan users.
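
For readers who want to compare the two scores on their own predictions, here is a minimal pure-Python sketch of the Brier score and log loss for binary outcomes. The data are made up, and the clipping constant `eps` is an implementation convenience rather than part of either definition.

```python
import math


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


# Hypothetical outcomes and predicted probabilities.
y = [1, 0, 0, 1]
p = [0.9, 0.2, 0.1, 0.6]
bs = brier_score(y, p)
ll = log_loss(y, p)
print(f"Brier score = {bs:.4f}, log loss = {ll:.4f}")
```

Both are proper scoring rules, so a calibrated model with sharper predictions scores better on each; they differ in how heavily they penalize confident mistakes.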

P.P.S. I had the advantage of my second ML paper (circa 1998 or thereabouts) being overseen by a statistician at Bell Labs, so we took some kind of singular value decomposition approach to dimensionality reduction, used some hokey search-based scoring, then calibrated using logistic regression (we didn’t quite know enough to build an integrated system). We were building a call routing application which mattered for the enormous call center of the customer (USAA bank).

Frank Harrell: Wow Bob. What rich comments. These comments should be more visible, so I suggest you copy them as a new topic if you have time. What you and colleagues are doing in the Bayesian prediction world is incredibly exciting. Thanks for taking the time to write this! Just a few comments: The Brier score has a lot of advantages and there are various decompositions of it into calibration-in-the-large, calibration-in-the-small, and discrimination measures. But its magnitude is still hard to interpret for many, and depends on the prevalence of Y=1. Regarding sens and spec, I feel that even in the best of circumstances for them, they are still not appropriate. Their backwards-looking nature is not consistent with forward-time decision making. And this raises the whole issue of “positive” and “negative”. I try to avoid these terms as much as possible, for reasons stated in the Diagnosis chapter of Biostatistics for Biomedical Research.

My current thinking is that the types of model performance indexes that should be emphasized are a combination of things such as (1) deviance-based measures to describe predictive information/discrimination, (2) explained variation methods related to the variance of predicted values, (3) flexible calibration curve estimation using regression splines with logistic regression or nonparametric smoothers (I still need to understand how this generalizes in the Bayesian world), (4) calibration measures derived from (3). For (4) I routinely use the mean absolute calibration error and the 0.9 quantile of absolute calibration error. One can also temporarily assume a linear calibration and summarize it with a slope and intercept.
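
The linear-calibration summary mentioned at the end can be sketched as a logistic regression of the observed outcome on the logit of the predicted probability. This is one common way to obtain a calibration intercept and slope, not necessarily the exact procedure intended above; the hand-rolled Newton-Raphson fit and the toy data are assumptions made to keep the example dependency-free.

```python
import math


def calibration_intercept_slope(y, p, iterations=25):
    """Fit logit(P(Y=1)) = a + b * logit(p) by Newton-Raphson.

    A slope b < 1 suggests predictions that are too extreme; b = 1 with a = 0
    is what a well calibrated model would give under this linear summary.
    """
    x = [math.log(pi / (1 - pi)) for pi in p]
    a, b = 0.0, 1.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for yi, xi in zip(y, x):
            mu = 1 / (1 + math.exp(-(a + b * xi)))
            w = mu * (1 - mu)
            g_a += yi - mu          # gradient of the log-likelihood
            g_b += (yi - mu) * xi
            h_aa += w               # information matrix entries
            h_ab += w * xi
            h_bb += w * xi * xi
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


# Toy data (hypothetical): predictions of 0.1 and 0.9, but observed event
# rates of 0.2 and 0.8, i.e. the predictions are too extreme.
p = [0.1] * 10 + [0.9] * 10
y = [1] * 2 + [0] * 8 + [1] * 8 + [0] * 2
intercept, slope = calibration_intercept_slope(y, p)
print(f"calibration intercept = {intercept:.3f}, slope = {slope:.3f}")
```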

Drew Levy: Bob: this comment is terrific: it bumps the discussion up to the next level! Thank you.

A key point that you raise and that I will now pay closer attention to is the difference between prediction applications (the specific focus in this post) and “guidelines”: “…build guidelines with good operating characteristics. Actions might be patient or item-specific, but they’re usually not.” This may be another prevalent source of ambiguity and confusion—similar to the subtle difference between prediction and diagnostics.

For our purposes here (clarifying appropriate methods), guidelines might be construed as classification with implications for subsequent action(s), but also with the qualification that it is a heuristic subject to modification by additional information (e.g., clinical judgement). The guideline admits to incomplete information that is not part of the model rationalizing the guideline; whereas the goal of prediction is to formally include as much of that ancillary information (the structure and the uncertainty/randomness) for accurate individualized prognostication. And “classification”, per se, seems like yet a different thing as well. It is not hard to see how something like ‘classification with implications for subsequent actions’ gets confused for prediction/prognostication.

What I am sensing is that the confusion about the specific purposes to which ML and prediction modeling are applied is likely broader than just ‘prediction vs. classification’. There is a lot to unpack here. Elucidating the various objectives, and matching them to appropriate methods, requires more thought, more clarity, and hopefully, eventually, some consensus on definition.

Thank you for advancing the discussion with your insight.

Jeremy Sussman: This piece blames machine learning for a problem that 100% belongs to medical researchers and medical statisticians.

ML practitioners do not use sensitivity and specificity! (More often they use precision and recall, which are even worse, but it’s unfair to blame ML for techniques they don’t use.)

In prediction work for medical journals, we’ve published with AUROC and calibration measures. We often include some slightly newer ones (Brier score, calibration slope), but we’ve never been asked to and no one has ever told us to avoid AUROC or recommended something better.

Please propose things that are better! Ask us to use them. Get them into places like the TRIPOD statement. But this seems unfair.

Frank Harrell: Jeremy there are two reasons that my experience does not at all resonate with yours. First, I serve as a reviewer on a large number of submissions of machine learning exercises for medical journals. I see sensitivity and specificity used very often. Second, precision and recall are improper accuracy scoring rules, i.e., are optimized by a bogus model. Further, it is well known that the c-index (AUROC) is not powerful enough for comparing two models. It is a nice descriptive statistic for a single model, if you want to measure pure discrimination and don’t care (for the moment) about calibration. Soon I’ll be writing a new blog article describing what I feel are the best current approaches for measuring added value of new markers, which relates to your plea for recommendations. Some of this is already covered in a recent talk you’ll see at the bottom of my blog main page. Explained outcome variation is the watchword.
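
To illustrate the remark that the c-index measures pure discrimination and ignores calibration, here is a small pure-Python sketch. It computes the c-index as the proportion of event/non-event pairs in which the event has the higher predicted risk, with ties counting one half, and shows that rescaling every prediction leaves it unchanged. The data and function name are hypothetical.

```python
def c_index(y_true, p_pred):
    """Concordance probability (AUROC) for a binary outcome; ties count 1/2."""
    events = [p for y, p in zip(y_true, p_pred) if y == 1]
    non_events = [p for y, p in zip(y_true, p_pred) if y == 0]
    concordant = 0.0
    for pe in events:
        for pn in non_events:
            if pe > pn:
                concordant += 1.0
            elif pe == pn:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Hypothetical outcomes and predicted risks.
y = [1, 0, 0, 1, 0]
p = [0.9, 0.2, 0.6, 0.6, 0.1]
c_original = c_index(y, p)

# Dividing every prediction by 10 wrecks calibration but keeps the ranking,
# so the c-index does not change.
c_rescaled = c_index(y, [pi / 10 for pi in p])
print(f"c-index = {c_original:.3f}, after rescaling = {c_rescaled:.3f}")
```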

Drew Levy: Jeremy, thank you for your comment and perspective. First, I would like to say that “blame” is not the intent or spirit in which the post was shared. We should all be learning together.

Secondly, I regret that you seem to have missed the main thrust of the piece: the ROC is derived from sensitivity and specificity. I believe that the underlying fundamental mechanics of the AUROC are opaque to most who use it, and that for the purposes of evaluating ML that is intended as a prediction tool in health care, the AUROC, which carries through the conditioning of sensitivity and specificity, does not afford what people think it does.

If in your work you have employed calibration and other methods for evaluation of model performance that have conditioning and information appropriate for prospective prediction, then you are to be applauded. That sets a very good and encouraging example.

You are correct to insist that alternatives should be provided as well. That will be forthcoming.


Frank, sometimes I think we arbitrarily classify measures based on context but should we really do that? They are all inter-related anyway so if we dismiss one we end up dismissing a class of measures. I will take an example from page 43 of Hernan and Robins where heart transplant is the intervention and mortality is the outcome and Pr(death) and n are shown in the cells:

Clearly we can evaluate this using measures of effect as Hernan and Robins did but I could turn this around and consider the mortality as a test of the treatment status. In this case in those where gender=0:
Se = 2/10 and Sp = 9/10
LR+ = Se/(1-Sp) = 0.2/0.1 = 2
Reverting back to a trial RR = 0.2/0.1 = 2
LR- = (1-Se)/Sp = 0.8/0.9 = 0.89
Reverting back to a trial RRc = 0.8/0.9 = 0.89
OR = RR/RRc = LR+/LR- = 2/0.89 = 2.25
Od(AUC) ≈ sqrt(OR) ≈ sqrt(2.25) ≈ 1.5
AUC ≈ 1.5/2.5 ≈ 0.6
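
The chain of calculations above can be checked with a few lines of Python. The values of Se and Sp and the approximation Od(AUC) ≈ sqrt(OR) are taken as stated in the post; the variable names are just for illustration.

```python
import math

se, sp = 2 / 10, 9 / 10          # Se and Sp as given in the post

lr_pos = se / (1 - sp)           # LR+ = Se / (1 - Sp)
lr_neg = (1 - se) / sp           # LR- = (1 - Se) / Sp
odds_ratio = lr_pos / lr_neg     # OR = LR+ / LR-

# The post's approximation: the odds of the AUC is roughly sqrt(OR).
auc_odds = math.sqrt(odds_ratio)
auc = auc_odds / (1 + auc_odds)

print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}, OR = {odds_ratio:.2f}, AUC = {auc:.2f}")
```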

So if we are to query Se and Sp should we then not do the same for the whole series of measures?

Addendum: I am not suggesting that any of them is good in all contexts e.g. RR is useless as an effect measure but useful as a LR etc