Sensitivity, specificity, and ROC curves are not needed for good medical decision making

Frank, you bring up some good points as to why sens and spec are not useful. The backwards-time, backwards-information-flow probabilities (transposed conditionals) resonate most with me.

In the literature, sens and spec are described as properties of the test, independent of prevalence; this is the advantage I usually see claimed for sens and spec over PPV and NPV. This does not make much sense to me. It seems to me that if the test is developed and validated using an appropriate population, then the predictive statistics don’t need to be independent of prevalence. Could someone explain why this is or is not really an advantage?

I think a ROC-like curve with an AUC statistic would be more useful if the two axes were PPV and NPV. Why don’t I ever see this used?

To the last question, it would no longer be an ROC curve :grinning:

The independence of prevalence is a hope (and is sometimes true) because it allows a simple application of Bayes’ formula to be used, if sens and spec are constant. The fact that sens and spec depend on covariates means that this whole process is an approximation and may be an inadequate one.
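To make the "simple application of Bayes’ formula" concrete, here is a minimal sketch that treats sens and spec as fixed constants, which is exactly the assumption being questioned above. The function name `predictive_values` and the numbers (sens = 0.90, spec = 0.80) are hypothetical, chosen only for illustration.

```python
def predictive_values(sens, spec, prevalence):
    """Return (PPV, NPV) via Bayes' formula, assuming sens and spec are constants."""
    tp = sens * prevalence                # P(test+, dz)
    fp = (1 - spec) * (1 - prevalence)    # P(test+, no dz)
    fn = (1 - sens) * prevalence          # P(test-, dz)
    tn = spec * (1 - prevalence)          # P(test-, no dz)
    return tp / (tp + fp), tn / (tn + fn)

# Hypothetical test: the same sens and spec applied at three prevalences.
for prev in (0.01, 0.10, 0.50):
    ppv, npv = predictive_values(0.90, 0.80, prev)
    print(f"prevalence={prev:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

The predictive values move a lot with prevalence even though sens and spec are held fixed. If sens and spec themselves vary with covariates, this calculation is only an approximation.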

This entire paragraph is fantastic, and worth re-reading multiple times. Thank you Frank.


Agree. In my experience these concepts are not of much practical value, and most docs have to look up the terms to interpret studies, which we do rather poorly.

The main criticism I’m hearing is regarding how we teach the ideas around diagnostic evaluations. Ideally, we would have a probabilistic statement regarding the presence or absence of disease given a test result. The shortcoming of this approach is that we lack predictive models for various combinations of tests and patient factors. In general, a physician’s experience is one where a test is deemed excellent, good, or poor for meeting some diagnostic criteria. We usually base our assumptions about a test on how its sensitivity/specificity were reported in a select cohort or case-control type evaluation.

Again, I think understanding direct probabilities is important, and perhaps emphasizing the points you make is relevant to how we report diagnostic values in the literature. But in a practical sense, we are a long way from changing how we clinically interpret test results. The huge spectrum of patient presentations, values, and treatment decisions won’t fit into neat probabilistic tools. I would like to see better integration of personalized risk for both patients and clinicians to guide treatment decisions. But I don’t believe this issue will change how we decide to respond to a troponin level vs. non-specific testing like ESRs or d-dimers.


Extremely well said. This should push us to argue for more prospective cohort studies that can provide the needed data for multivariable diagnostic modeling. Patient registries can assist greatly in this endeavor.


There is an important distinction between probabilistic thinking and probabilistic tools. I agree, we are a long way from using tools to model each disease/outcome, and I would argue that is not practical or helpful. But I actually think we do teach probabilistic thinking, at least in my medical education thus far. For example, we are encouraged to come up with a DDx list of a minimum of 3 diagnoses in decreasing order of probability, based on the information available at that time. This is where I think the diagnostic test studies could change and better inform this process. If we were to construct simple models based on the key information available, and provide clinicians with the probabilities given that information, it would help them understand which diagnoses are more and less likely. Further, it would give them an idea of HOW likely, which helps determine if further testing is required to confirm the diagnosis. These do not (and should not, in my opinion) need to be specific prognostic tools, meant to be calculated at the patient’s bedside, but should instead be reflections of what the probability is given the test result, thereby giving clinicians an idea of what the test is telling them about their patient.
To my eye, this approach to diagnostic studies actually involves fewer assumptions than reporting sens/spec, but the trade-off is that it requires more interpretation by readers. It is because of this limitation that I am trying to find ways to improve translation.

To check my understanding: suppose we use a model and find p(disease | age, sex). We then do some test; now we want p(dz | age, sex, test), but suppose that model doesn’t exist. Hence we can use Bayes’ rule with the model’s p(disease | age, sex) as a prior and the sens/spec of the test to make the multiplier. In this case, sens/spec of the test were necessary (?)

Sensitivity and specificity are really functions of patient characteristics such as age and sex, but no one seems to take that into account. So we really don’t have what we need for Bayes’ rule either. Ideally you’d like to have an odds ratio for the test (ratio of LR+ to LR-) that is conditional on age and sex. You could make a crude logistic model if you knew the likelihood ratios. We need more prospective cohorts or registries to derive logistic models the good old fashioned way. That would also allow us to handle continuous test outputs, multiple test outputs, interaction between test effect and age or sex, and many other goodies.
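As a rough sketch of the "crude logistic model" idea, the code below writes the update on the log-odds scale: the intercept shift is log(LR-) and the coefficient on a binary test result is log(LR+/LR-), i.e. the log odds ratio for the test. The function name `crude_posterior` and the numbers are hypothetical, and the sketch assumes the likelihood ratios do not vary with age or sex, which is the very assumption criticized above.

```python
import math

def logit(p):
    """Log odds of a probability."""
    return math.log(p / (1 - p))

def expit(x):
    """Inverse of logit."""
    return 1 / (1 + math.exp(-x))

def crude_posterior(prior, test_positive, lr_pos, lr_neg):
    """logit P(dz | X, test) = logit P(dz | X) + log(LR-) + test * log(LR+/LR-).

    `prior` is P(dz | X) from some existing model; `test_positive` is 0 or 1.
    Assumes LR+ and LR- are constant across patients (a strong assumption).
    """
    log_odds_ratio = math.log(lr_pos / lr_neg)
    return expit(logit(prior) + math.log(lr_neg) + test_positive * log_odds_ratio)

# Hypothetical values: prior 0.20, LR+ = 4.5, LR- = 0.125
print(crude_posterior(0.20, 1, 4.5, 0.125))  # probability after a positive test
print(crude_posterior(0.20, 0, 4.5, 0.125))  # probability after a negative test
```

A model fitted to a prospective cohort would estimate these coefficients from data and could let the test effect interact with age or sex, rather than fixing it in advance as done here.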

So if anything we needed p(test | age, sex)

To make p(dz | age, sex, test) = p(test | age, sex, dz) p(dz | age, sex) / p(test | age, sex)?

So p(test | dz), i.e. sensitivity, is useless here, and in general whenever the prior conditions on anything.

That’s mainly correct, and now note the silliness of the backwards-information-flow, three-left-turns-to-make-a-right-turn approach. By the time you get sens and spec right, you have to recognize that they are not unifying constants and that the entire process needed to get P(dz | X) is more complicated than directly estimating P(dz | X) in the first place. So it all boils down to data availability and, lacking the right data, whether sens and spec are constant enough that simple approximations will suffice for a given situation. That will depend, among other things, on the disease spectrum and the types of predictors available.
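A small illustration of this point, using entirely made-up cohort counts in which sensitivity and specificity differ by age group. It compares P(dz | age, test+) estimated directly from the cohort with the value obtained by plugging pooled sens and spec into Bayes’ rule with an age-specific prior. All names and numbers are hypothetical.

```python
# Hypothetical cohort counts keyed by (age_group, diseased, test_positive).
counts = {
    ("young", 1, 1): 30, ("young", 1, 0): 20,
    ("young", 0, 1): 45, ("young", 0, 0): 405,
    ("old", 1, 1): 180, ("old", 1, 0): 20,
    ("old", 0, 1): 90, ("old", 0, 0): 210,
}

def total(age=None, dz=None, test=None):
    """Sum counts over the cells matching whichever arguments are given."""
    return sum(
        n for (a, d, t), n in counts.items()
        if (age is None or a == age)
        and (dz is None or d == dz)
        and (test is None or t == test)
    )

def direct(age):
    """P(dz | age, test+) estimated directly from the cohort."""
    return total(age, 1, 1) / total(age, test=1)

def via_pooled_bayes(age):
    """Bayes' rule with the age-specific prior but pooled sens and spec."""
    sens = total(dz=1, test=1) / total(dz=1)
    spec = total(dz=0, test=0) / total(dz=0)
    prior = total(age, dz=1) / total(age)
    tp = sens * prior
    fp = (1 - spec) * (1 - prior)
    return tp / (tp + fp)

for age in ("young", "old"):
    print(age, round(direct(age), 3), round(via_pooled_bayes(age), 3))
```

With these invented counts the two estimates disagree in both age groups, because the pooled sens and spec do not apply to either group. How large the gap is in practice depends on how much sens and spec really vary with the covariates.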

Note that starting with P(dz | X) opens us up to many beautiful features: Treating disease as ordinal instead of binary, having multiple test outputs, allowing test outputs to interact with other variables, etc.

I would add a caveat: we are always learning to improve our ability to interpret tests. Novel tests, test improvements, or even the failure of memory all contribute to the need for re-education.

Having a comparison list of tests with LRs is much easier to read than a list with sens/spec, and lends itself to better conceptualization of the magnitude and direction of change, as well as the multiplicative effects of multiple different tests.
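The multiplicative effect can be sketched in a few lines: convert the pretest probability to odds, multiply by each test’s LR, and convert back. The helper `update_with_lrs` and the LR values are hypothetical, and the multiplication is only valid if the tests are conditionally independent given disease status and the LRs are constant across patients.

```python
def update_with_lrs(pretest_prob, likelihood_ratios):
    """Posttest probability after applying each LR to the pretest odds.

    Assumes the tests are conditionally independent given disease status
    and that each LR applies to this patient (both are assumptions).
    """
    odds = pretest_prob / (1 - pretest_prob)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)

# Hypothetical: pretest 0.20, one positive test (LR 4.0), one negative test (LR 0.5)
print(update_with_lrs(0.20, [4.0, 0.5]))
```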

What about “updating” as new information is found? Say a doctor sees a patient for the first time and is worried about risk of MI, so they use a model to get P(dz | gender, cholesterol, smoking status). The patient returns next year because an uncle just had an MI. To incorporate this new information, isn’t using Bayes’ rule necessary*? I gather from your comments that the answer is no, that there’s a way to directly estimate the posterior P(dz | gender, cholesterol, smoking status, relative with MI) without Bayes’ rule, but I haven’t been able to work it out. Or are you suggesting we just build a model for P(dz | gender, cholesterol, smoking status, relative with MI)?

*But if there had been a study to get the things necessary for Bayes’ rule to do this update, they probably could have just fitted a logistic regression in the first place…

Why not mine the EHR to figure out what combinations are most common and build models for those?

Yes, or in the absence of such data make various assumptions and use Bayes’ rule to update.