Sensitivity, specificity, and ROC curves are not needed for good medical decision making

I believe that sensitivity, specificity, and ROC curves are not useful for medical decision making, e.g., which action to take after getting the results from a diagnostic test. My arguments rest primarily on problems with backwards-time backwards-information-flow probabilities and problems caused by dichotomizing tests and diagnoses. Drew Levy has written eloquently about problems with transposed conditionals. It is important to respect the flow of information by conditioning on what you already know in order to predict what you don’t. Predictions/forecasts of outcomes as well as knowing the consequences of decisions are key to making good decisions.

Assume that we have a reliable (well calibrated) estimate of disease risk, and assume that the precision (e.g., width of a compatibility (aka confidence) interval, or a Bayesian credible interval) is so good that uncertainty in the point estimate may be safely ignored. [In most situations, the training sample size is small enough that precision can’t be ignored, and a full Bayesian calculation of expected utility that uses the entire posterior distribution of risk for an individual patient should be used.]

The decision that optimizes the expected utility (and given the utility for the patient at hand) can be derived as a risk cutoff in this infinite training sample size situation. Nowhere in the risk estimate is the sensitivity (sens), specificity (spec) or an ROC curve necessary. And sens and spec do not figure into formulation of utilities. It follows that the optimum decision does not need sens, spec, or ROC curves in any way. Another way to say this is that incorporating sens and spec into the decision rule is like making 3 left turns instead of a right turn. But once you know the risk of disease and the patient-specific utilities you’re really done.

Here is the derivation. Suppose that there are two possible actions: A and B, where B means “not A”. For example an action may be to get a prostate biopsy after a PSA test in a man. Let Y=0 denote “not diseased” and Y=1 denote diseased. Each action and each true diagnostic status will have associated with it a cost or loss. Define these losses by the following table.

Action   Y   Loss
A        1   a
A        0   b
B        1   c
B        0   d

For example, the loss from taking action A if disease is present is a. What is the expected loss since we don’t know the true disease status? If the patient’s risk of disease is r, the expected loss is ra + (1-r)b from taking action A. Likewise the expected loss from taking action B is rc + (1-r)d. One way to optimize the decision is to choose the action from (A,B) that gives the lower expected loss. Thus we take action A if ra + (1-r)b < rc + (1-r)d and otherwise take action B. Rearranging gives r(a-c+d-b) < d-b, so when a-c+d-b > 0 we take action A if r < (d-b)/(a-c+d-b). If action B is the more aggressive one such that the loss is zero if the patient is ultimately diagnosed with the disease, one might take c and b to be zero. In that special case the risk threshold is d/(a+d).
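A minimal sketch of this decision rule, for readers who prefer code to algebra. The function names and the numeric losses are illustrative assumptions and do not come from any study; A is taken to be the less aggressive action and B the more aggressive one, as in the special case above.

```python
def expected_loss(r, loss_if_diseased, loss_if_not):
    """Expected loss of one action for a patient whose disease risk is r."""
    return r * loss_if_diseased + (1 - r) * loss_if_not


def risk_threshold(a, b, c, d):
    """Risk at which actions A and B have equal expected loss.

    Assumes a - c + d - b > 0, so that A is preferred below the threshold.
    """
    denom = a - c + d - b
    if denom <= 0:
        raise ValueError("threshold rule assumes a - c + d - b > 0")
    return (d - b) / denom


def choose_action(r, a, b, c, d):
    """Return 'A' or 'B', whichever has the lower expected loss."""
    loss_a = expected_loss(r, a, b)
    loss_b = expected_loss(r, c, d)
    return "A" if loss_a < loss_b else "B"


# Hypothetical losses (arbitrary units) with b = c = 0.
a, b, c, d = 10.0, 0.0, 0.0, 1.0
threshold = risk_threshold(a, b, c, d)  # equals d / (a + d) here
print(threshold)
print(choose_action(0.05, a, b, c, d), choose_action(0.20, a, b, c, d))
```

Note that nothing in the calculation refers to sens, spec, or an ROC curve; the only inputs are the risk and the four losses.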

Note: This formulation is fairly general, subject to risk being estimated without error. The risk estimate r doesn’t care whether it is dominated by risk factors or by the results of the current medical test; nor does it care whether the test has a binary vs. a continuous output, multiple outputs, or whether the test results interact with age or sex. Contrast that with sens/spec which assume a binary diagnosis, binary test output, no interaction between test results and other patient variables, and constancy of sens and spec over patient types (the latter is provably untrue in general when the diagnosis was created by dichotomania). When there is uncertainty in r, this uncertainty needs to be used in averaging losses to obtain an optimum Bayes decision that minimizes expected loss. This is discussed in BBR Chapter 19.
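To make the last point concrete, here is a rough sketch of averaging losses over uncertainty in r. Everything numeric is an assumption for illustration: the posterior for the log odds of disease is taken to be normal with a made-up mean and standard deviation, and it is represented by a deterministic grid of quantiles rather than by real posterior draws.

```python
from math import exp
from statistics import NormalDist, fmean


def expit(x):
    """Inverse logit: convert log odds to a probability."""
    return 1.0 / (1.0 + exp(-x))


def posterior_risk_draws(mu, sd, n=2000):
    """Deterministic stand-in for posterior draws of risk.

    Hypothetical assumption: the log odds have a Normal(mu, sd) posterior,
    evaluated here at n equally spaced quantiles.
    """
    post = NormalDist(mu, sd)
    return [expit(post.inv_cdf((i + 0.5) / n)) for i in range(n)]


def bayes_expected_loss(draws, loss_if_diseased, loss_if_not):
    """Average the loss of one action over the posterior draws of risk."""
    return fmean(r * loss_if_diseased + (1 - r) * loss_if_not for r in draws)


a, b, c, d = 10.0, 0.0, 0.0, 1.0           # hypothetical losses
draws = posterior_risk_draws(mu=-2.5, sd=1.2)

plug_in_risk = expit(-2.5)                 # ignores uncertainty
mean_risk = fmean(draws)                   # posterior mean risk
loss_a = bayes_expected_loss(draws, a, b)
loss_b = bayes_expected_loss(draws, c, d)
bayes_action = "A" if loss_a < loss_b else "B"
plug_in_action = "A" if plug_in_risk < d / (a + d) else "B"
print(plug_in_risk, mean_risk, plug_in_action, bayes_action)
```

Because the losses in the table are linear in r, averaging over the posterior amounts to using the posterior mean risk. With these made-up numbers the plug-in estimate falls below the threshold while the posterior mean falls above it, so the two approaches pick different actions.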

Here’s an analogy concerning the value of forward-looking decisions and probabilities: an optimum decision in a poker game is based on the possible winnings/losings and on the probability of winning the hand. It doesn’t help the player to envision the probability of getting a hand this good were she to go on and win or lose (spec/sens). Decision making is forward in time and information flow, and needs to use forward probabilities (unless you are an archaeologist or a medical examiner).

This discussion is related to the way that medical students are wrongly taught probabilistic diagnosis. I’ve seen MDs and statisticians alike, when given all the numbers needed to compute P(disease | X) still compute sens and spec from a cohort study and then use Bayes’ rule to get P(disease | X). They don’t realize that everything cancels out (3 left turns) leaving the originally-trivially-derivable direct estimate of disease risk (right turn). I think that another point confused in the literature is the difference between group decision making and individual patient decision making (I’m only interested in the latter).
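The cancellation can be checked with a small sketch. The cohort counts below are invented for illustration, and the function names are my own.

```python
def direct_risk(tp, fp):
    """P(disease | test positive), read straight off the cohort (the right turn)."""
    return tp / (tp + fp)


def risk_via_bayes(tp, fp, fn, tn):
    """The same quantity via sens, spec, and prevalence (the 3 left turns)."""
    n = tp + fp + fn + tn
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    prev = (tp + fn) / n
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))


# Hypothetical cohort: true/false positives and negatives.
tp, fp, fn, tn = 90, 160, 10, 740
direct = direct_risk(tp, fp)
roundabout = risk_via_bayes(tp, fp, fn, tn)
print(direct, roundabout)
```

Both routes give the same number because sens, spec, and prevalence were all computed from the same four counts, so the detour through Bayes’ rule adds arithmetic but no information.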

Now for a real kicker: Anyone using sens and spec in formulating decision rules is simply wrong if they consider sens and spec to be constant over patient types. Using sens and spec, when doing so accurately, only adds complexity because we know that sens and spec vary over patients so to be accurate you have to incorporate probability models for sens and spec that are functions of patient characteristics. Details about this are in the BBR diagnosis chapter. Simple explanation: any disease that is not binary will be easier to detect when it is more severe. Any patient characteristic associated with severity of disease will be associated with test sensitivity.
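A toy calculation shows how this plays out. All numbers are hypothetical: the detection probabilities per severity level and the severity mix in each age group are invented purely to illustrate the mechanism.

```python
# Hypothetical probability that the test is positive at each severity level.
p_positive = {"mild": 0.40, "moderate": 0.70, "severe": 0.95}

# Hypothetical severity mix among diseased patients in two patient types.
severity_mix = {
    "age < 50": {"mild": 0.60, "moderate": 0.30, "severe": 0.10},
    "age >= 50": {"mild": 0.20, "moderate": 0.40, "severe": 0.40},
}


def sensitivity(mix):
    """P(test + | diseased) for a group: detection averaged over its severity mix."""
    return sum(share * p_positive[level] for level, share in mix.items())


sens_by_group = {group: sensitivity(mix) for group, mix in severity_mix.items()}
print(sens_by_group)
```

The test itself behaves identically in both groups at any given severity, yet the sensitivity differs between them because age is associated with severity. A single published sensitivity would be wrong for at least one of the groups.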

I’d be interested in any demonstration that a backward-information-flow probability (including an ROC curve whose points are constructed from backward-time probabilities) is necessary for making optimum medical decisions for individual patients.

Some useful papers on medical decision making are here.


Your writing (and others’) on this topic has convinced me of the utility of estimating risk instead of classification metrics, even more so now as a medical student while being constantly bombarded with the difficulty of applying classification metrics when assessing a patient.

However, the one argument I am constantly presented with when discussing these issues with clinicians is the screening -> diagnosis problem, i.e., using a screening test to detect possible latent disease then following up with a more definitive test. This approach has become routine in healthcare, and is even used to teach sens/spec in my medical school. Although the principle makes sense to me, it seems to me that this approach is based on the dogma that sens/spec are the correct way to evaluate measures instead of the clinical necessity. If we are instead guided by the idea that we want to limit over-testing and maximize early detection, then assessing for risk factors and applying the more rigorous test to those at higher risk (however the clinician defines this) makes perfect sense and is better supported by the prediction approach you describe.

Up to this point I think clinicians follow and understand the justifications. Where I’ve found that I start losing them and they resort back to classification is in the translation of these ideas through the research study. Recognizing the practicalities of clinical practice, we don’t have good prediction models for every disease, nor the universal means/will to calculate risk for each patient. My question then becomes how do we convey the greater information acquired from the prediction approach in a pragmatic way that can be used by clinicians? My only answer so far is to draw the line generated from a prediction model, representing probabilities on the y-axis and a significant continuous variable or a score on the x-axis. Factor variables can be represented as different lines as shown in this example comparing two scores for mortality prediction (dotted line represents observed probability, solid represents predicted probability, grey ribbon represents 95% confidence intervals for observations).

Similarly, multiple plots with different continuous variables, or additional lines generated for an example patient could be created. But translating this to clinical practice still requires clinicians to have a good grasp of how to read figures such as this, then determine what the appropriate decision threshold is for their patient and their setting. This seems like a significant barrier without a dramatic shift in training during medical school. If there are alternatives that can be pragmatically applied, I would be thrilled to see them.
Having benefited from the shift in thinking that accompanies an understanding of probabilistic decision making, I am absolutely a proponent of this change in medical education. But as with many significant changes in thinking, I suspect progress will be slow. As an epidemiologist, I will advocate for better practice in research studies as I believe the change will need to start with researchers, then hopefully medical education will follow suit as the way evidence is presented begins to shift.


It’s so good to see the way you are thinking this through, Dan. Would you please break down your thoughts on the key tasks involved with the clinical working that you are alluding to? Is it the task of understanding how each clinical feature/risk factor relates to risk? With adjustment or without adjustment for other factors? Is it the task of getting the overall current best risk estimate? Something else?

Overall the key task is to make some clinical decision based on information gathered. As researchers I think we need to provide clinicians with the maximum amount of information possible to support these decisions in different contexts, and with different patients. The key challenge with this though, as I discuss above, is to make the information we provide both pragmatic and digestible so that it will actually be used in practice.

I often think of this in the context of acute care diagnosis as I think it is easiest to conceptualize. So:

  1. Starting with a patient having some unknown problem, we start with a history/physical to begin narrowing down the organ systems that might be involved (beginning of information gathering).
  2. We order some tests to help isolate the problem further.
  3. Based off these results we might order more tests, or perform some treatment intervention.
    Each of these steps builds on the previous, and the consequences of building on wrong information/assumptions at each step also grow. Therefore, the way I see the diagnosis problem is to use probabilities from each step to select the necessary actions for the next step, based on either i) what information you need to identify the most likely disease given the information you already have, or ii) which intervention carries the lowest risk given the patient’s expected need (patients with higher risk for a bad outcome likely require a more risky/aggressive intervention).
    I think all risk estimates should have adjustment, at minimum for age and sex, to improve generalizability and help clinicians better understand the risk for the patient in front of them. Adjustment seems to make a big difference for some diseases, and less for others.

As researchers I think our job is to identify the measures that provide the most information to clinicians about a patient’s risk for clinically relevant outcomes (likely identified by clinicians) at each step, and in the context of the clinical presentation. But while in an ideal world we would do this for all information available, this is neither realistic nor digestible. So then to remain pragmatic, I think we need to identify the highest information measures, then let clinicians fill in the gaps using their experience. This will inevitably not be the “best” estimate, but hopefully better than what sensitivity/specificity could provide.


Great discussion, and I can’t argue that when faced with the application of a single model based on observational data at a single point in time on a single patient, sensitivity, specificity, etc., will not be helpful.

However, almost never in healthcare is a patient managed like that. Big decisions, such as starting a medication or having a procedure, are almost always tied to results from an RCT, for which (like it or not) study entry is binary (you either meet entry criteria or you do not). Some (myself included) have entertained the idea of using results from an observational model, such as a polygenic risk score, for determining this enrollment. In this case, risk must be binarized, otherwise how do you decide whom to enroll?

Regarding the (overused, I believe) approaches for modeling mortality in hospitalized patients; frankly, this just isn’t done in practice. At no point on ICU rounds do we sit down and determine the mortality risk for every patient, unless it is tied to a future discussion about an upcoming procedure or goals of care (deciding on whether to pursue dialysis in a nonagenarian on multiple pressors). Rather, we focus on collecting more data and making small adjustments based on physiology and pathophysiology toward proximal goals, such as increasing the blood pressure, improving cardiac output, or improving urine output. I do think there may be a role for quantifying this process, although I would argue that it will find more success in reinforcement learning approaches, which are only starting to be tested in a healthcare setting.


I remember about 10-20 yrs ago we got an ECG machine that incorporated the ACI-TIPI rule for ACS. In addition to three factors recorded by the ECG machine, you entered four clinical factors, and got a “predictive value”, rather than a sensitivity or specificity. The ECG machine actually printed the estimated probability of ACS on the ECG itself, and it was always alarmingly high. Where we worked this was not helpful because like most inner city public hospital/trauma centers, the prior prob of ACS was pretty low, unlike the hospitals in the Northeast where the rule was derived and validated.

There is some appeal in the sn/sp approach, even when it is not well understood: Doctors tend to think in terms of discrete diseases, which they traditionally spend the entire second year of medical school listing & memorizing (eg 80% of RMSF patients have a headache). Sensitivity is a quantitative description of a disease, tho possibly of little use in discriminating between dz and non-dz. But together, sens&spec to my mind are the “weight” or “yield” of the test, useful for comparing with other tests (eg choose between CT vs cath vs hospital admission, my main task in the ER). Similar tests may involve different degrees of cost or risk, or outcome utilities vary: all part of intuitive decision making. This “yield” is much more transportable across the medical enterprise (Northeast vs South, office practice, ERs, specialty practice, ICU, urgent care centers, retail clinics, public hospitals, private clinics, etc) than the model-generated posterior probability. Understanding the base rate shift (the “inverted probability” problem) and the concepts of spectrum and verification bias are a couple of fundamental modifying principles that every doc really ought to be taught. AUROC is pretty irrelevant too: useful decision thresholds occur over a limited range of the curve, depending on the clinical circumstance.

@Michael.a.rosenberg I appreciate your comments about the overuse of modelling mortality in hospitalized patients, but one area where I can see value is in using these measures as an assessment of illness severity to guide decision making. Especially in the ICU where the risk of mortality is so high (by design) I could imagine that any measure that indicates a higher risk for mortality could be a stronger indication for aggressive versus supportive treatment.
I think your point about using RCT data to inform big decisions is an important one. But while the original study has more restrictive inclusion criteria by design, I still wonder if there is value in modelling risk of similar patients to help clinicians treat similar patients who may have not met the original inclusion criteria better.

@srpitts This ECG example is really interesting to me. This might be a good illustration of the issue I originally commented on and which you also bring up, specifically how to use this information. While the specific probability estimates you were seeing seem like they were not well calibrated, an understanding of the predictive value of the ECG algorithm still seems valuable to me; however, it may be more valuable in the context of choosing which test to order next instead of for the specific patient. This might be thought of as similar to the weighting or yield of the test as you mention. I also have had the experience of memorizing what test should be used for which context in the evolution of diagnosing some pathology, so I wonder if a better approach to describing this is possible using a predictive approach. For example, I have tried to quantify this in studies using measures like the Gini coefficient.
You touch on another key concept I have also noticed, that doctors tend to think in terms of discrete diseases. While we know this is what patients and clinicians expect, my experience so far has been that a discrete disease rarely exists - even just by considering comorbidities. So this brings me back to my original question - how do we best convey the most useful information to clinicians, in the context of complicated clinical presentations, while conserving the maximum amount of information possible?

Models have to be useful, and for that they have to be goal-oriented. Therefore, the objective is to understand what is not known about a patient, then design a model that tells you what you don’t know. But if you want your model to be useful, development has to be coupled to decision making. This does not happen on many occasions, and it is something that statisticians/mathematicians are not able to grasp without clinical feedback, but my experience is that clinicians themselves sometimes do not understand the problem either. Take, for example, the MASCC score for predicting complications in febrile neutropenia. Febrile neutropenia is a condition in which there are two types of patients: those apparently stable, whose clinical course is unknown, as opposed to the very critical, who require urgent treatment or die. The question is not whether you have to intensify therapy for critical patients, which is obvious. The problem is how to individualise the intensity of treatment in non-critical patients. The MASCC score assesses them all equally, although the approach to decision-making is not similar at all, which generates contradictions. You may end up reclassifying critically ill patients as low-risk. When all patients are evaluated in a mixed cohort, the sensitivity of the model is between 60-70%, but when only the apparently stable ones are evaluated, which are the ones that really need the prognostic information, the model has a sensitivity of less than 30% (i.e., it fails to predict 70% of complications in stable patients who could be sent home for ambulatory treatment). Therefore, yes, I agree that sensitivity, specificity and area under the curve are not the only requirements to evaluate prognostic models, but I also believe that the issue is not only mathematical, but clinical.

What you wrote raises a lot of great issues. A general comment is that in diagnostic research we too seldom involve patients in setting utilities for various actions. I recommend a three-phase approach:

  1. Invest the time and money in eliciting utilities (my preference is the time tradeoff variety) for various actions and outcomes, from non-diseased individuals who may someday be at risk for the conditions under study
  2. Combine these utilities with validated risk estimates to be able to solve for actions that maximize expected utility
  3. Apply what is learned to real-time decision making. If this isn’t feasible, develop an approximation to the ideal solution that can be used routinely.

The last step isn’t obvious but may motivate someone to find a solution for a particular clinical problem and patient population.

There is another subtlety to what you wrote: the correct implication that the initial diagnostic workup may be too high-dimensional to be able to have the probabilities and utilities you need for optimum decisions, and that you need heuristics to narrow things down. This leads to ideas such as ranking diagnoses from most likely to least likely to be true. The problem with that heuristic is that if the probability that the most likely diagnosis holds is less than, say, 0.9, the ranking process will hide uncertainties that are so massive that it will cause downstream errors. So it’s hard to avoid formalizing risk estimation.

Michael you are not alone in making this leap of logic. For reasons detailed here, this logic doesn’t work. This goes back to experimental design in agricultural experiments, which taught us how to design RCTs. In neither agricultural nor medical situations does the worth of the treatment estimate come from representativeness of objects (whether plots of land or patients). I think you are mixing two ideas. The RCT is designed to provide relative therapeutic effectiveness estimates that are applicable way beyond the study’s inclusion criteria, then we combine the relative estimate with absolute risk estimates to get absolute risk reduction estimates for various entertained treatment options. The risk estimates may come from the RCT itself if the inclusion criteria were fairly broad, or slightly more commonly, from a diverse observational cohort.
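As a rough sketch of that combination step: suppose, purely as an assumption for illustration, that the RCT's relative effect is summarized as a single odds ratio with no treatment interactions, and that baseline risks come from a separate risk model. The odds ratio and the risks below are invented.

```python
def treated_risk(baseline_risk, odds_ratio):
    """Apply a treatment odds ratio to a patient's baseline (untreated) risk."""
    odds = baseline_risk / (1 - baseline_risk)
    treated_odds = odds * odds_ratio
    return treated_odds / (1 + treated_odds)


def absolute_risk_reduction(baseline_risk, odds_ratio):
    """Baseline risk minus risk under treatment."""
    return baseline_risk - treated_risk(baseline_risk, odds_ratio)


odds_ratio = 0.7                   # hypothetical relative effect from an RCT
baseline_risks = (0.02, 0.10, 0.40)  # hypothetical risks from a risk model
arr = [absolute_risk_reduction(r, odds_ratio) for r in baseline_risks]
for r, reduction in zip(baseline_risks, arr):
    print(r, round(reduction, 4))
```

The same relative effect translates into quite different absolute benefits depending on the patient's baseline risk, which is why the relative estimate and the risk estimate are obtained separately and then combined.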

I believe that backwards-time backwards-information-flow probabilities only give the illusion of usefulness, and you can accomplish all that using only predictive mode forward thinking. [Note: Had you admitted that sens and spec varied significantly by patient type I would have been more impressed with this approach.] Sens and spec get medical students thinking indirectly and inefficiently IMHO. But you are quite right that weight or yield of a test is a useful concept. The weights can come solely from a predictive mode rather than a retrospective one though. To elaborate, many experts in medical decision making rightfully believe that diagnostic likelihood ratios (LR) are more helpful than sens/spec. The LR+ is the factor that moves a patient from an “uncertain” pre-test risk to a post-positive-test risk (this is making the huge assumption that the test is binary). LR- is the factor that moves from an initial state of uncertainty to a risk were the test negative. If instead of moving from a rather poorly defined initial state to a +/- state one were to think of predicted risk and odds ratios, things are better defined, and a beautiful thing happens: the odds ratio for a binary test in a logistic model is the ratio of LR+ to LR-. So the odds ratio may be the weight you seek. One more thing about LRs. The initial point of uncertainty, usually thought of as background disease prevalence, is actually not very well defined. Instead of using a notion of prevalence, a logistic risk model uses a covariate-specific starting point that is easy to comprehend. An odds ratio moves you from a particular starting point (e.g., age=0 or no risk factors present) to the odds of disease given a degree of positivity of a test.

One error we’re making in medical education is the implication that disease prevalence (for use in computing the base rate shift you speak of) is well-defined and is an unconditional probability. Neither is true. Prevalence gets in the way of understanding as much as it helps. More useful would be an anchor such as “The probability of dz Y for a minimally symptomatic 30 year old male is …”. This is the baseline covariate way of thinking.

I have trouble seeing why retrospective quantities are relevant to this discussion. It seems to me that what the clinical situation is crying out for is:

  1. Having expert clinicians and biostatisticians fully collaborate to develop a predictive-mode prospective cohort-based outcome model that incorporates clinical course up until the current time
  2. Make sure the model contains any clinically-sensitive interactions with treatment
  3. Combine with an RCT that estimated relative efficacy if possible
  4. Combine all this with patient utilities
  5. Don’t use labels like “low risk”. “Low” is in the eyes of the beholder, and ignores patient utilities.
  6. When we can’t elicit patient utilities ahead of time (using normal volunteers, typically) or in real-time from the actual patient or family involved, get utilities (ahead of time) from a panel of medical experts.

Appreciate your comments, which got me thinking a bit more about a topic I really love, because ER docs are the ultimate probability workers, seeing a steady stream of almost undifferentiated patients (septic shock, then vaginal bleed, then persistent cough, then opioid user w a new pain). Sometimes we are so ignorant that the “prior probability” of a given disease is exactly 0.5, but mostly we have a vague idea (“we never see dz X here”). This is indeed an anchor, but it is not based on a covariate vector, or if so, then a very primitive one. A simple rule that most students find completely intuitive is the mnemonic: SNOUT (sensitivity rules out), important when your main job is “not missing a serious case”, like pulmonary embolus. And the obverse is SPIN (“specificity rules in”), also useful in ER when you can “rule in” a trivial self-limited illness like shingles. Even the LRP and LRN may not be worth the additional trouble of looking up or calculating the ratio, and the diagnostic odds ratio is definitely a bridge too far, because it is even harder to interpret, and is only a measure of discrimination rather than calibration. I don’t think clinicians are fooled into taking “spin” and “snout” too seriously. The rule is truly a “heuristic” to be used when you don’t have time to look something up or consult a validated model, i.e. not a formal procedure. The good news is that ER docs as a group more and more consult formal models on-line, on the wonderful website “MDCalc”.

Couldn’t agree more with @albertoca about the need for any quantitative approach to be directly tied to a specific question in clinical context. This was my point about prediction of mortality in hospitalized patients; we just don’t think that broadly as clinicians when evaluating a patient, even the critically ill ones. Management is almost always focused on a single or handful of decisions aimed at accomplishing near-term goals, usually based on understanding of the biological process felt to be relevant at that time. I won’t disagree that there is some level of latent ‘gut feeling’ about how sick a patient is, which sometimes guides how aggressive we might need to be in terms of interventions, but this entity is poorly understood on an objective level (i.e., Malcolm Gladwell’s Blink phenomenon), and thus attempts by outside analysts and statisticians to quantify it tend to be overly simplistic, and as @srpitts notes often to a comical degree.

Regarding RCTs, I don’t disagree that the principal reason for conducting an RCT is to use randomness as a surrogate for counterfactual exposure to establish existence and degree of causality. However, like the simplistic SNOUT/SPIN interpretation of sensitivity and specificity for tests run in the ER, there is utility in a simplistic (binary) interpretation of results of an RCT, especially when combined with clinical experience and understanding of physiology and pathophysiology to guide a specific decision for a specific patient. This isn’t to say that this application is the global optimum for evidence-based medicine, but if there’s a better way to use data to guide efficient decision-making, it certainly hasn’t made its presence known to the greater medical community. If you don’t believe me, next time you’re in the doctor’s office, ask them to explain how they use sensitivity and specificity in their decisions. Chances are they can’t do better than SNOUT/SPIN; yet, I imagine you still listen to what they might tell you regarding your health…

My challenge to data analysts out there who think they know how we should be using data better in our day-to-day clinical decision-making is to come join us, spend some time in the clinic and see how practical your methods are in application. I’m by no means making the argument that the process can’t or shouldn’t be improved; in fact, I’m working myself from the other side to find ways to bring data into our regular clinical decisions (hence my presence on this forum). But it turns out that humans tend to be pretty good at processing a lot of information in making decisions, and most attempts to use complicated models for that process fall flat on their faces in real-world application. The QWERTY keyboard is not the most efficient method for typing words into a processor, but the barrier to improvement is high. The stakes in clinical decision-making, where lives are at risk and there is a zero error tolerance, are even higher.


Should this be a conditional P(disease X | … ) ?

Yes, sorry - am fixing now.

For the example from BBR, just considering gender

P(ECG + | CAD +, M) != P(ECG + | CAD +, F)

is this what’s meant by sensitivity “varying over patient types”?

And if

P(ECG + | CAD +, M) = P(ECG + | CAD +, F)

then sensitivity would be considered “constant” over patient types?

Because also

P(CAD + | ECG +, M) != P(CAD + | ECG +, F)

So both sensitivity and the posterior probability are not constant over gender. Is your argument that the posterior is amenable to conditioning on gender, whereas sensitivity, by definition, conditions only on disease status (and that although sensitivity technically could condition on other things, no one does that)?


Yes, well written. Researchers and practicing clinicians have been wrongly taught to assume that sens and spec are constants, i.e., are properties only of the test. Sens and spec would have been useful had they actually been unifying/simplifying constants.

On the other hand, we are taught from the get go that probabilities of disease naturally vary with age, sex, symptoms, etc. And binary and ordinal logistic models handle this automatically with no new developments or complexities needed. Regression models also allow the possibility for unifying constants. When a diagnostic test does not interact with age, sex, symptoms, etc., the regression coefficients for the test apply to all patient types. When the test is binary (almost never happens but is routinely pretended), this is the +:- test odds ratio.
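A small sketch of what "the regression coefficients for the test apply to all patient types" means in practice. The coefficients below are invented, not fitted to any data, and the model is assumed to have no test-by-covariate interactions.

```python
from math import exp, log


def expit(x):
    """Inverse logit: convert log odds to a probability."""
    return 1.0 / (1.0 + exp(-x))


# Hypothetical logistic model coefficients (illustration only).
INTERCEPT = -5.0
B_AGE = 0.05        # per year of age
B_MALE = 0.6
B_TEST = log(4.0)   # binary test with odds ratio 4, no interactions


def risk(age, male, test_positive):
    """P(disease | age, sex, test result) under the hypothetical model."""
    lp = INTERCEPT + B_AGE * age + B_MALE * male + B_TEST * test_positive
    return expit(lp)


def odds(p):
    return p / (1 - p)


results = {}
for age, male in ((40, 0), (70, 1)):
    pre = risk(age, male, 0)    # covariate-specific starting point
    post = risk(age, male, 1)   # risk after a positive test
    results[(age, male)] = (pre, post, odds(post) / odds(pre))
    print(age, male, round(pre, 3), round(post, 3))
```

The starting risk and the post-test risk differ between the two patients, but the odds ratio linking them is the same constant in both, which is the unifying quantity that sens and spec were supposed to be.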

To summarize some of the problems with using backwards-time probabilities sens and spec:

  • They are functions of patient characteristics and not constants. This is especially true for sensitivity when more advanced disease is easier to detect but the clinician dumbed down the problem by pretending the disease is binary.
  • Sens and spec are modified by workup/verification bias, e.g., when males are more likely to not get a final diagnosis than females. Applying complex corrections to sens and spec for workup bias and then using Bayes’ rule to get P(disease | X) makes the complex corrections cancel out.

On the last point, starting with the goal of estimating disease probability rather than estimating sens and spec saves time and lowers complexity, while being more intuitive. A win-win-win.


I absolutely agree that diagnostic test research should be focussed on estimating the individual patient prevalence of disease, conditioned on patient characteristics and tests. Where I still see the challenge is adequately translating this information to clinicians.
In my own training, I have noticed that many of my colleagues and I have mostly been learning which test to order, i.e., differentiating a good test from a bad test, or a screening test from a diagnostic test. Although simplified, I suspect that this is one objective we should keep in mind for these studies: validate discrimination and calibration, and then, as @f2harrell recommended, use a likelihood ratio or AIC to compare the full regression models.
If calibration is presented as I showed above, with probability on the y-axis and the measure/score on the x-axis, then this also allows clinicians to get an estimate of the probability of the outcome for different patients. Again I suspect they will (as I do) remember discrete thresholds and how they will manage a patient differently at these thresholds, but at least these decisions are based on the probabilities generated from a full model, not a single point like sens/spec, and it is then the clinician taking into account expected utility in their setting.
I recognize that there is a trade-off between entering too many characteristics into the model (i.e. every clinical measure or physical exam finding), therefore making it too specific and dependent on all measures to be accurate, versus entering fewer measures so it is more generalizable and therefore sacrificing accuracy. My approach to navigating this trade-off so far has been to use the information likely to be available to the clinician at the time of applying the test (which is minimal for me since I focus on EMS/prehospital). Then if we are focussing on the slope of the line (i.e. odds ratio) as discussed above instead of just the absolute estimates, we are left with a good assessment of how much weight the measure should provide to our decision making.


Well you start off with nothing. Then you get more information, like gender, (analogous to first probability revision) which you build in to your initial mental picture = P(Dz|Findings in first 5 minutes). The important “findings” worthy of more formal study and quantitative analysis are 1) decision rules, and 2) single findings that have potentially giant dx odds ratios (eg febrile pt is from Congo in 2019), or are risky (eg central venous pressure), or expensive (imaging, lab). These are the tests that sometimes get published with LRs, making the assumption that you use them wisely, and often in tables - when there is a lot of heterogeneity, eg by sex. Not unlike the HR in an RCT done in a population not like your patient. I’ll grant you that the idea of tabulating compendia of LRs as if they were constant properties of a given test/dz couple has faltered in the commercial sense: I think that the book “Decision making in imaging” is probably out of print, and certainly out of date, but mainly because of the march of progress, i.e. the moving target of technology assessment in the modern era.

It’s important to distinguish between sensitivity/specificity and predictive value. Sens/spec have a role, but in screening, where the objective is to winnow out the people with the problem from a clinical population. However, if you are dealing with decisions about individual patients, then they are not a great deal of help.

For example, nuchal translucency scanning in pregnancy was only 72% sensitive and 96% specific. The positive predictive value was just 5·3%. But here’s the thing: the negative predictive value was 99·9%. Because it’s non-invasive, it can winnow down your original population by about 95%. The remaining 5% need to be reassured that most positive NT scans are false positives, and some of them will certainly go forward for more invasive testing. However, the value of the test lies in its ability to give a high degree of reassurance to those 95% of couples with negative scans.
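The arithmetic behind these figures can be checked with a short sketch. The prevalence is not stated above, so the code back-solves it from the quoted sensitivity, specificity, and positive predictive value using Bayes' rule; that implied prevalence (roughly 0.3%) is an inference from the quoted numbers, not a reported value, and the function names are my own.

```python
def prevalence_from_ppv(sens, spec, ppv):
    """Solve PPV = sens*p / (sens*p + (1-spec)*(1-p)) for the prevalence p."""
    fp_rate = 1.0 - spec
    return ppv * fp_rate / (sens * (1.0 - ppv) + ppv * fp_rate)

def predictive_values(sens, spec, prev):
    """Return (PPV, NPV, fraction testing negative) from Bayes' rule."""
    true_pos = sens * prev
    false_pos = (1.0 - spec) * (1.0 - prev)
    true_neg = spec * (1.0 - prev)
    false_neg = (1.0 - sens) * prev
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    frac_negative = true_neg + false_neg
    return ppv, npv, frac_negative

# Figures quoted in the post above.
SENS, SPEC, PPV = 0.72, 0.96, 0.053

# Implied prevalence (an inference from the quoted figures, not a reported value).
implied_prev = prevalence_from_ppv(SENS, SPEC, PPV)

ppv, npv, frac_negative = predictive_values(SENS, SPEC, implied_prev)
print(f"implied prevalence: {implied_prev:.4f}")
print(f"PPV: {ppv:.3f}  NPV: {npv:.3f}  fraction negative: {frac_negative:.3f}")
```

Under these assumptions the NPV comes out at about 99.9% and a little under 96% of scans are negative, in line with the figures quoted. Note how much of the high NPV comes from the low prevalence: before any testing, the probability of no disease is already about 99.7%.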

So perhaps sens/spec are getting criticised for doing a job poorly that they are not meant to do at all.

It’s also worth noting that statisticians operate on what Gerd Gigerenzer calls “The God Model” where all relevant information is available simultaneously and there are computational resources to fit it into a prediction model. In real life, information is generally available sequentially. A patient presenting with chest pain will present with information about age and sex right away, and getting an ECG done is quick. If the ECG shows an MI, then treatment must be initiated immediately, without waiting to see what the cardiac enzymes look like. So clinical decision making tends to operate on a frugal tree model, where decisions are made by examining the predictive factors one at a time.

In this respect, what many statisticians need to learn is how clinical decisions are made. Otherwise we will build endless ‘hobby’ models that will never really impact on clinical practice. I say this rather gloomily, reflecting that after 50 years of functions to estimate cardiovascular risk, a very large proportion of people are being managed ‘sub-optimally’ (a medical term indicating the sort of train wreck that you shouldn’t discuss aloud in front of the patient or their grieving relatives).


Well said Ronan. On the possible use of sens and spec for screening, I have doubts even there. I think that pre- and post-test probabilities of disease are simpler concepts and more directly actionable.