Discrimination, calibration and other metrics: can they tell us about the clinical value of a model?

Prediction models are sometimes developed to give overall prognostic information to a patient (“what are my chances, doc?”) or to inform a very wide range of possible medical decisions. Perhaps more commonly, prediction models are developed to inform either a single decision or a small number of related decisions. For instance, the Framingham model is used to inform whether a patient should be given treatment (or more intensive monitoring) to prevent a cardiovascular event; the Gail model is used to determine chemoprevention for breast cancer; various “risk calculators” have been developed in prostate cancer to help men decide whether or not to have a biopsy.

When we evaluate models, we want to be able to tell doctors and patients whether they should use that model in clinical practice. Because we are talking about a decision – use the model or don’t use the model – it is most natural to use decision-analytic methods. However, some have claimed that decision-analytic methods are not required, and that traditional statistical metrics (such as the AUC, calibration plots and the like) can be used to determine whether to use a model in clinical practice. There are situations where this is undoubtedly true: for instance, we would disfavor a model with a low AUC and gross miscalibration. But such cases are likely the exception rather than the rule: I propose that, in most cases, accuracy metrics are unable to inform doctors and patients as to whether they should use a model.

I have created five typical, indeed rather mundane, scenarios. In each, I describe the clinical use of the model (e.g. to predict risk of prostate cancer to inform whether to have a biopsy) and give performance characteristics (e.g. AUC). The challenge in each case is to answer whether the model should be used in practice based on those performance characteristics. The answer must be in a form that would allow a second party to agree or disagree, replicably, for specific reasons. So, for instance, “you should use the model because yes, there is miscalibration, but it doesn’t look too bad” would be unacceptable, because no criteria are given for judging the degree of miscalibration.

I will start with just one scenario and expand from there. If, as I suspect, no one is able to answer the question of the clinical value of the models based on accuracy metrics, I will show the decision curves and conclude that only decision-analytic methods can address questions of model value.

Scenario 1
Two models have been proposed to select patients at high risk for aggressive prostate cancer amongst those with elevated PSA. These two models are evaluated on an independent data set. Should either of these models be used to advise men on prostate biopsy? If so, is one preferable?

Model A
AUC: 0.819
Pseudo R2: 0.2065
Root MSE: 0.36239
Variance of phat: 0.0447587
Brier score: 0.1371

[Calibration plot for Model A]
[Distribution of predicted risks for Model A]

Model B
AUC: 0.7934
Pseudo R2: 0.2049
Root MSE: 0.36759
Variance of phat: 0.0196621
Brier score: 0.1420

[Calibration plot for Model B]
[Distribution of predicted risks for Model B]
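
For readers less familiar with these summary statistics, they can all be computed directly from a vector of predicted risks and observed outcomes. A minimal R sketch, assuming hypothetical vectors y (0/1 outcomes) and phat (predicted probabilities); this just shows what a few of the numbers above summarize, not how they were generated:

```r
# Hypothetical data: y = observed outcome (0/1), phat = model's predicted risk

# Brier score: mean squared difference between predicted risk and outcome
brier <- mean((phat - y)^2)

# Variance of the predicted risks (a crude measure of how "spread out" they are)
var_phat <- var(phat)

# AUC / c-index: probability that a randomly chosen event has a higher
# predicted risk than a randomly chosen non-event (ties count one half)
auc <- (mean(rank(phat)[y == 1]) - (sum(y) + 1) / 2) / sum(y == 0)
```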


This exercise might have been easier to answer had I not recently read your papers on this topic, because I’m tempted to just ask for those decision curves. I’m new to this material, but interested in learning more about decision curves (and also in using them in an upcoming project). In the spirit of the exercise, I tried to make a decision on the models’ usefulness based on just these metrics and calibration plots:

The main problem is that we need to decide a threshold for what counts as positive according to the model(s), i.e. the predicted risk above which we would refer people for biopsy. Model A predicts comparatively low probabilities for a large part of the population: up to 80% of individuals have a predicted probability < 0.20, while the actual probability for the highest-risk individuals in this group is around 0.30-0.40 (which I think is quite high for aggressive cancer).
The predictions from model B are somewhat more ‘spread out’, and overall model B predicts higher risks for more of the higher-risk individuals (taking 0.20 as the threshold value, model B assigns a risk of > 0.20 to about 30% of the population, compared to about 20% for model A).
From this you can already see that I’m somewhat arbitrarily putting the threshold at a predicted risk of around 0.20, and then I’d choose model B. I think the fact that model B underpredicts risks above 0.20 quite strongly is relatively less important, because all those individuals are at least above the cut-off of 0.20. I’m not really sure, however, whether this threshold is appropriate for this type of aggressive cancer; I can imagine you would want to be a bit less strict.
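
To make that threshold reasoning concrete, this is the kind of quick check I have in mind, sketched in R with hypothetical vectors y (observed aggressive cancer, 0/1) and phat (a model’s predicted risk); the 0.20 cut-off is just my arbitrary choice from above:

```r
# Hypothetical data: y = observed outcome (0/1), phat = predicted risk
threshold <- 0.20

referred <- phat >= threshold   # who would be sent for biopsy under this cut-off
mean(referred)                  # proportion of the population referred
mean(y[referred])               # observed event rate among those referred
mean(y[!referred])              # observed event rate among those NOT referred (risk we would miss)
```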

Looking forward to the rest of this discussion! Thank you for putting in the time and effort of making this exercise.


Well, yes, I will show the decision curves later on and we can then all see how they compare to the other metrics. You are right to think in terms of the regions of the calibration curve that are important. I think the relevant risk thresholds here are in the 5 - 20% range, so your value of 20% would be at the upper limit of what would be reasonable.

However, before you run ahead and choose model B, note that you are wrong about predictions from model B being more spread out (I actually reported the variance of the predictions, and it is higher for model A). Also note that model A has a higher AUC (by about 3 points). How do we know that the miscalibration we see in model A offsets the better discrimination? How bad does the miscalibration have to be to choose a model with lower discrimination? Also note that (i) there is some miscalibration at low risk for model B, and (ii) the Brier score, a metric that combines discrimination and calibration, favors model A.

1 Like

I should have explained my ‘spread out’ comment better. The variance of the predictions from A is indeed higher, but I feel this is mostly driven by the small group of individuals with predicted risks above 0.6, while model B does not seem to predict any risks above 0.6. However, I think we would mostly be interested in the spread near the area where our cut-off would likely be, and here I think model B is still somewhat better off. Model A puts almost 40-50% of the population in a narrow band below a risk of 0.05 (out of reach of a biopsy under any of the thresholds we are considering, even though the actual risk of some of these people is higher than 5%), while model B does so for only about 30-35% of the population. In that sense the spread of the predicted risks from model B is greater at the lower end (0-0.20) than for model A.

Although B is somewhat miscalibrated at lower risks, it appears to be reasonably calibrated from 0.05 to 0.20, where we expect our threshold to be, while the inverse is true for A. Metrics such as the AUC and the Brier score I find increasingly difficult to judge. They are averages over the entire model and all predicted values, and don’t necessarily tell you how well the model performs around the eventual cut-off you decide on. For example, here I think the Brier score for A is lower because of its better calibration at high predicted risks compared with B. But as I said in my previous post, I think this better calibration at high values is relatively unimportant as long as those predicted risks are above your cut-off.
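
To illustrate what I mean by looking at calibration only in the decision-relevant range, a rough R sketch (again with hypothetical y and phat; the 0.05-0.20 band is the threshold range discussed above):

```r
# Hypothetical data: y = observed outcome (0/1), phat = predicted risk
# Restrict attention to patients whose predicted risk falls in the range
# where the biopsy threshold is likely to sit
in_band <- phat >= 0.05 & phat <= 0.20
c(mean_predicted = mean(phat[in_band]),
  mean_observed  = mean(y[in_band]),
  n              = sum(in_band))

# Or compare predicted vs observed within finer bins across that range
bins <- cut(phat[in_band], breaks = seq(0.05, 0.20, by = 0.05), include.lowest = TRUE)
tapply(y[in_band], bins, mean)   # observed event rate per bin of predicted risk
```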

So all in all, despite better overall metrics for A, I’d still go for B, mostly because of the better calibration in the area where we are likely to put our cut-off, and the proportion of the population that would be (correctly) above and below such a threshold. But I agree that from this alone we can’t fully judge which model would ultimately perform best.

Interesting. I think I see how this can be context-dependent. I focus much more on highly aggressive cancers, and in my practice it can be very detrimental when a model (here, model B) gives a probability < 0.5 in a situation where the true probability is much higher than 0.5 (i.e. substantial underprediction at the higher probability ranges). In other contexts, though, the lower probability ranges should be the focus. Looking forward to the decision curve analysis.


So we seem to agree that there are some reasons to believe that B will be better, but we can’t really specify a clear methodology for arriving at this decision. We have a gut instinct about the extent of miscalibration vs. the improvement in discrimination, but if someone disagreed (“a 3 point difference in discrimination is a lot, and the Brier score, which takes calibration into account, is also better”) we wouldn’t have much to say to them.


That’s a good summary I think. What’s the next step? The next scenario?

I have posted the next scenario!


As requested by @cecilejanssens, here are the ROC curves for model A

[ROC curve for Model A, Scenario 1]

And model B
[ROC curve for Model B, Scenario 1]
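
For anyone wanting to draw these themselves, a bare-bones base-R sketch (hypothetical y and phat as before; it ignores ties in the predicted risks, and packages such as pROC do this more carefully):

```r
# Hypothetical data: y = observed outcome (0/1), phat = predicted risk
ord <- order(phat, decreasing = TRUE)      # sort patients from highest to lowest predicted risk
tpr <- cumsum(y[ord]) / sum(y)             # sensitivity as the implicit threshold is lowered
fpr <- cumsum(1 - y[ord]) / sum(1 - y)     # 1 - specificity
plot(fpr, tpr, type = "l", xlab = "1 - Specificity", ylab = "Sensitivity")
abline(0, 1, lty = 2)                      # reference line for an uninformative model
```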

This is a really cool exercise!

I wonder which discrimination performance metrics and curves you use (if any at all)?

I use the c-index and the lift curve when I’m concerned about resource constraints (the lift curve is basically the same thing as using realized net benefit with zero treatment harm).

@Ewout_Steyerberg wrote about Lorenz Curves and Box Plots of the estimated probabilities.
I would use just simple histograms for the estimated probabilities with facets for Events and Non-Events.
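
Something along these lines, sketched with ggplot2 (hypothetical data frame built from predicted risks phat and 0/1 outcomes y):

```r
library(ggplot2)

# Hypothetical data: phat = predicted risk, y = observed outcome (0/1)
dat <- data.frame(phat = phat,
                  outcome = factor(y, levels = c(0, 1), labels = c("Non-event", "Event")))

ggplot(dat, aes(x = phat)) +
  geom_histogram(bins = 20, boundary = 0) +
  facet_wrap(~ outcome, ncol = 1) +
  labs(x = "Estimated probability", y = "Count")
```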

You’ve seen “Assessing the net benefit of machine learning models in the presence of resource constraints” (PubMed)?

Key point: discrimination and calibration are for me, the analyst, to help me improve my model. Decision curves are for the decision maker, to help them decide whether to use a model.
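
For readers who haven’t come across them yet, the quantity plotted on a decision curve is the net benefit of using the model at a given threshold probability. A minimal sketch in R, assuming hypothetical vectors y (observed outcomes) and phat (predicted risks); dedicated packages exist for this, but the core calculation is simple:

```r
# Hypothetical data: y = observed outcome (0/1), phat = predicted risk
# Net benefit at threshold pt: true positives minus false positives weighted
# by the odds of the threshold, per patient
net_benefit <- function(y, phat, pt) {
  n  <- length(y)
  tp <- sum(phat >= pt & y == 1)
  fp <- sum(phat >= pt & y == 0)
  tp / n - (fp / n) * pt / (1 - pt)
}

# Evaluate over the clinically relevant range of thresholds (5-20% in this scenario),
# comparing the model against the default strategies of "biopsy all" and "biopsy none"
pts      <- seq(0.05, 0.20, by = 0.01)
nb_model <- sapply(pts, function(pt) net_benefit(y, phat, pt))
nb_all   <- sapply(pts, function(pt) net_benefit(y, rep(1, length(y)), pt))
nb_none  <- rep(0, length(pts))
```

Plotting nb_model, nb_all and nb_none against pts gives the decision curve; the strategy with the highest net benefit at the decision maker’s threshold is preferred.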

Nigam sent me the preprint half a year ago, after you kindly agreed to share it with me.
I even made a small reproducible example:

https://rpubs.com/uriahf/rnb

How do you use them in order to develop your model?

For me, discrimination and calibration performance metrics/curves are “story-tellers”:

The c-index helps me realize what’s going on because I can translate the number into a story (unlike the PR-AUC or F1 score). The same is true for calibration curves, and that’s why I like the discrete versions, even though the smooth versions don’t throw away information (it’s easier for me to grasp simple proportions).

If discrimination is no good, stop and look for better predictors. If calibration is no good, worry about coefficients and intercept.
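
In case it helps to make “worry about coefficients and intercept” concrete, this is one kind of recalibration check, sketched in R with hypothetical y and phat (the usual logistic recalibration of the linear predictor):

```r
# Hypothetical data: y = observed outcome (0/1), phat = predicted risk
lp <- qlogis(phat)                        # predictions on the log-odds scale

# Calibration-in-the-large: intercept with the slope fixed at 1
# (an estimate near 0 suggests predictions are right on average)
glm(y ~ offset(lp), family = binomial)

# Calibration slope (an estimate near 1 suggests no over- or underfitting)
glm(y ~ lp, family = binomial)
```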
