Optimising models for calibration

Where decisions are being made with tacit thresholds, perhaps the unstated part of the decision-making amounts to an on-the-fly recalibration, adjusting to factors that were not explicit (or explicitly ignored) in the original modeling?

1 Like

An uncalibrated model can have a negative net benefit (NB), i.e., it can be worse than both the treat-all and treat-none strategies.

It’s quite common with ML algorithms that are designed to optimize AUC / c-index.

2 Likes

That’s entirely possible. More likely to me is that risk estimates are trusted but utilities are being tacitly applied in choosing the decision.

1 Like

I think there are a few interesting things that emerge here, as they relate to calibration.

We have a difference between a model for diagnosis and a model for prognosis, the difference between threshold-based and non-threshold-based decision making, the difference between randomized and observational data, and how statistical inference relates to all of it. Overall, it seems that the question on calibration involves all of these things, which makes it particularly difficult to answer.

In the original example, we have a sort of treatment decision (whether to discharge or not) in the setting of a diagnostic problem (whether there is a heart attack). This can be conceptualized, though, with a prognostic model. Suppose we have, for example, information corresponding to an EKG. The treatment decision D is whether to discharge. Let S be a binary RV for 10-year survival. We need to maximize its expectation, E(S|ekg) = P(S=1|ekg) = \sum_d P(S=1|D=d,ekg)\pi(D=d|ekg). We can do this wrt the policy \pi. We do not need to estimate the diagnostic model, P(HEART ATTACK | EKG). Further, the prognostic model, P_n(S=1|D=d,EKG), has to be well calibrated, as in the orange juice example, within some region that is defined by the data.
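A minimal sketch of that maximization, with made-up survival probabilities standing in for a calibrated prognostic model (the numbers and the two-action space are my assumptions, purely for illustration):

```python
# A sketch of the decision rule above, with made-up numbers.
# s_given_d[d] stands in for a calibrated prognostic model
# P(S=1 | D=d, ekg) evaluated at one patient's EKG.
s_given_d = {"discharge": 0.90, "admit": 0.95}

# A deterministic policy pi picks the action maximizing expected
# 10-year survival E(S | ekg); note that no diagnostic model
# P(HEART ATTACK | EKG) is needed anywhere.
best_d = max(s_given_d, key=s_given_d.get)
expected_survival = s_given_d[best_d]
print(best_d, expected_survival)  # admit 0.95
```

The point of the sketch is only that the optimization runs entirely over the prognostic model, so that model is the one that needs calibration.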

One could focus not on discharge but instead on catheterization, though, and then it is more like the diagnostic problem, where the action is whether or not to do a test (which aligns with whether or not to do a prostate biopsy).

I preface the next paragraph: I wrote it entirely based on this discussion here, before reading Against Diagnosis (maybe I read part of it many years ago).

We almost always assume that if we can make the diagnosis, heart attack, we are done: then, we just have to treat. It varies by problem,* but especially with medical conditions that don’t fit into neat diagnostic boxes, this is incorrect. Rather than worry about diagnosis, we are sometimes better off spending our energy thinking about what to do given the information that we have. Then, we can focus on the uncertainty about what will happen after we do it. This is what really matters.

The treatment problem is similar, I think (maybe the same), to what is called the prediction problem in Against Diagnosis. However, I would propose to solve it by finding a policy that maximizes expected utility. In my mind, in this way, with the prognostic orange juice example, and the EKG example above, we make a decision without a threshold.

In my experience, discussing risk estimates can be challenging. For example, with statins, instead of finding a policy that maximizes utility, providers often estimate 10-year ASCVD risk, and if it is greater than or equal to 10%, they recommend a statin. This is the official guideline. This is problematic, in my opinion. It assumes, among other things, that any two patients experience cardiovascular events and statin side effects in the same way. It also assumes that the information in the ASCVD covariates is sufficient to make a decision. This is not what is recommended in Against Diagnosis, I realize. Reading it, I also realize that the ASCVD-based approach is an improvement over how things would otherwise be.
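To make the contrast concrete, here is a hedged sketch (the risk and the utilities are invented) comparing the fixed 10% rule with a patient-specific threshold of the Pauker form, harm / (harm + benefit):

```python
# Contrast a fixed guideline cutoff with a patient-specific
# threshold derived from (hypothetical) utilities. Two patients
# with the same ASCVD risk can then rationally choose differently.
def treatment_threshold(benefit, harm):
    """Risk above which expected utility favors treating.

    benefit: utility gained by treating a patient who would have
    an event; harm: disutility of the treatment itself.
    """
    return harm / (harm + benefit)

risk = 0.12                      # hypothetical 10-year ASCVD risk
guideline = risk >= 0.10         # the fixed 10% rule says: treat

tolerant = treatment_threshold(benefit=1.0, harm=0.05)  # ~0.048
averse = treatment_threshold(benefit=1.0, harm=0.25)    # 0.20

print(guideline, risk >= tolerant, risk >= averse)  # True True False
```

Under the fixed rule both patients are treated; under utility-specific thresholds, the side-effect-averse patient rationally declines.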

With respect to inference, I agree. For example, the variance of the posterior of risk might increase as we approach a risk of 1, especially if 10-year survival is rare. Ultimately there is an expectation over utility and then an outer expectation over the risk model. Not an easy problem.

*For many medical problems, one is done when one makes the diagnosis, but this may be less about reality and more about the medical community’s historical focus on diagnosis (which may be related to antibiotics).

3 Likes

It’s ironic: in reading Against Diagnosis, the hope was that ASCVD would move us away from dichotomization, but then ASCVD just somehow became dichotomized itself.

1 Like

Context dependent! Remember “treat all” / “treat none” is shorthand for “intervene and do something different from what you were going to do in all / none”. There is an analogy with the null hypothesis here (i.e., it generally means no difference between groups, but in an equivalence trial, it means there is a difference between groups). In the prostate biopsy example, where everyone gets a biopsy, you can do a study and try to identify those who had a negative biopsy. In this case, yes, the patient would have been better off without the treatment (i.e., should not have gotten a biopsy).

Now take the case of patients who are getting chemotherapy, and you know that this leads to a 25% relative risk reduction of recurrence. For the sake of illustration, we’ll say that patients should only take the drug if they have at least a 3% absolute risk reduction of recurrence. You do a study and are able to distinguish two groups based on cancer stage. Using silly numbers for the point of illustration, one has a 45% recurrence rate and the other has a 1% recurrence rate. Now you know that the first group must have had a >3% absolute risk reduction (about 11%) and the other group must have had a very small risk reduction (about 0.25%). So that second group was treated, and you know they would have been better off without treatment.
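The arithmetic in that example can be checked in a couple of lines (the numbers are the silly illustrative ones above):

```python
# A constant 25% relative risk reduction converts to very
# different absolute risk reductions depending on baseline risk.
def absolute_risk_reduction(baseline_risk, rrr):
    return baseline_risk * rrr

high = absolute_risk_reduction(0.45, 0.25)  # 0.1125, well above 3%
low = absolute_risk_reduction(0.01, 0.25)   # 0.0025, well below 3%
print(high, low)
```

So the high-stage group clears the 3% bar easily and the low-stage group falls far short of it.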

Finally, remember that in decision curve analysis, you define the outcome as negative. So take the case of a group of patients undergoing surgery, where you look at complication rates and advise patients with high complication rates to avoid surgery. In this case, “treat” means “intervene and don’t do the surgery.”

2 Likes

Thanks! I have a lot to ask about the other paragraphs, but I think the last one kind of answers my question.

So you either have:

  • Fully untreated population (no one had surgery), where you consider a harmful surgery in order to prevent an outcome: p( Y=1 | trt=0 ), assuming surgery might or might not reduce the counterfactual probability p( Y=1 | observed-trt=0, do(trt) ), and taking into account treatment harm that might be worth the risk reduction.

  • Fully treated population (everyone had surgery), where you consider avoiding the harmful surgery for some in order to prevent unnecessary complications.

Would you consider a complication as the outcome in that case? Or the original one?

Doesn’t it justify my first claim?

Under this definition, if everyone had surgery, no one was “treated” (i.e., intervened on by not doing the surgery).

1 Like

I’m not sure what I can add other than:

  1. The outcome is always a negative event you are trying to avoid or identify
  2. The “treatment” is something you’d do to lower the risk of or identify the negative event
  3. Whether you can run a decision curve analysis on a data set in which everyone has had the treatment (or not had the treatment) depends on the context. For instance, if the negative event is a surgical complication, and the “treatment” is to avoid surgery, it only makes sense to analyze a data set in which everyone had surgery (i.e., no one was “treated”). On the other hand, if the negative event is cancer recurrence, and the treatment is to give adjuvant chemotherapy, you can analyze a data set in which no one had chemotherapy or in which everyone had chemotherapy (in both cases you are making the same assumption about a relatively constant relative risk). In the case of cancer biopsy, where the negative event is having cancer, you have to run the analysis on a data set where everyone was biopsied (i.e., everyone treated). So the validation set sometimes has to consist of untreated patients, sometimes has to consist of treated patients, and in some cases it can be either, as long as the cohort is either 100% treated or 100% untreated. (You can actually do some imputation if you have a mixed cohort of treated and untreated, but I wouldn’t particularly recommend it.)

I have been thinking a bit more about this, after talking about PSA screening elsewhere.

Overall, I think that it’s problematic at the modeling stage to allow for calibration only within a certain range of (0,1).

I reread the decision curve analysis paper and saw that it rightly allows the threshold to vary (rather than just fixing the threshold and reporting a single statistic). As I read the posts by others, I see that this varying of the threshold in decision curve analysis was mentioned. I think this sort of resolves the disagreement about thresholds above: thresholds are fine as long as they are patient specific, i.e., as long as they can be set for each individual decision made by an individual patient. The decision curve analysis threshold is a function of patient utility.

In theory, at the stage of modeling as opposed to decision making, because there is variability in individual patient utilities (and therefore in the thresholds), there cannot be a global threshold, above which a model need not be well calibrated.

As an example, consider a patient who needs to know with almost certainty that they have an MI in order to stay in the ED, because the disutility in staying is so high (this might occur with a low risk MI and a patient who has distrust of the healthcare system). Another patient might want to stay no matter what, even if the risk is low. The model would need to be well calibrated for high and low probabilities (and anything in between, since most patients fall between these extremes). This said, MI is a special example because for most people a false negative is very bad, as mentioned by others.

In practice, as I have just done in the last sentence, we can probably make some assumptions about homogeneity of utilities over patients (health is, in my opinion, better behaved in terms of its relationship with utility than are other things). Or, we could say something like “our model does not give good estimates in your range of interest,” and go from there, as was proposed by others. However, it’s difficult to do the latter without more studies on how patient utilities are distributed for different problems. Until these distributions are well understood, a threshold during the modeling stage risks removing patient autonomy. Leave the threshold to be set by the decision makers in the field (as recommended by others), and instead be very specific about the uncertainty of the calibration estimator over the entire curve. It is natural that a calibration curve be poorly estimated in some areas, as a function of the data.

Either way, my sense is that to compute one’s full expected utility, the model would need to be well calibrated everywhere there is an event (i.e., it does not need to be well calibrated for baseline covariates and outcomes that do not exist), because one needs to be able to integrate, within the expectation, over events (there is an interesting duality, I think, between expected utility integrating over events and the Pauker threshold varying over patients). Computing expected utility can be important, if one wishes to optimize over a policy.
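A small sketch of that expectation, with invented utilities: expected utility integrates the calibrated risk over events for each action, so the preferred decision can flip anywhere along the risk scale, which is why calibration matters across the whole curve:

```python
# Expected utility integrates calibrated risk over events for each
# action; the utility values below are hypothetical.
def expected_utility(action, risk, utilities):
    """risk: calibrated P(event=1); utilities[(action, event)]."""
    return (risk * utilities[(action, 1)]
            + (1 - risk) * utilities[(action, 0)])

utilities = {("treat", 1): -1.0, ("treat", 0): -0.3,
             ("wait", 1): -5.0, ("wait", 0): 0.0}

decisions = []
for risk in (0.05, 0.5, 0.95):
    best = max(("treat", "wait"),
               key=lambda a: expected_utility(a, risk, utilities))
    decisions.append(best)
print(decisions)  # ['wait', 'treat', 'treat']
```

If the risk estimates are miscalibrated in the region where the decision flips, the policy chosen by this maximization is wrong for exactly those patients.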

3 Likes

I appreciate the point, Sam. I think it does depend on the context in which the model is being applied. Dealing with MI in the ED is where I spend 80% of my time, and I am not convinced of the need for great calibration except at the low end of risk. Interestingly, there are few patients in between high and low probability. The accelerated diagnostic pathways in place can classify 30–70% of patients (depending on prevalence) as low-risk with a sensitivity of ~99% after only one blood test, ECG, and clinical exam. Apart from one case diagnosed with an ECG (STEMI), the ED docs don’t really diagnose; rather, they decide on disposition. If the true risk is 20% or 50% they don’t care: it is plenty high enough for them to admit to a cardiology ward for further investigation to try and diagnose. To that end, high-end calibration isn’t as important as low-end calibration because the cost of poor calibration isn’t so great. That said, there are likely other conditions in the ED where poor calibration at the high end may delay treatment to the detriment of the patient. I do like your “our model does not give good estimates in your range of interest” (& its corollary) statement. Something I will try to keep in mind when talking with physician colleagues about models.

1 Like

It would be fairly straightforward to create a loss function for calibration. You could either optimise for it (e.g., in an ML context) or use it as an index. Distance from the identity line at each point of the flexible curve would need to be considered, of course, and a weighting centred on some value (I would leave this up to the user, but it could be a threshold probability, the prevalence of the event, whatever) would be required to down-weight miscalibration in the extremes as less important.

I have been thinking of writing this up in a paper for a while, but I would be worried about being overly prescriptive about interpreting calibration, and would also be wary of throwing yet another metric out there when people don’t use calibration curves enough as it is.

Edit: hah, Peter Austin has beaten me to the punch by over half a decade: 10.1002/sim.8281

2 Likes

Take care: while the ED is mostly concerned with the low-range probabilities needed to make decisions about disposition, the cardiology floor, which has to make decisions about kidney-compromising (sometimes leading to permanent dialysis) angiography, is concerned with probabilities in a higher range. We would not want a model that is not well calibrated in the higher ranges to be accidentally used in the latter case. The cutoff between the low and higher ranges is unclear to me, and may depend on the patient. So, as you agreed, disclosing uncertainty across the model’s calibration would be helpful.

2 Likes