Optimising models for calibration

I have been thinking a bit more about this, after talking about PSA screening elsewhere.

Overall, I think it is problematic at the modeling stage to require good calibration only within a certain sub-range of (0, 1).

I reread the decision curve analysis paper and saw that it rightly allowed the threshold to vary (rather than fixing a single threshold and reporting a single statistic); the posts by others also mention this varying of the threshold in decision curve analysis. I think this largely resolves the disagreement about thresholds above: thresholds are fine as long as they are patient specific, i.e., as long as they can be set for each individual decision made by an individual patient. The decision curve analysis threshold is a function of patient utility.
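For concreteness, here is a small sketch of that idea, using the standard net benefit formula from decision curve analysis, where the threshold t encodes the patient's harm:benefit trade-off and is swept over a range rather than fixed. The function name and simulated data are mine, purely illustrative:

```python
import numpy as np

def net_benefit(y_true, p_hat, thresholds):
    """Net benefit of 'treat if predicted risk >= t' at each threshold t.

    Standard decision-curve formula: NB(t) = TP/n - (FP/n) * t/(1 - t).
    The weight t/(1 - t) is where the patient's utilities enter.
    """
    y_true = np.asarray(y_true)
    p_hat = np.asarray(p_hat)
    n = len(y_true)
    nb = []
    for t in thresholds:
        treat = p_hat >= t
        tp = np.sum(treat & (y_true == 1))
        fp = np.sum(treat & (y_true == 0))
        nb.append(tp / n - (fp / n) * t / (1 - t))
    return np.array(nb)

# Simulate perfectly calibrated risks: outcomes drawn with probability p.
rng = np.random.default_rng(0)
p = rng.uniform(0, 1, 1000)
y = rng.binomial(1, p)
ts = np.linspace(0.05, 0.95, 19)
print(net_benefit(y, p, ts))
```

Reading the whole curve, rather than the value at one t, is what lets each patient locate their own threshold on it.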

In theory, at the modeling stage (as opposed to the decision-making stage), because individual patient utilities vary (and therefore the thresholds vary), there cannot be a global threshold above which a model need not be well calibrated.
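This is just the classical Pauker–Kassirer threshold argument; as a sketch (with the U's denoting the utilities of the four treat-by-disease outcomes, my notation), the patient is indifferent between treating and not treating at the risk p* where the two expected utilities are equal:

```latex
p^{*} U_{TP} + (1 - p^{*}) U_{FP} \;=\; p^{*} U_{FN} + (1 - p^{*}) U_{TN}
\quad\Longrightarrow\quad
p^{*} \;=\; \frac{U_{TN} - U_{FP}}{(U_{TN} - U_{FP}) + (U_{TP} - U_{FN})} \;=\; \frac{H}{H + B}
```

where H = U_TN − U_FP is the net harm of treating a non-diseased patient and B = U_TP − U_FN is the net benefit of treating a diseased one. Since the U's vary from patient to patient, so does p*, which is why no single calibration range can cover everyone.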

As an example, consider a patient who needs to know with near certainty that they are having an MI in order to stay in the ED, because for them the disutility of staying is so high (this might occur with a low-risk MI and a patient who distrusts the healthcare system). Another patient might want to stay no matter what, even if the risk is low. The model would need to be well calibrated at high and low probabilities (and anything in between, since most patients fall between these extremes). That said, MI is a special example because for most people a false negative is very bad, as others have mentioned.

In practice, as I have just done in the last sentence, we can probably make some assumptions about the homogeneity of utilities over patients (health is, in my opinion, better behaved in its relationship with utility than other things are). Or we could say something like "our model does not give good estimates in your range of interest" and go from there, as others have proposed. However, it's difficult to do the latter without more studies on how patient utilities are distributed for different problems. Until those distributions are well understood, a threshold imposed at the modeling stage risks removing patient autonomy. Leave the threshold to be set by the decision makers in the field (as others have recommended), and instead be very specific about the uncertainty of the calibration estimator over the entire curve. It is natural for a calibration curve to be poorly estimated in some regions, as a function of the data.
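One crude way to report that uncertainty over the whole curve: a binned calibration estimate with approximate binomial intervals, so that sparsely populated risk regions visibly get wide intervals rather than being silently dropped. Equal-width bins and the normal-approximation interval are simplifying choices of mine, not a recommendation:

```python
import numpy as np

def binned_calibration(y_true, p_hat, n_bins=10, z=1.96):
    """Binned calibration curve with approximate 95% binomial intervals.

    Returns (mean predicted risk, observed rate, half-width, n) per bin.
    Bins with few observations get wide intervals, making explicit where
    the curve is well or poorly estimated. The normal approximation is
    crude when the observed rate is near 0 or 1.
    """
    y_true = np.asarray(y_true, dtype=float)
    p_hat = np.asarray(p_hat)
    edges = np.linspace(0, 1, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p_hat >= lo) & (p_hat < hi)
        n = int(mask.sum())
        if n == 0:
            rows.append((0.5 * (lo + hi), None, None, 0))  # no data here
            continue
        obs = y_true[mask].mean()
        half = z * np.sqrt(obs * (1 - obs) / n)
        rows.append((p_hat[mask].mean(), obs, half, n))
    return rows

# Simulated example: risks concentrated at the low end, so high-risk
# bins are estimated poorly (or not at all).
rng = np.random.default_rng(1)
p = rng.beta(2, 5, 2000)
y = rng.binomial(1, p)
for mean_p, obs, half, n in binned_calibration(y, p):
    if n:
        print(f"pred~{mean_p:.2f}  observed={obs:.2f} +/- {half:.2f}  (n={n})")
```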

Either way, my sense is that to compute one's full expected utility, the model would need to be well calibrated everywhere there is an event (i.e., it need not be well calibrated for covariate patterns and outcomes that do not occur), because one needs to integrate, within the expectation, over events (there is an interesting duality, I think, between expected utility integrating over events and the Pauker threshold varying over patients). Computing expected utility matters if one wishes to optimize a policy.
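To make the integration point concrete, here is a sketch of the expected utility of a threshold policy computed from predicted risks. It is only valid if the risks are calibrated, since the predictions themselves are used as the event probabilities inside the expectation; the specific utility numbers below are hypothetical, chosen so that false negatives are very costly:

```python
import numpy as np

def expected_utility(p_hat, threshold, u_tp, u_fp, u_fn, u_tn):
    """Expected utility per patient of the policy 'treat if p_hat >= threshold'.

    We integrate over events using p_hat as the event probability, so
    miscalibration anywhere p_hat has mass biases the expectation.
    """
    p_hat = np.asarray(p_hat)
    treat = p_hat >= threshold
    u = np.where(treat,
                 p_hat * u_tp + (1 - p_hat) * u_fp,   # treated patients
                 p_hat * u_fn + (1 - p_hat) * u_tn)   # untreated patients
    return u.mean()

# Hypothetical MI-like utilities: missing an event (u_fn) is catastrophic,
# unnecessary treatment (u_fp) only mildly costly, so low thresholds win.
rng = np.random.default_rng(2)
p = rng.uniform(0, 1, 2000)
print(expected_utility(p, 0.2, u_tp=0.9, u_fp=0.8, u_fn=0.0, u_tn=1.0))
print(expected_utility(p, 0.9, u_tp=0.9, u_fp=0.8, u_fn=0.0, u_tn=1.0))
```

Optimizing the threshold (or a richer policy) against such an expectation is exactly where miscalibration in any occupied region of the risk scale would silently distort the answer.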
