Calibration plot, calibration intercept, calibration slope

Maybe I don’t quite understand calibration intercept and slope.

If the model mostly overestimates or mostly underestimates risk in the first place, like the two green lines below, shouldn’t this be reflected in the intercept, i.e., a negative intercept and a positive intercept, respectively? Yet both are reported as 0.

The red line, on the other hand, shows typical overfitting (predictions too high at the upper end and too low at the lower end). So I would expect a slope below 1 to adjust for that; here it is 0.34, which makes sense. For the intercept, a value close to 0 is what I would expect for the red line; it is the intercepts of 0 for the green lines that I would not expect.

All the reported slopes make sense to me, but I am not sure about the intercepts.

[Figure: calibration plots (two green lines, one red line) with their reported calibration intercepts and slopes]

This figure was taken from:
Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW, Topic Group ‘Evaluating diagnostic tests and prediction models’ of the STRATOS initiative. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019;17(1):230. http://dx.doi.org/10.1186/s12916-019-1466-7


This has always been a bit strange for me as well. I believe the unusual intercept is largely due to the logistic calibration process, which fits another logistic regression to the log odds of the predicted risks, so it is like drawing a linear model through the non-linear predicted values. Happy to be corrected, though.

In my opinion, the LOWESS smoother is a lot more informative than the calibration slope/intercept, since those summary measures can be misleading.
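As a rough illustration of what I mean, here is a minimal sketch (in Python; the arrays `p_hat` and `y` and the simulated data are hypothetical stand-ins, not from the paper) of a LOWESS-smoothed calibration curve plotted against the diagonal:

```python
# Minimal sketch: LOWESS-smoothed calibration curve for a binary outcome.
# `p_hat` (predicted risks) and `y` (0/1 outcomes) are hypothetical stand-ins.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
p_hat = rng.uniform(0.01, 0.99, size=2000)   # stand-in predicted risks
y = rng.binomial(1, p_hat)                   # stand-in outcomes (well calibrated by construction)

# Smooth the 0/1 outcomes against the predicted risks; frac controls the span.
smoothed = lowess(y, p_hat, frac=2 / 3, it=0)  # returns (p_hat, fitted) pairs sorted by p_hat

plt.plot([0, 1], [0, 1], linestyle="--", color="grey", label="Ideal")
plt.plot(smoothed[:, 0], smoothed[:, 1], label="LOWESS calibration curve")
plt.xlabel("Predicted risk")
plt.ylabel("Observed proportion (smoothed)")
plt.legend()
plt.show()
```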

I think I didn’t quite catch where the confusion stemmed from in the first post.

What “intercept” refers to in the aforementioned paper is whether the mean of the predicted risks is approximately equal to the event rate observed in the sample. It is difficult to eyeball this from the graph because you do not know what the distribution of predicted risks looks like.

For example, take the green line with slope = 2.5 (which is underfit, i.e., the lowest predicted probabilities are not low enough and the highest are not high enough).

  1. If there were a lot of patients with a predicted risk between 0 and 0.2, you’d have a lot of small errors (where “errors” are overpredicted risks)

  2. If there were a few patients with a predicted risk between 0.2 and 1, you’d have a few large errors (where “errors” are underpredicted risks)

  3. The large number of small overpredictions (from 1.) will be cancelled out by the small number of large underpredictions (from 2.)

  4. Therefore, the model will not have, on average, underestimated or overestimated the risk observed in the sample (and the intercept will thus be 0; a small simulation sketch of this follows the list)
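Here is that simulation sketch (hypothetical data, not from the paper): predictions are shrunk toward the overall log odds so that the calibration slope is about 2.5, yet the calibration intercept, estimated in the usual way with logit(p̂) as an offset, comes out near 0.

```python
# Simulation sketch (hypothetical data): predictions that are systematically too
# moderate (calibration slope ~2.5) can still yield a calibration intercept ~0.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 200_000

true_logit = rng.normal(loc=0.0, scale=2.0, size=n)     # true log odds
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))      # observed outcomes

pred_logit = true_logit / 2.5                            # shrunken toward the mean => underfit
# predicted risks would be expit(pred_logit); we work directly on the logit scale

# Calibration slope: logistic regression of y on logit(p_hat).
slope_fit = sm.GLM(y, sm.add_constant(pred_logit),
                   family=sm.families.Binomial()).fit()

# Calibration intercept ("calibration in the large"): slope fixed at 1 via an offset.
intercept_fit = sm.GLM(y, np.ones((n, 1)), offset=pred_logit,
                       family=sm.families.Binomial()).fit()

print("calibration slope     ~", round(float(slope_fit.params[1]), 2))      # ~2.5
print("calibration intercept ~", round(float(intercept_fit.params[0]), 2))  # ~0
```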


It’s not exactly the mean, since we’re on the logit scale. Linear logistic calibration is only misleading if its linearity assumption is badly violated.
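Spelling that out, the standard logistic-recalibration definitions (with $\hat p$ the predicted risk) are

$$\operatorname{logit}\Pr(Y=1) = a + b\,\operatorname{logit}(\hat p) \qquad \text{(the fitted $b$ is the calibration slope)},$$

$$\operatorname{logit}\Pr(Y=1) = a + \operatorname{logit}(\hat p) \qquad \text{(slope fixed at 1; the fitted $a$ is the calibration intercept)}.$$

So the intercept compares observed and predicted risk on the log-odds scale rather than literally comparing means.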


This is a fair point. What about when the distribution of predicted risks occupies only a small part of the range (0, 1)? In my experience when predicting rare events, the resulting linear logistic calibration curve is extrapolated over the full range and can lead to some strange results.


From https://www.fharrell.com/post/modplan/

We typically plot the calibration curve from the 10th smallest predicted risk to the 10th largest in the data.

I’m not sure whether that’s a standard or whether people use alternative criteria at times (e.g., the 95th percentile of predicted risk). I do think it doesn’t make sense to display the full range from 0 to 1 if the vast majority of predictions are between, say, 0 and 0.1, because then you’d be compressing the range where most of your predictions lie by extending the axes all the way to 1.
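For what it’s worth, applying that convention is trivial; a sketch (the beta-distributed predictions are just a rare-event-style stand-in):

```python
# Sketch: restrict the calibration plot to the 10th smallest / 10th largest
# predicted risk, as in the convention quoted above. Data are a stand-in.
import numpy as np

rng = np.random.default_rng(1)
p_hat = np.sort(rng.beta(1, 20, size=5000))   # rare-event-style predicted risks

lo, hi = p_hat[9], p_hat[-10]                 # 10th smallest and 10th largest prediction
print(f"plot the calibration curve over [{lo:.4f}, {hi:.4f}] rather than [0, 1]")
# e.g. with matplotlib: plt.xlim(lo, hi); plt.ylim(lo, hi)
```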


Thanks, this actually explains a lot. When doing rare-event prediction I typically have a lot of data (1M+ observations, usually), so the top 10 may still be below the 0.001th percentile, as you say.

Incidentally, I wonder whether it might be prudent to restrict the LOWESS smoother to the 95% interval of predicted risks as well, but this smacks of conveniently dropping “outliers” and I am hesitant to do so.


I don’t think that is like dropping outliers. It’s just concentrating on the region where you have information.


Alternatively, we often calculate calibration curves with some uncertainty measure around them. Regions with fewer observations simply get inflated standard errors, and you can assess whether the band covers the diagonal.
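One way to do that (a sketch under my own assumptions, not necessarily what is meant here): bootstrap pointwise bands around the LOWESS calibration curve, so that sparse regions naturally get wider bands, and then check where the band covers the diagonal.

```python
# Sketch: bootstrap pointwise bands around a LOWESS calibration curve.
# Hypothetical data; sparse regions of predicted risk get wider bands.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(7)
p_hat = rng.beta(2, 5, size=2000)       # stand-in predicted risks
y = rng.binomial(1, p_hat)              # stand-in outcomes

grid = np.linspace(np.quantile(p_hat, 0.01), np.quantile(p_hat, 0.99), 100)
curves = []
for _ in range(100):                    # bootstrap resamples
    idx = rng.integers(0, len(y), len(y))
    fit = lowess(y[idx], p_hat[idx], frac=0.5, it=0)   # it=0: plain (non-robust) smoother
    curves.append(np.interp(grid, fit[:, 0], fit[:, 1]))

band_lo, band_hi = np.percentile(curves, [2.5, 97.5], axis=0)

# The band covers the diagonal wherever band_lo <= predicted risk <= band_hi.
covers = (band_lo <= grid) & (grid <= band_hi)
print(f"band covers the diagonal over {covers.mean():.0%} of the plotted range")
```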
