Calibration Question



Good evening,

I was wondering whether someone more experienced in this area could shed some light on a question about model calibration.

I have two models (let’s call them model1 and model2) which were developed in two independent datasets (this was done not by choice, but by design: one of the models came from the literature and the source data are not available).

When I applied these models to (the same) validation dataset, I got two different calibration curve patterns:
a) the first model has a small amount of bias (when I run the linear regression of the predicted values vs. the observed ones), but the actual graph is a straight line with a regression slope of 0.92
b) the second model has smaller bias, but its calibration slope is 1.47
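For concreteness, here is a minimal sketch of such a calibration regression (Python/NumPy; the data are simulated, not from the actual models, and one common convention is to regress observed on predicted so that ideal calibration means intercept 0 and slope 1):

```python
import numpy as np

def calibration_line(predicted, observed):
    # Regress observed on predicted: observed ~ a + b * predicted.
    # Perfect calibration gives a = 0, b = 1; b > 1 suggests
    # under-dispersed ("underfit") predictions.
    b, a = np.polyfit(predicted, observed, 1)  # polyfit returns [slope, intercept]
    return a, b

# simulated validation data: predictions shrunk toward the mean
rng = np.random.default_rng(0)
observed = rng.normal(10, 3, 500)
predicted = 10 + (observed - 10) / 1.5 + rng.normal(0, 0.5, 500)

a, b = calibration_line(predicted, observed)  # slope b comes out well above 1
```
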

My hunch is that the former model may be the better one to carry forward for future work, assuming one can find a way to calibrate it. Is this the correct conclusion?


When you say small bias for the second model, are you referring to calibration-in-the-large, i.e., a sort of overall average bias such as the mean of all predictions not equaling the mean of all observed values? If so, I would not put much emphasis on the smaller amount of bias for the second model, but would instead be worried by its extreme underfitting.

Calibration problems can be fixed, given a sample to re-calibrate to, but this often makes one wonder whether a third sample is needed to check the re-calibration.
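A linear re-calibration of this sort can be sketched as follows (Python/NumPy; simulated data, and the function names are merely illustrative). Note that on the very sample used to fit the correction, the corrected predictions are guaranteed to show slope 1 and intercept 0, which is exactly why an independent third sample is needed to verify the re-calibration:

```python
import numpy as np

def fit_recalibration(predicted, observed):
    # fit observed ~ a + b * predicted on the re-calibration sample
    b, a = np.polyfit(predicted, observed, 1)
    return a, b

def recalibrate(predicted, a, b):
    # corrected prediction = a + b * original prediction
    return a + b * np.asarray(predicted, dtype=float)

# simulated miscalibrated predictions
rng = np.random.default_rng(1)
observed = rng.normal(50, 10, 300)
predicted = 20 + 0.7 * observed + rng.normal(0, 2, 300)

a, b = fit_recalibration(predicted, observed)
corrected = recalibrate(predicted, a, b)

# re-fitting on the corrected predictions recovers slope 1 and intercept 0
# by construction -- an optimistic in-sample check, not a validation
b2, a2 = np.polyfit(corrected, observed, 1)
```
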


Hi Frank,
Yes, I am referring to calibration-in-the-large, and the bias refers to a non-zero intercept estimate from the calibration plot (predicted vs. observed values in a validation sample).
The estimates of the biases are within 50% of each other, but it so happens that the second model gave a p-value of 0.056 when we tested its intercept against zero (which is something the reviewers asked us to do, although it is rather pointless in this context).

Actually, the context for the question is exactly what you wrote about having a third sample to re-check calibration. We are arguing in an “aims page” (grant) and “intro page” (paper draft) that this is indeed necessary to obtain a reliable model that can be applied widely, rather than relying on either the first or the second model as they stand.

I was wondering whether one could offer a theoretical justification that the first model appears more promising than the second.
The paper by Steyerberg et al. does not provide much detail (or guidance), and in any case it is geared towards binary outcomes (rather than the continuous one at hand). Van Calster et al. appear to argue for preferring moderate calibration over poorly fitting logistic models, but is there another reference that can be used for continuous-outcome models (e.g., linear regression)?


Hi Christos - you correctly infer that the biostatistics literature on calibration of continuous-response models is lacking. This must be covered in another area of statistics, possibly econometrics, but I just don’t know.

This raises some general issues I’ve always struggled with. Mean squared error of predicted values, combining bias and precision, may be a gold standard. But if a model is to be put into field use and it is not currently well enough calibrated, I suggest not using it. If you do have a current way of calibrating the model (with some sort of verification), then predictive discrimination (e.g., R^2, allowing for a temporary re-calibration) can be emphasized a bit more when making the model choice.

To your question, I wouldn’t put all my eggs in the overall-calibration (intercept) basket. I like measures such as E_{90}, which is the 0.9 quantile of the absolute prediction error over the new sample (over its covariate distribution). This considers both intercept- and slope-type calibration errors.
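For what it’s worth, E_{90} is simple to compute on a validation sample; a minimal sketch (Python/NumPy, with a toy example whose absolute errors are known by construction):

```python
import numpy as np

def e90(predicted, observed):
    """0.9 quantile of the absolute prediction error |predicted - observed|."""
    err = np.abs(np.asarray(predicted, dtype=float) - np.asarray(observed, dtype=float))
    return float(np.quantile(err, 0.9))

# toy check: absolute errors are exactly 1, 2, ..., 10
predicted = np.arange(1.0, 11.0)
observed = np.zeros(10)
value = e90(predicted, observed)  # 0.9 quantile of 1..10
```
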


Thanks Frank. Is there a reference for the E_{90} you would suggest using?

The RMSE (which the machine learning people love) is not an unreasonable metric, as it averages over the bias/variance tradeoff. However, it is difficult for most people to interpret, since there are no commonly accepted cutoffs for declaring superiority. I recall one of my postdoc projects on microarray image analysis, in which the reviewers required all such measures, e.g., MAE, RMSE, MAPE (mean absolute percentage error) and a few others, only for the reviewers (and the investigators) to use the eyeball test to determine which approach was the better.
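For reference, the metrics mentioned here are essentially one-liners; a self-contained sketch (Python/NumPy), with a tiny worked example so the definitions are unambiguous:

```python
import numpy as np

def _err(predicted, observed):
    return np.asarray(predicted, dtype=float) - np.asarray(observed, dtype=float)

def rmse(predicted, observed):
    # root mean squared error
    return float(np.sqrt(np.mean(_err(predicted, observed) ** 2)))

def mae(predicted, observed):
    # mean absolute error
    return float(np.mean(np.abs(_err(predicted, observed))))

def mape(predicted, observed):
    # mean absolute percentage error; assumes no observed value is zero
    observed = np.asarray(observed, dtype=float)
    return float(np.mean(np.abs(_err(predicted, observed) / observed)))

def median_ae(predicted, observed):
    # median absolute error (robust to outlying errors)
    return float(np.median(np.abs(_err(predicted, observed))))

p, o = [1.0, 2.0, 3.0], [2.0, 2.0, 5.0]  # errors: -1, 0, -2
```
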


My Regression Modeling Strategies book briefly discusses it, and it’s in the R rms package’s calibrate functions.

My only thought about the judgement question is to consider ratios of RMSE and look for a worsening by a factor of 1.05 or 1.10. But it’s all very subjective. I also like MAE and median absolute error, as these two are much more robust.
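As a toy illustration of that ratio rule of thumb (Python/NumPy; all numbers below are made up, not from the models being discussed):

```python
import numpy as np

def rmse(predicted, observed):
    e = np.asarray(predicted, dtype=float) - np.asarray(observed, dtype=float)
    return float(np.sqrt(np.mean(e ** 2)))

observed = [3.0, 5.0, 7.0, 9.0]
pred_model1 = [3.2, 4.9, 7.1, 8.8]   # hypothetical model 1 predictions
pred_model2 = [3.6, 4.4, 7.8, 8.2]   # hypothetical model 2 predictions

ratio = rmse(pred_model2, observed) / rmse(pred_model1, observed)
worse = ratio > 1.10  # flag a worsening beyond the ~1.05-1.10 threshold
```
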


Calibration plots and statistics for linear outcomes are used in chemometrics.

Utility-based methods may be another way forward for clinical outcomes (e.g., net benefit, NB), as they reflect both miscalibration and discrimination. They let you incorporate costs for misclassification.