When a prediction model is applied to a population other than the one it was developed in, it commonly shows poor calibration-in-the-large, meaning that the average predicted probability does not match the overall outcome proportion. One reason is that disease prevalence may vary widely across populations.
Say, for example, that I have a logistic regression model developed and validated in one population (e.g., in UK), with good calibration. I now wish to apply this model to another population (e.g., in Brazil). All I know is that the overall outcome prevalence in Brazil is, say, 20% (18% – 22%), while in UK it is 10% (8% – 12%).
Is there a way to update my model’s intercept so that I expect to achieve reasonable calibration-in-the-large for the Brazilian population, without any additional data from Brazil (except for the prevalence estimate)?
I thought about taking the linear predictor in the development data, subtracting the model’s intercept, and then finding the real value that, when added back, gives an average predicted probability matching the target prevalence. That value would be my new intercept for Brazil. But this feels “hacky” and ignores uncertainty in the prevalence estimate. Of course, I’m assuming predictor effects are constant across populations.
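To make the idea concrete, here is a minimal sketch of that procedure under stated assumptions: `lp` stands in for the linear predictors from the development data (simulated here, not real data), and the new intercept shift is found by root-finding so that the mean predicted probability equals the target prevalence.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import expit


def recalibrate_intercept(lp, target_prev):
    """Find the additive shift delta such that the average of
    expit(lp + delta) equals the target prevalence."""
    gap = lambda delta: expit(lp + delta).mean() - target_prev
    return brentq(gap, -20.0, 20.0)


# Hypothetical linear predictors from a development population
# with roughly 10% prevalence (simulated for illustration).
rng = np.random.default_rng(0)
lp = rng.normal(loc=-2.3, scale=1.0, size=5000)

# Shift the intercept so the average prediction matches 20%.
delta = recalibrate_intercept(lp, 0.20)
p_new = expit(lp + delta)
```

One could rerun this at the bounds of the prevalence interval (18% and 22%) to get a crude sense of how much the intercept shift moves with the uncertainty in the prevalence estimate.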
Are there any references around this? I only found model updating strategies that use full data from the target population. I’m hoping the prevalence estimate is some kind of “sufficient statistic” for the model’s intercept.
The bottom line is: it would be nice if people could apply prediction models and, similar to using “pre-test probabilities” in binary tests, get probability predictions that account for local expected prevalence. If most models are developed and validated in the developed world, people from places like Brazil may have little hope for calibrated predictions.
Thank you very much!
Your approach is a good one. One tweak: the linear predictor from the existing model (with or without subtraction of the intercept, as this won’t matter in the end) needs to be included in the new fit as an “offset” variable so that its slope is forced to be 1.0. This will result in an intercept increment used to correct the model.
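A small sketch of the offset idea, with simulated stand-in data (the shift of 0.9 and the sample are assumptions for illustration): with the linear predictor entered as an offset, the intercept-only logistic MLE solves the score equation mean(expit(lp + delta)) = mean(y), so the target outcomes enter the fit only through their observed proportion.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import expit

# Hypothetical target-population sample: linear predictors lp from
# the existing model and observed binary outcomes y, simulated from
# a truth whose intercept is shifted by 0.9 (an assumption).
rng = np.random.default_rng(1)
lp = rng.normal(loc=-2.3, scale=1.0, size=2000)
y = rng.binomial(1, expit(lp + 0.9))

# Intercept-only logistic fit with lp as an offset (slope fixed at 1):
# the MLE for the intercept increment solves the score equation below.
delta_hat = brentq(lambda d: expit(lp + d).mean() - y.mean(), -20.0, 20.0)
```

Note that this version requires outcome data from the target population; with only a prevalence estimate, the mean of `y` is replaced by that estimate, which recovers the procedure described in the question.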
I wouldn’t always say that failure of calibration-in-the-large is due to a difference in prevalence. Prevalence is a hard concept to define (unlike conditional probabilities). Failure to calibrate is likely to result from omission of an important risk factor whose distribution varies across countries.
Thank you, Dr. Harrell! That is good to know; I didn’t have the intuition about omitted risk factors. If the next population is older on average, and age is included in the model, then the above procedure may cause miscalibration-in-the-large, as I understand it. Is that correct?
If age is included in the model and is properly transformed (e.g., is modeled flexibly without assuming linearity) then there is a good chance of having no miscalibration due to population age differences.
I see. In that situation, I’m afraid the procedure I described would actually do harm rather than good, since I would be mistakenly modifying an already appropriate intercept.
Thank you very much. This makes me see how important external validation studies are; it is troubling that so few exist.
But you can counteract the problem by having a wider spectrum of patients in the model development sample.
Side note: I’ve taken multiple courses on machine learning, genomics, and many other “prediction-related hot topics”, all of which mention clinical prediction at some point. I’ve taken them online from MIT and Harvard, and in person at The University of British Columbia. None of them ever mentioned these details about calibration, study design, and medical decision-making. Glad to interact with this community now.