Why random effects when developing prediction models?

Pavel_Roshanov · March 17, 2022, 3:35am

Hello to all.

I was reading the paper "A clinical prediction rule for the diagnosis of coronary artery disease: validation, updating, and extension", published in the EHJ in 2011.

The authors validate and extend a model to estimate risk of obstructive coronary disease.

One of the models is said to have a random intercept and, for one of the predictors, a random slope based on the hospital from which the patients were recruited.

Can someone offer an explanation of the value of a prediction model with random effects? There is still a single intercept and single predictor coefficients presented. When readers apply the model to patients outside of any of the study centres, how can they possibly apply the cluster-specific slope or intercept? I have not undertaken such an analysis because of these impressions and I wonder if I am mistaken.

Thank you.

f2harrell · March 17, 2022, 12:42pm

I haven’t seen a paper before that added random slopes when the clusters are hospitals. Random intercepts, on the other hand, are logical. They absorb between-hospital differences (whether the random effects can be assumed to be Gaussian, as is usually done, is a matter of debate). Random effects in a sense allow you to extrapolate better to hospitals not in the sample, because the alternative of ignoring hospital differences just covers up the problem and results in some kind of mean estimate for all future observations. Richard McElreath’s Bayesian book Statistical Rethinking has really nice coverage of these concepts. When extrapolating, you could get an average by averaging over the random effects, or you could get a kind of prediction interval that recognizes how different a future hospital could be.
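To make those two extrapolation options concrete, here is a minimal sketch. It assumes a logistic model with a Gaussian random intercept for hospital. The function names (`marginal_risk`, `new_hospital_interval`) and the numbers for the linear predictor and the random-intercept SD are hypothetical, not taken from the paper. One thing it illustrates: averaging over the random intercept on the probability scale is generally not the same as plugging in a random effect of zero.

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def marginal_risk(eta, sigma, n=2001, width=8.0):
    """Average expit(eta + u) over u ~ Normal(0, sigma^2).

    eta   : fixed-effects linear predictor for one patient (log-odds scale)
    sigma : SD of the hospital random intercept
    Uses a simple normalized grid sum over +/- width * sigma.
    """
    if sigma == 0:
        return expit(eta)
    lo = -width * sigma
    step = 2.0 * width * sigma / (n - 1)
    total = 0.0
    norm = 0.0
    for i in range(n):
        u = lo + i * step
        w = math.exp(-0.5 * (u / sigma) ** 2)  # unnormalized Gaussian weight
        total += w * expit(eta + u)
        norm += w
    return total / norm

def new_hospital_interval(eta, sigma, z=1.96):
    """Range of risks for the same patient across roughly 95% of hospitals,
    ignoring uncertainty in the estimated coefficients and in sigma."""
    return expit(eta - z * sigma), expit(eta + z * sigma)

# Hypothetical values, for illustration only.
eta, sigma = -1.2, 0.8
print("risk in a 'typical' hospital (u = 0):", expit(eta))
print("risk averaged over hospitals:        ", marginal_risk(eta, sigma))
print("range across hospitals:              ", new_hospital_interval(eta, sigma))
```

A fully Bayesian fit would also propagate uncertainty in the coefficients and in the random-intercept SD; this sketch treats both as known.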

Pavel_Roshanov · March 17, 2022, 3:29pm

Thank you!

This is what I struggle with: if I simply calculate an average intercept for the model to put in, say, a risk calculator, am I really doing much good by fitting a variable intercept model in the first place?

f2harrell · March 17, 2022, 4:08pm

It can, if there is wide variation between hospitals. Without the random intercept, all the regression coefficients can be a little bit too close to zero.

Pavel_Roshanov · March 17, 2022, 5:22pm

So failing to fit a variable intercept (i.e. random intercept model) in the first place can give different point estimates of predictor coefficients compared to a model with a variable intercept?

My impression was that the variable intercept model would affect the confidence intervals of the coefficients (make them narrower) but not the point estimates. Perhaps this is sometimes incorrect because the proportion of patients with a given value of a predictor (say, men vs women) differs across clusters (e.g. hospitals), and some of the effect being attributed to sex is just an effect of hospital membership. If this is true, however, then hospitals are acting as confounders of the relationship between sex and outcome. Would I not need a variable (random) effect for the predictor effects to capture that?

Practical question: how do I average over the random intercept to calculate a single intercept to include in an equation for a risk calculator? Is this literally just extracting the random intercepts from the mixed model object and calculating their mean?

f2harrell · March 18, 2022, 5:27pm

I don’t know about the last questions, but to your first question: yes, the coefficients can be wrong, usually too close to zero. Failure to account for understood sources of heterogeneity, just like omitting an important predictor in a binary logistic model, results in suboptimal estimates.
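A small numerical sketch of this attenuation, under stated assumptions: a logistic model with one binary predictor whose distribution is the same in every hospital (so hospital is not a confounder), a Gaussian random intercept with SD `sigma`, and hypothetical values for the within-hospital intercept and coefficient. Under those assumptions, a pooled logistic model that ignores hospital targets the log odds ratio of the hospital-averaged risks, which the code computes directly instead of simulating data.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def marginal_risk(eta, sigma, n=2001, width=8.0):
    """Average expit(eta + u) over u ~ Normal(0, sigma^2) with a normalized grid sum."""
    if sigma == 0:
        return expit(eta)
    lo = -width * sigma
    step = 2.0 * width * sigma / (n - 1)
    total = 0.0
    norm = 0.0
    for i in range(n):
        u = lo + i * step
        w = math.exp(-0.5 * (u / sigma) ** 2)  # unnormalized Gaussian weight
        total += w * expit(eta + u)
        norm += w
    return total / norm

def pooled_log_odds_ratio(beta0, beta, sigma):
    """Log odds ratio for x = 1 vs x = 0 after averaging risk over hospitals.

    beta0, beta : within-hospital (conditional) intercept and coefficient
    sigma       : SD of the hospital random intercept
    """
    p0 = marginal_risk(beta0, sigma)
    p1 = marginal_risk(beta0 + beta, sigma)
    return logit(p1) - logit(p0)

# Hypothetical within-hospital intercept and coefficient.
beta0, beta = -1.0, 1.0
for sigma in (0.0, 0.5, 1.0, 2.0):
    print(f"sigma = {sigma}: pooled coefficient = "
          f"{pooled_log_odds_ratio(beta0, beta, sigma):.3f} (conditional = {beta})")
```

With `sigma = 0` the pooled and conditional coefficients coincide; as the between-hospital variation grows, the pooled coefficient shrinks toward zero even though there is no confounding by hospital in this setup.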