Penalized regression with imputation in `rms`

A clinician I am working with is interested in developing a clinical prediction model. Their data contain missing values for some of the covariates as well as the outcome. I know I can use multiple imputation via `Hmisc::aregImpute` to address the missing data.

One of my main concerns is the calibration of the model. This paper by Van Calster et al. recommends using penalized regression to prevent overfitting. I know rms implements penalization via the `pentrace` function, but `pentrace` does not seem to accept a model fit via `fit.mult.impute`, returning the following error:

```
Error in pentrace(f2) : fitter not valid
```
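For reference, here is roughly what I am running (dataset and variable names are placeholders):

```r
library(rms)
library(Hmisc)

# Multiple imputation for the missing covariates and outcome
a <- aregImpute(~ y + x1 + x2 + x3, data = d, n.impute = 10)

# Pooled logistic fit across the imputations (Rubin's rule)
f2 <- fit.mult.impute(y ~ x1 + rcs(x2, 4) + x3, lrm, a, data = d,
                      x = TRUE, y = TRUE)

pentrace(f2, seq(0, 10, by = 0.5))  # this is the call that fails
```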

What are my options for introducing some moderate penalization into the model while accommodating imputation? Is this possible via rms, or is this a scenario where I would need to "roll my own" estimator?

Your input is appreciated and I’m willing to share relevant details where needed.

We need to move to full Bayesian modeling to be able to handle more complexities at one time. The problem you are running into is that penalized maximum likelihood estimation needs a likelihood, but multiple imputation with Rubin's rule for obtaining the final covariance matrix only leads to Wald statistics, i.e., it relies on asymptotic normal distributions. You can try penalized estimation on each imputed dataset, hoping to show that the same penalty works for all imputed datasets, and then hope that Rubin's rule works for the average of all the $\hat{\beta}$.
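For what it's worth, a minimal sketch of what the Bayesian route can look like with the rmsb package (variable names are placeholders, and the prior specification is only illustrative; `priorsd` is taken from earlier rmsb releases, so check the current `blrm` documentation for how shrinkage priors are specified):

```r
library(rms)
library(rmsb)

a <- aregImpute(~ y + x1 + x2 + x3, data = d, n.impute = 20)

# Bayesian logistic model fit to each completed dataset, with posterior
# draws stacked across imputations; skeptical priors on the betas supply
# the penalization (priorsd is assumed from older rmsb versions)
f <- stackMI(y ~ x1 + rcs(x2, 4) + x3, blrm, a, data = d,
             n.impute = 20, priorsd = 1)
```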

Hi Prof. Harrell,

As a follow-up to your comments: if I fit a penalized regression on each of the imputed datasets with different penalties, without going for a full Bayesian model, can Rubin's rule still be applied when I combine the estimated coefficients, given that the forms of the objective functions (i.e., the penalized likelihood functions) differ across datasets?

In addition, I would like to solicit your opinion on fixing the penalty for all imputed datasets (i.e., using the same penalty across all imputed datasets) and applying Rubin's rule to combine the estimated coefficients. For example, when fitting a penalized regression to each imputed dataset, the "optimized" penalties across the different imputed datasets are often similar. In this case, I could fix the penalty parameter in the regression for all imputed datasets and combine the estimated coefficients using Rubin's rule.

I don’t know of any research related to those good questions.

@f2harrell Hi Dr Harrell,
Is it acceptable to use `pentrace()` to prevent overfitting when I have a rare binary outcome in a logistic regression (25 events and 2000 controls)?
When using `pentrace()` for a logistic regression model, can I still interpret the coefficients as odds ratios via `summary()` even with the penalty?
Also, how do I know which penalty grid to use:
`penalty = list(nonlinear = seq(.25, 10, by = .25))` per page 270 of the RMS book, or
`pentrace(f, list(simple = c(0, .2, .4), nonlinear = c(0, .2, .4, .8, 1)))` per the function's help page in R?
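For reference, here is roughly how I would call it (placeholder model; the two grids are taken from the sources above):

```r
library(rms)

f <- lrm(y ~ x1 + rcs(x2, 4) + x3, data = d, x = TRUE, y = TRUE)

# Grid from p. 270 of the RMS book
p1 <- pentrace(f, penalty = list(nonlinear = seq(.25, 10, by = .25)))

# Grid from the pentrace help page
p2 <- pentrace(f, list(simple = c(0, .2, .4),
                       nonlinear = c(0, .2, .4, .8, 1)))

# Refit with the chosen penalty, then summarize
f.pen <- update(f, penalty = p1$penalty)
summary(f.pen)   # are these still interpretable as odds ratios?
```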

Without the use of Bayesian models we don't have a well-thought-out approach for simultaneously handling missing data and penalization. The following approach may work fairly well.

  • Develop multiple imputations using `aregImpute`
  • Run a loop over the multiple imputations
  • For each imputation, fetch the filled-in dataset using the `completer` function
  • On that completed dataset, run `pentrace`
  • Over all imputations, see whether the optimum amount of penalization is consistent, and possibly use the mean penalty in the final fit (see the sketch after this list)
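Putting those steps together, a minimal sketch (placeholder variable names; it extracts each completed dataset with `impute.transcan`, which the `completer` function can also do, and assumes `pentrace`'s `$penalty` element holds the optimum, as in the RMS book examples):

```r
library(rms)
library(Hmisc)

set.seed(1)
a <- aregImpute(~ y + x1 + x2 + x3, data = d, n.impute = 10)

pens <- numeric(a$n.impute)
for (i in seq_len(a$n.impute)) {
  # Fill in the i-th set of imputed values to get a completed dataset
  imp <- impute.transcan(a, imputation = i, data = d,
                         list.out = TRUE, pr = FALSE, check = FALSE)
  di <- d
  di[names(imp)] <- imp

  # Find the optimum penalty on this completed dataset
  fi <- lrm(y ~ x1 + rcs(x2, 4) + x3, data = di, x = TRUE, y = TRUE)
  pens[i] <- pentrace(fi, seq(0, 20, by = 0.5))$penalty
}

pens   # check consistency of the optimum penalty across imputations

# If consistent, refit with the mean penalty, pooling with Rubin's rule;
# whether Rubin's rule strictly applies to penalized fits is the open
# question discussed above
f.pen <- fit.mult.impute(y ~ x1 + rcs(x2, 4) + x3, lrm, a, data = d,
                         penalty = mean(pens), x = TRUE, y = TRUE)
```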

The major problem with all of this is that with 25 events you don't have enough information to estimate the optimum penalty, so this is probably not worth trying.

All you can do here is use data reduction / unsupervised learning to reduce all the predictors down to a single score, and fit that against the outcome. Missing data will be a challenge even for that.
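A minimal sketch of that data-reduction idea, on completed data (placeholder predictor names; the missing data would have to be handled first, e.g. with `transcan`, and that is the hard part):

```r
library(rms)

# Reduce all predictors to a single unsupervised score: the first principal
# component of the scaled predictors (prcomp cannot tolerate NAs, so this
# assumes the missings have already been imputed)
X   <- scale(d[, c("x1", "x2", "x3")])
dd  <- d
dd$pc1 <- prcomp(X)$x[, 1]

# Fit the single score against the outcome
f <- lrm(y ~ pc1, data = dd)
```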

P.S. In the future, look at the top of RMS Discussions - modeling strategy - Datamethods Discussion Forum to see which existing topic to post to. This one should have gone under chapter 3 or 5.