Multiple imputation and bootstrapping for internal validation

What is the optimal way to combine multiple imputation and bootstrap to quantify uncertainty in discrimination and calibration metrics for the internal validation of diagnostic or prognostic models?

I am constructing a proportional odds logistic model for the purpose of predicting prognosis of Covid-19 at the time of first contact with the healthcare system (see preprint here). The model is derived on prospectively collected data on roughly 4.500 consecutively diagnosed SARS-CoV-2 positive patients in Iceland.

There are some missing data. I aim to combine multiple imputation (MI) with bootstrapping to report confidence intervals around discrimination and calibration measures of the fitted model. Several simulation studies have shown that some combinations of MI and bootstrap may yield asymptotically biased results and incorrect nominal coverage of the confidence intervals under differing circumstances, see Bartlett & Hughes, Schomaker & Heumann and Wahl et al.

My aim is to understand these papers in more detail and specifically how they would apply to my problem using aregImpute() and fit.mult.impute() from the Hmisc package.

As far as I can tell, the linked papers above (with slight variations) describe three combinations of MI and bootstrap, which they then compare. The Bartlett & Hughes paper summarises these three approaches so well that any attempt of mine to restate them would not do them justice. In the words of Bartlett and Hughes, the three approaches are

  1. Multiple imputation and then bootstrap to estimate the within-imputation complete data variance:

  2. Multiple imputation followed by a pooled percentile of bootstrap resamples with replacement

  1. Bootstrap from the observed data and then apply MI to each bootstrap sample with replacement, then taking a pooled percentile.

Bartlett and Hughes find that only 3) Boot MI percentile approach will provide nominal coverage under uncongeniality or misspecification of the imputation and analysis procedures. If I have misunderstood or mischaracterised the problem, I would appreciate any and all corrections or suggestions for further reading.

Now to the question. I have trouble understanding exactly how aregImpute() performs its multiple imputation and how it fits into the above classification. The way I understand it, aregImpute() first takes bootstraps with replacement from the observations in the entire dataset and the fits the multiple imputation model on the bootstrap sample. It then imputes the missing value of the target variable by finding a non-missing observed value in the entire dataset using complex weighting, and imputing this value. It seems to me this most closely matches to 3) Boot MI, where M = 1, i.e. only one imputation is performed for each bootstrap resample.

The problem arises when I would like to construct a confidence interval around discrimination metrics (such as the C-statistic). calibration metrics (calibration intercept, calibration slope), and most importantly, confidence intervals around the non-parametric calibration curve between observed and predicted risk. In the pre-print, I followed my interpretation of the general recommendations in Regression modelling strategies and Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating, to take a randomly selected singly imputed dataset and repeatedly fit the model to bootstrap resamples with replacement of the singly imputed dataset. I do this 2000 times and construct the confidence intervals using a percentile based approach. During review it was pointed out to me that this approach does not include any of the between-imputation uncertainty.

I can think of two ways to resolve this but I am unsure if they 1) are valid and 2) if they are compatible with any of the general approaches described by the above papers.

The first solution would be to iterate the process at the level of aregImpute(), such that for each step in the iteration process, I would constructed M = 5 imputed datasets, apply fit.mult.impute() to apply the proportional odds regression model to obtain the relevant statistics that are average over the 5 imputed datasets, and extract the relevant statistic. I would repeat this 1000 times and construct a percentile based confidence interval [alpha, 1 - alpha]. This would be enormously computationally expensive as I would have to construct 5000 imputed datasets by fitting 1000 imputation models and then fit an analysis model 1000 times. This seems excessive.

The second solution which I am more happy with would be to iterate over a random draw of singly imputed datasets. First I would run aregImpute() producing M = 100 imputed datasets. Then I would iterate such that for each iteration, a randomly selected singly imputed dataset is selected, and the proportional odds regression model applied to a bootstrap resample with replacement from the selected dataset. Finally, the relevant statistic would be extracted. This would be repeated 1000 times and a percentile based confidence interval would be constructed. This would be less computationally expensive but comes with the drawback of the fitted models ignoring the imputation-corrected variance-covariance matrix.

1 Like

I now see that this has been discussed on datamethods before with the some of the same papers being referenced Bootstrapping CIs in multiply imputed data and Internal model validation (Bootstrapping) and Multiple Imputation

The discussion there was excellent and I hope in can be continued here.

I found a very interesting paper by Noma et al that discusses the problem of obtaining confidence intervals for optimism-corrected statistics from multivariable prediction models. This is a very primitive attempt at their ‘Algorithm 2’, with the added benefit of including the imputation variability. Hopefully this will either spur discussion or at least help someone facing similar challenges. I should warn that this takes roughly 8 hours to complete with 200 imputations and 1000 bootstrap resamples.

As far as I can tell this procedure is a hybrid of approach 2 and 3 from the Bartlett & Hughes paper. First 200 imputed datasets are created, one for each of the 200 bootstrap samples from the original data. Then the parameter is estimated as the mean of 1000 bootstrap samples with replacement from each of the 200 imputed datasets. Finally, a (1 - 2 alpha)% percentile confidence interval is formed by taking the [alpha, 1-alpha] empirical percentiles of the 200 parameters. In this case, the parameters of interest are optimism-corrected discrimination and calibration statistics for a proportional odds model.

n_impute ← 200 # number of imputations
n_events ← 3 # number of event outcomes in proportional odds model
n_B ← 1000 # number of bootstraps for optimism correction

imputation_model ← aregImpute(
formula = ~ outcome + covariates + auxillary_variables,
data = data, n.impute = n_impute, x = TRUE

val_list ← vector(mode = “list”, length = n_events) # initialize list
for (i in 1:n_impute) {
data_forloop ←
data = data,
imputation = i,
list.out = TRUE,
pr = FALSE,
check = FALSE
mod_forloop ← lrm(
formula = outcome ~ covariates,
data = sample(data_forloop, nrow(data_forloop), replace = TRUE),
x = TRUE, y = TRUE
for(j in 1:n_events) {
val_list[[j]][[i]] ← validate(mod_forloop, kint = j, B = n_B)

the following code extracts the matrix of discrimination and calibration measures returned from the validate() function, by outcome level.

df_kint1 ← data.frame(
measure = row.names(, val_list[[1]])),, val_list[[1]])
df_kint2 ← data.frame(
measure = row.names(, val_list[[2]])),, val_list[[2]])
df_kint3 ← data.frame(
measure = row.names(, val_list[[3]])),, val_list[[3]])

Finally, the optimism correct measures for each outcome can easily be extracted. Here I show how to extract the mean and [0.025, 0.975] interval

mean(df_kint3[which(df_kint3$measure == “Emax”), “index.corrected”])
quantile(df_kint3[which(df_kint3$measure == “Emax”), “index.corrected”], probs = c(0.025, 0.975))

If have optimised the computational time by an order of magnitude by parallelising. It now takes roughly 10 minutes. For anyone interested:

outer_B ← 2000
inner_B ← 10
my.cluster ← parallel::makeForkCluster(7)
doParallel::registerDoParallel(cl = my.cluster)
doRNG::registerDoRNG(seed = 1) # each fork of each iteration is independent and reproducible
val_metrics_1 ← foreach(i = 1:outer_B) %dopar {

data1 ←
data = data,
imputation = sample(imputation$n.impute, 1),
list.out = TRUE,
pr = FALSE,
check = FALSE
mod1 ← lrm(
formula = outcome ~ covariates,
data = sample(data1, nrow(data1), replace = TRUE), x = TRUE, y = TRUE)
validate(mod1, kint = 1, B = inner_B)}

parallel::stopCluster(cl = my.cluster)

I also have a parallelised solution to extract confidence intervals around an optimism-corrected calibration curve that includes the variability due to both imputation and bootstrap resampling.