RMS Missing Data

Regression Modeling Strategies: Missing Data

This is the third of several connected topics organized around chapters in Regression Modeling Strategies. The purposes of these topics are to introduce key concepts in the chapter and to provide a place for questions, answers, and discussion around the chapter’s topics.

Overview | Course Notes


Q&A From May 2021 Course

  1. Can what aregImpute() does ‘under the hood’ be described as MICE, or is it a different approach? FH: It is different: aregImpute uses the bootstrap, whereas MICE uses draws from an approximate multivariate normal distribution for the regression coefficients.
  2. If we don’t know/cannot know whether the mechanism is MAR or MCAR, can we just assume that missingness is MAR and then do MI? FH: it all depends on study design / measurement procedures
  3. How can we compare casewise and MI models? Is there any option for validation/calibration after using aregImpute? (I did not find any discussion on datamethods about that). FH: I have some links somewhere about validation in this context; it’s not totally worked out.
  4. How important is the order of the variables in the aregImpute command? FH: It can be important if many variables are missing simultaneously.
  5. How many imputations should we use? See the How_many_imputations package for R and Stata by von Hippel, and the linked article at SAGE Journals.

He has a blog where he explains the concept behind it: How many imputations do you need? | Statistical Horizons

  1. How should missing data be handled when the analysis is planned for a multistate Markov model? As I understand it, pooling with Rubin’s rules can yield rows of the transition intensity matrix that do not sum to zero, so pooling cannot be performed. Are there alternative pooling methods for MI? FH: I was not aware of that. I’ll bet that if you did enough imputations the row sum problem would become small enough to round away.
  2. In environmental studies, there are often missing data for variables that have strong spatial pattern, usually strong positive auto- and cross-correlations. I’ve seen Kriging used for estimating the missing data or sometimes simpler methods suffice. Frank mentioned that he’ll be presenting how to handle missing longitudinal data. Will you be presenting methods for dealing with missing data based on the spatial location of the case with the missing variable(s)? Does spatial imputation imply any special issues? FH: sorry not versed in that
  3. Why does GEE require MCAR? Are linear mixed models better? FH: we will cover this
  4. Isn’t random forest for missing value imputation similar to K nearest neighbors in the extreme case? How is KNN different from the multiple imputation methods covered in the class? FH: I’m not sure what is the extreme case. In general RF is not like KNN. Our multiple imputation methods are based on regression modeling to get predicted values to match on (in one dimension).
  5. How do you utilize multiple imputation with an external validation set? For example, you can’t get the average beta, because you are using the betas from the training set. FH: I’m not understanding the question.
  6. Could you elaborate more on how we deal with missing outcomes and covariates problems when the MAR assumption still holds? So we have both missing outcome and baseline covariates in observational data. As Brandon pointed out, the follow-up rates in the registry data are pretty bad in most long-term registries. Are there any related references that discuss the missing outcome issues, especially when our outcome variable is the most missing part of the dataset? FH: A paper in the orthopedics literature claimed that if you have ⅘ of patients followed that’s good enough. That doesn’t really follow; it depends entirely on the reasons for missing. The only way to really address this is to spend the time and effort to get 100% of the data from a random sample of non-responders so that you can learn about selection bias. Much of observational data is not research quality. Studies need to be prospective with a protocol and mechanisms for getting nearly complete follow-up.
  7. Unlike the fully Bayesian approach, the MI approach often separates the process of dealing with missingness (the imputation stage) from the causal analysis of the completed data. This practice could lead to biased causal estimates when the imputation model is not compatible with the analysis model (e.g., Bartlett et al., 2015). How can we ensure compatibility/congeniality (a concept proposed by Xiao-Li Meng) between the imputation model and the analysis model in practice? FH: Great question. There are good papers about compatible imputation but I haven’t seen one in the causal inference framework.
  8. How do we generate partial effect plots after multiple imputation? Are they generated from the fitted model using coefficients pooled by Rubin’s rules? In addition, how do we pool the c-statistic (or other model performance measures such as adjusted R-squared and the Brier score) after multiple imputation? Do we just average the c-statistics over imputations? How do we account for the uncertainty of the pooled c-statistic? FH: The c-index is less clear but averaging it over imputations is reasonable. Partial effect plots are easy: just use the pooled regression coefficients and the corrected variance-covariance matrix (automatic if using fit.mult.impute and rms::Predict).
  9. 3.8 er an invasive definitive test, Y. It would be conduc
  10. 3.8 We talked today about using the outcome during multiple imputation. However, if one is fitting a mixed effects model on longitudinal data where the outcome is measured every two years (with some missing values) and wants to use multiple imputation for a number of predictors measured at baseline, would it be okay not to include the outcome during multiple imputation? FH: No, you must include the outcome. Sometimes I time-bin the outcome, e.g., add these variables to the imputation model: Y1 = measurement closest to 1 year post baseline, Y2 = measurement closest to 2 years post baseline, etc.
  11. Interesting materials about missing data by Lee et al. and by Carpenter et al. (https://doi.org/10.1002/bimj.202000196). FH: You may want to add these to the supplemental reading list.
  12. I am proposing a 3-yr longitudinal cohort screening study using {clinical, blood, echo, QOL} data collected semiannually or annually to predict outcome pted during screening if called for clinically or at the end of follow-up if risk is sufficiently high to justify conducting it (using a risk level a bit lower than the clinical threshold). Thus, the outcome is binary {Tested/D+, Tested/D- } plus NA. NA proportion could be high. But based on predicted values could yield ordinal outcome, as in radiology: {Tested/Present, Unknown/Likely, Unknown/Unlikely, Tested/Absent } to reduce NA. Viable? Alternatives?
  13. Are there optimal methods to perform imputation for ordinal variables when the purpose is not prediction in itself (such as confirmatory factor analysis)? Or, say, in the case where the outcome in a structural equation model is a latent factor obtained from those ordinal indicators? FH: Not an area I’ve studied, but sometimes we treat an ordinal variable as interval scaled when using PMM and things still work.
  14. For multiple imputation done with the “mice” package, we can specify the method used to impute each variable, such as “pmm”, “logreg”, and “polyreg”. “pmm” was introduced in the lecture and the sample code did not specify the methods. Is it better to treat binary and categorical variables with “pmm” in MI or to specify them differently? FH: PMM is still my preferred default but a polytomous logistic model should work well for categorical variables being imputed. It just runs slower.
  15. Do you know how to pull FMI values for each imputed variable after the multiple imputation is done and before we fit the final model with the “mice” package?
  16. The “Summary and Rough Guidelines” section for missing data provides recommendations based on f (the proportion of observations having any variables missing). What is the most efficient way to calculate that? Is there a function in Hmisc to calculate f?
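
On the last question, f can be computed directly in base R without a dedicated function. The sketch below is only an illustration: the data frame `d` and its variables are made up, and `complete.cases()` is used as one simple way to flag rows with any missing value.

```r
# Hypothetical example data; replace `d` with your own data frame
d <- data.frame(
  age = c(50, NA, 61, 47, 72),
  sbp = c(120, 135, NA, 118, 140),
  sex = c("f", "m", "m", NA, "f")
)

# f: proportion of observations (rows) with at least one variable missing
f <- mean(!complete.cases(d))

# Proportion missing for each variable, for context
prop_missing <- colMeans(is.na(d))

print(f)
print(prop_missing)
```

If only the model variables matter, subset `d` to those columns before computing f, since unrelated columns with missing values would inflate it.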

I have a model built using multiple imputation on one dataset (let’s call it “old_dat”). I am looking to make predictions on a new dataset from this model (“new_dat”) to assess whether there was balance between 2 treatment arms in terms of prognostic factors. old_dat contained no interactions. How should I handle the missing data in this new cohort? I can think of 3 options:

  1. Impute median values of missing covariates since I am only looking to assess balance between the treatment arms and therefore am not interested in calibration so much as discrimination
  2. Create new imputation model using new_dat alone and then make predictions from old_dat onto new_dat
  3. Stack old_dat and new_dat, make new imputation model to impute missing new_dat values, but then still use model originally trained on old_dat to make predictions on new_dat?
    Which do you think is preferred (if any)?

These are all viable solutions. Using the multiple imputation model developed in the first sample is what I usually recommend but if missingness patterns are different then you could argue to build the imputation model from just the second sample. But to your original goal I would look at a simpler solution: Compute the proportion of missing values by treatment and look at the full distributions of non-missing values by treatment.
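
The simpler check suggested above might look like the following base-R sketch. The data frame `new_dat`, the arm variable `treat`, and the covariates are hypothetical stand-ins, and quartiles are used here only as one compact summary of the full distribution.

```r
# Hypothetical new cohort; `treat` and the covariate names are illustrative
new_dat <- data.frame(
  treat = c("A", "A", "A", "B", "B", "B"),
  age   = c(54, NA, 61, 58, 49, NA),
  sbp   = c(130, 125, NA, NA, 142, NA)
)

covars <- setdiff(names(new_dat), "treat")

# Proportion of missing values for each covariate, by treatment arm
# (rows = arms, columns = covariates)
miss_by_arm <- sapply(new_dat[covars],
                      function(x) tapply(is.na(x), new_dat$treat, mean))

# Distribution of the non-missing values by arm (quantiles shown here)
age_quantiles <- tapply(new_dat$age, new_dat$treat, quantile, na.rm = TRUE)

print(miss_by_arm)
print(age_quantiles)
```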

I recently used aregImpute to handle missing values in the dataset used for developing a risk prediction model and I computed pooled estimates by fit.mult.impute. However, I would like to extract imputed data to have a “fixed” dataset for reproducible analyses without the need for running a new imputation process. I extracted the matrix containing the final imputations from the aregImpute object and made a conversion to data.frame. The resulting object had no missing values but when I ran the model (same fitter) I got inconsistent results. Do you have any suggestions?

It’s hard to know without seeing the code. Please add the code in a syntactically correct quoted block, streamlined to the key things we need to see.

The way I would distribute the result to others would be to give them the original data frame (with missings) and the aregImpute object, along with code that, for each of the multiple imputations, inserts the right aregImpute imputation into the data frame to fill in the NAs.
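
A minimal sketch of that fill-in step, following the pattern shown in the Hmisc help for aregImpute, is given below. The simulated data, the variable names, and the helper `complete_i()` are hypothetical; only `aregImpute()` and `impute.transcan()` come from Hmisc.

```r
library(Hmisc)

set.seed(1)
# Hypothetical data with missing values in two predictors
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- d$x1 + d$x2 + rnorm(n)
d$x1[sample(n, 30)] <- NA
d$x2[sample(n, 20)] <- NA

n_imp <- 5
# Run the imputation once; share both `d` and `a` (e.g., saved with saveRDS)
a <- aregImpute(~ y + x1 + x2, data = d, n.impute = n_imp, pr = FALSE)

# Hypothetical helper: rebuild the i-th completed dataset from `d` and `a`
complete_i <- function(i, data, areg) {
  imputed <- impute.transcan(areg, imputation = i, data = data,
                             list.out = TRUE, pr = FALSE, check = FALSE)
  data[names(imputed)] <- imputed
  data
}

completed <- lapply(seq_len(n_imp), complete_i, data = d, areg = a)
```

Note that a single completed dataset will generally not reproduce results from fit.mult.impute, because those pool the fits over all imputations.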

If you use variables in the imputation of predictors that are not themselves predictors in your model, are there any additional assumptions or factors you need to consider, given that the imputation process introduces some information from these variables into your model?

A good question. The imputation model can be richer than the outcome model (but can’t be “poorer”). It is OK for example to include physician ID or zip code in the imputation model when you would not want to include that many categories in the outcome model. I don’t know of any special considerations you’d need to have when including these “extra” variables. The literature points to the compatibility of the two models as being more important for valid inference, for example having the same interactions in both models.

In other settings one has to respect the causal path, e.g., not include future measurements as if they were baseline variables. That restriction is not necessary for the imputation model. It would be interesting to see whether things like collider bias could ruin an imputation model.

Can I use single conditional imputation if missingness is less than 15%?

As mentioned in the RMS course notes, the reason for missingness is far more important than the proportion of missing values. What can you say about the reasons?

In general, missingness more than around 3% may require multiple imputation, but that also depends on the number of non-missing values.

thanks that is great


Dear Prof,
With 20% missing data in some candidate predictors (MAR/MCAR), is it not better to fit the pre-specified models in the imputed datasets, pool the results, then select the better model based on AIC (non-nested models) and proceed? (I only have enough effective sample size for the degrees of freedom needed by two pre-specified models.)

1) Or should we just fit the models in the original dataset and then do a sensitivity analysis with imputed data?
2) If using the imputed datasets, how do we assess calibration, discrimination, etc.?
3) Is any sample size calculation or planning needed if we anticipate missing data at the time of study planning?
Thanks in advance.

Generally speaking, we want to base decisions on the best available information. Here the best is the pooled model fit, but we lose the log-likelihood when using Rubin’s rules to pool multiple imputation fits. You can use Wald \chi^2 test statistics as approximations. But the real question is why use the word candidate when referring to predictors? Why do any model selection? Pre-specified models will preserve statistical inference.

Regarding model validation there are thoughts from others above and possibly at RMS Multivariable Modeling Strategies - modeling strategy - Datamethods Discussion Forum.

Thanks, Prof. We have already pre-specified two models, one with only linear terms and the other with nonlinear terms (rcs), both with the same predictors. That is the only difference between the two pre-specified models. The predictors were selected based on a literature survey and expert opinion. We combined a few of the candidate predictors (that is why we used the term candidate), and a few that are not available at the intended time of prediction or that have very high missingness were not considered for the final list of predictors. We don’t intend to do any data-driven variable selection.

We thought that, out of the two pre-specified models, we would select and proceed with the one with the lower AIC.
We thought the one with linear terms could be a simpler one to present as the final model, if its AIC turns out to be smaller. And we had enough degrees of freedom for both models.
Should we instead pre-specify just one model with nonlinear terms, now with more knots using the degrees of freedom saved, and proceed to present that one?
We will now proceed as you advise.

In that case I would run anova() on the multiply-imputed pooled model and look at the chunk test for all nonlinear effects. First look at its P-value, then at how its Wald \chi^2 compares with twice its degrees of freedom.
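
As a rough sketch of what that could look like with Hmisc and rms: the simulated data, variable names, and knot locations below are assumptions for illustration only. Knots are fixed explicitly so that the spline basis is the same in every imputation, and `test = "Chisq"` requests Wald \chi^2 statistics rather than the F tests that anova() reports by default for ols.

```r
library(Hmisc)
library(rms)

set.seed(2)
# Hypothetical data: x2 acts nonlinearly, x1 has missing values
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- d$x1 + 0.5 * d$x2^2 + rnorm(n)
d$x1[sample(n, 40)] <- NA

a <- aregImpute(~ y + x1 + x2, data = d, n.impute = 10, pr = FALSE)

# Illustrative fixed knot locations (an assumption, not a recommendation)
k <- c(-1.5, -0.5, 0.5, 1.5)

# Fit on each completed dataset and pool with Rubin's rules
fit <- fit.mult.impute(y ~ rcs(x1, k) + rcs(x2, k), ols, a,
                       data = d, pr = FALSE)

an <- anova(fit, test = "Chisq")
print(an)

# Chunk test for all nonlinear effects combined
chi2 <- an["TOTAL NONLINEAR", "Chi-Square"]
df   <- an["TOTAL NONLINEAR", "d.f."]
p    <- an["TOTAL NONLINEAR", "P"]

# TRUE when the Wald chi-square exceeds twice its degrees of freedom
keep_nonlinear <- chi2 > 2 * df
```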

Dear Prof,

I am afraid I did not get the points clearly. By “in that case”, do you mean we go with the two pre-specified models or with only one pre-specified model with nonlinear terms?

I feel like I didn’t fully understand the last sentence. Could you please elaborate?
Did you mean only one pre-specified model with nonlinear terms, dropping the nonlinear terms if they are not significant on the Wald or LR \chi^2 test?

Judging \chi^2 against twice its d.f. is equivalent to using AIC. And this is one place where model selection, choosing between exactly 2 models, is not a bad idea. Either keep all nonlinear terms or delete them all. The P-value provides evidence against the assumption of linearity. AIC provides evidence of improved prediction in new samples. It may be hard to choose between those two.
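
The equivalence can be written out as follows. This is a sketch for the likelihood ratio \chi^2 of the nonlinear terms; after Rubin pooling the log-likelihood is not available, so the Wald \chi^2 stands in as an approximation. Here p is the number of parameters in each model and d.f. = p_{nonlinear} - p_{linear}.

```latex
\begin{aligned}
\mathrm{AIC} &= -2\log L + 2p \\
\chi^2 &= 2\left(\log L_{\text{nonlinear}} - \log L_{\text{linear}}\right) \\
\mathrm{AIC}_{\text{linear}} - \mathrm{AIC}_{\text{nonlinear}}
  &= \left(-2\log L_{\text{linear}} + 2p_{\text{linear}}\right)
   - \left(-2\log L_{\text{nonlinear}} + 2p_{\text{nonlinear}}\right) \\
  &= \chi^2 - 2\,\text{d.f.}
\end{aligned}
```

So the model with nonlinear terms has the smaller AIC exactly when \chi^2 exceeds twice its d.f.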


Thank you so much, Prof.

On this site we click the heart button to like/appreciate a post. But thanks anyway.


Sorry if this is not the most suitable thread for this question, but I hope the answer might be short.

In the context of an RCT, with the plan to analyze the data with an ANCOVA, would the following strategy be “okay” to impute missing data?

library(mice)

mice_imputed <- data.frame(
  original = titanic_train$Age,
  imputed_pmm = complete(mice(titanic_numeric, method = "pmm"))$Age,
  imputed_cart = complete(mice(titanic_numeric, method = "cart"))$Age,
  imputed_lasso = complete(mice(titanic_numeric, method = "lasso.norm"))$Age
)

From this post.

Pretty sure it is not the best option, but would it be okay? Thank you.