# Regression Modeling Strategies: Missing Data

This is the third of several connected topics organized around chapters in Regression Modeling Strategies. The purposes of these topics are to introduce key concepts in the chapter and to provide a place for questions, answers, and discussion around the chapter’s topics.

RMS3

# Q&A From May 2021 Course

1. Can what aregImpute() does ‘under the hood’ be described as MICE, or is it a different approach? FH: It is different: aregImpute uses the bootstrap, whereas MICE takes draws from an approximate multivariate normal distribution for the regression coefficients.
2. If we don’t know, or cannot know, whether the mechanism is MAR or MCAR, can we just assume that missingness is MAR and then do MI? FH: It all depends on the study design and measurement procedures.
3. How can we compare casewise-deletion and MI models? Is there any option for validation/calibration after using aregImpute? (I did not find any discussion of this on datamethods.) FH: I have some links somewhere about validation in this context; it’s not totally worked out.
4. How important is the order of the variables in the aregImpute call? FH: It can be important if there are lots of simultaneous missing values.
5. How many imputations should we use? See the How_many_imputations package in R and Stata by von Hippel (article on SAGE Journals). He also has a blog post where he explains the concept behind it: “How many imputations do you need?” (Statistical Horizons).
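The idea behind the "how many imputations" question can be sketched with the quadratic rule usually attributed to von Hippel: the number of imputations grows with the square of the fraction of missing information (FMI). The sketch below is an illustration under stated assumptions, not the code of the package mentioned above. The function name `imputations_needed`, the default target `cv = 0.05` (the desired coefficient of variation of the standard error estimate), and the example FMI values are all hypothetical choices.

```python
import math


def imputations_needed(fmi: float, cv: float = 0.05) -> int:
    """Quadratic rule: M = 1 + 0.5 * (FMI / CV)^2, rounded up.

    fmi: fraction of missing information, e.g. estimated from a small
         pilot set of imputations (assumed to be supplied by the caller).
    cv:  target coefficient of variation for the standard error estimate
         (0.05 is an illustrative default, not a recommendation).
    """
    if not 0.0 <= fmi < 1.0:
        raise ValueError("fmi must be in [0, 1)")
    if cv <= 0.0:
        raise ValueError("cv must be positive")
    m = 1.0 + 0.5 * (fmi / cv) ** 2
    # Round away floating-point noise before taking the ceiling.
    return math.ceil(round(m, 9))


if __name__ == "__main__":
    for fmi in (0.1, 0.3, 0.5):
        print(f"FMI = {fmi:.1f}: M = {imputations_needed(fmi)}")
```

In practice the FMI is not known in advance, which is why a two-stage approach is used: run a small pilot set of imputations, estimate the FMI from it, then plug a conservative value into the rule.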

1. How should missing data be handled when the analysis is planned for a multistate Markov model? As I understand it, Rubin’s rules can result in row sums within the transition intensity matrix that do not sum to zero, so pooling cannot be performed. Are there alternative pooling methods for MI? FH: I was not aware of that. I’ll bet that if you did enough imputations the row sum problem would get small enough to round away.
2. In environmental studies, there are often missing data for variables that have a strong spatial pattern, usually strong positive auto- and cross-correlations. I’ve seen kriging used for estimating the missing data, and sometimes simpler methods suffice. Frank mentioned that he’ll be presenting how to handle missing longitudinal data. Will you be presenting methods for dealing with missing data based on the spatial location of the case with the missing variable(s)? Does spatial imputation raise any special issues? FH: Sorry, I’m not versed in that.
3. Why does GEE require MCAR? Are linear mixed models better? FH: We will cover this.
4. Isn’t random forest for missing value imputation similar to K nearest neighbors in the extreme case? How is KNN different from the multiple imputation methods covered in the class? FH: I’m not sure what the extreme case is. In general RF is not like KNN. Our multiple imputation methods are based on regression modeling to get predicted values to match on (in one dimension).
5. How do you utilize multiple imputation with an external validation set? For example, you can’t get the average beta, because you are using the betas from the training set. FH: I’m not understanding the question.
6. Could you elaborate more on how we deal with missing outcomes and covariates when the MAR assumption still holds? We have both missing outcomes and missing baseline covariates in observational data. As Brandon pointed out, the follow-up rates are pretty bad in most long-term registries. Are there any related references that discuss the missing outcome issue, especially when the outcome variable is the most incomplete part of the dataset? FH: A paper in the orthopedics literature claimed that if you have ⅘ of patients followed that’s good enough. That doesn’t really follow; it depends entirely on the reasons for missingness. The only way to really address this is to spend the time and effort to get 100% of the data from a random sample of non-responders so that you can learn about selection bias. Much of observational data is not research quality. Studies need to be prospective with a protocol and mechanisms for getting nearly complete follow-up.
7. Unlike the fully Bayesian approach, the MI approach often separates the process of dealing with missingness (the imputation stage) from the causal analysis of the completed data. This practice could lead to biased causal estimates when the imputation model is not compatible with the analysis model (e.g., Bartlett and others, 2015). How can we ensure compatibility/congeniality (a concept proposed by Xiao-Li Meng) between the imputation model and the analysis model in practice? FH: Great question. There are good papers about compatible imputation but I haven’t seen one in the causal inference framework.
8. How do we generate partial effect plots after multiple imputation? Are the partial effect plots generated from the fitted model (using pooled coefficients based on Rubin’s rules)? In addition, how do we pool the c-statistic (or other model performance measures such as adjusted R² and the Brier score) after multiple imputation? Do we just take the average of the c-statistics from each imputation? How do we account for the uncertainty of the pooled c-statistic? FH: The c-index is less clear, but averaging it over imputations is reasonable. Partial effect plots are easy: just use the pooled regression coefficients and the corrected variance-covariance matrix (automatic if using fit.mult.impute and rms::Predict).
10. 3.8 We talked today about using the outcome during multiple imputation. However, if one is fitting a mixed effects model on longitudinal data where the outcome is measured every two years (with some missing values) and wants to use multiple imputation for a number of predictors measured at baseline, would it be okay not to include the outcome during multiple imputation? FH: No, you must include the outcome. Sometimes I time-bin the outcome, e.g. add these variables to the imputation model: Y1 = measurement closest to one year post baseline, Y2 = measurement closest to 2 years post baseline, etc.
11. Interesting materials about missing data by Lee et al. and by Carpenter et al. (https://doi.org/10.1002/bimj.202000196). FH: You may want to add this to the supplemental reading list.
12. I am proposing a 3-year longitudinal cohort screening study using {clinical, blood, echo, QOL} data collected semiannually or annually to predict the outcome of an invasive definitive test, Y. It would be conducted during screening if called for clinically, or at the end of follow-up if risk is sufficiently high to justify conducting it (using a risk level a bit lower than the clinical threshold). Thus, the outcome is binary {Tested/D+, Tested/D-} plus NA. The NA proportion could be high. But predicted values could be used to yield an ordinal outcome, as in radiology: {Tested/Present, Unknown/Likely, Unknown/Unlikely, Tested/Absent}, to reduce NA. Is this viable? Are there alternatives?
13. Are there optimal methods to perform imputation for ordinal variables when the purpose is not prediction in itself (such as confirmatory factor analysis)? Or, say, in the case where the outcome in a structural equation model is a latent factor obtained from those ordinal indicators? FH: Not an area I’ve studied, but sometimes we treat an ordinal variable as interval scaled when using PMM and things still work.
14. For multiple imputation done with the “mice” package, we can specify the method used to impute each variable, such as “pmm”, “logreg”, and “polyreg”. “pmm” was introduced in the lecture, and the sample code did not specify the methods. Is it better to treat binary and categorical variables with “pmm” in MI or to specify them differently? FH: PMM is still my preferred default, but a polytomous logistic model should work well for categorical variables being imputed. It just runs slower.
15. Do you know how to pull FMI values for each imputed variable after the multiple imputation is done and before we fit the final model with the “mice” package?
16. The “Summary and Rough Guidelines” section for missing data provides recommendations based on f (the proportion of observations having any variables missing). What is the most efficient way to calculate that? Is there a function in Hmisc to calculate f?
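Several of the questions above (pooling after multiple imputation, and FMI) rest on Rubin's rules. A minimal sketch for a single scalar estimate follows. It assumes you already have one point estimate and one squared standard error per completed dataset; the function name `pool_rubin` and the example numbers are hypothetical. It is not what fit.mult.impute or mice do internally, since those pool whole coefficient vectors and covariance matrices.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one scalar estimate across M completed datasets.

    estimates: point estimates, one per imputation
    variances: squared standard errors, one per imputation
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need matching lists from at least 2 imputations")
    q_bar = mean(estimates)        # pooled point estimate
    w = mean(variances)            # within-imputation variance
    b = variance(estimates)        # between-imputation variance (M - 1 divisor)
    t = w + (1.0 + 1.0 / m) * b    # total variance
    # Proportion of total variance attributable to missing data; a
    # large-sample stand-in for FMI, which adds a degrees-of-freedom term.
    lam = (1.0 + 1.0 / m) * b / t
    return {"estimate": q_bar, "within": w, "between": b,
            "total": t, "se": t ** 0.5, "lambda": lam}


if __name__ == "__main__":
    # Hypothetical coefficient estimates from 5 imputations.
    pooled = pool_rubin([1.0, 1.2, 0.8, 1.1, 0.9], [0.04] * 5)
    for name, value in pooled.items():
        print(f"{name}: {value:.4f}")
```

The between-imputation term is what makes the pooled standard error larger than the average within-imputation standard error, and its share of the total is what the FMI summarizes.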

I have a model built using multiple imputation on one dataset (let’s call it “old_dat”). I am looking to make predictions on a new dataset from this model (“new_dat”) to assess whether there was balance between 2 treatment arms in terms of prognostic factors. The model built on old_dat contained no interactions. How should I handle the missing data in this new cohort? I can think of 3 options:

1. Impute median values for the missing covariates, since I am only looking to assess balance between the treatment arms and am therefore not interested in calibration so much as discrimination.
2. Create a new imputation model using new_dat alone and then make predictions on new_dat from the model fit to old_dat.
3. Stack old_dat and new_dat, build a new imputation model to impute the missing new_dat values, but still use the model originally trained on old_dat to make predictions on new_dat.

Which do you think is preferred (if any)?
Thanks.

These are all viable solutions. Using the multiple imputation model developed in the first sample is what I usually recommend, but if missingness patterns are different then you could argue for building the imputation model from just the second sample. But for your original goal I would look at a simpler solution: compute the proportion of missing values by treatment and look at the full distributions of non-missing values by treatment.
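The first half of that simpler solution, the proportion of missing values by treatment, can be sketched as follows. This is an illustration only: it assumes the data are a list of records with missing values coded as `None`, and the function name `missing_by_arm`, the column names `arm`, `age`, and `ecog`, and the example rows are all hypothetical. It also reports the proportion of rows with any covariate missing in each arm, which is the per-arm version of the quantity f asked about earlier.

```python
def missing_by_arm(rows, arm_key, covariates):
    """Per treatment arm: proportion missing for each covariate, plus the
    proportion of rows with at least one covariate missing."""
    totals = {}
    for row in rows:
        t = totals.setdefault(
            row[arm_key],
            {"n": 0, "any": 0, "by_var": {c: 0 for c in covariates}},
        )
        t["n"] += 1
        # Missing values are assumed to be coded as None.
        missing = [c for c in covariates if row.get(c) is None]
        for c in missing:
            t["by_var"][c] += 1
        if missing:
            t["any"] += 1
    return {
        arm: {
            "n": t["n"],
            "any_missing": t["any"] / t["n"],
            "by_variable": {c: k / t["n"] for c, k in t["by_var"].items()},
        }
        for arm, t in totals.items()
    }


if __name__ == "__main__":
    # Hypothetical stand-in for new_dat.
    new_dat = [
        {"arm": "A", "age": 61, "ecog": 1},
        {"arm": "A", "age": None, "ecog": 0},
        {"arm": "A", "age": 70, "ecog": None},
        {"arm": "A", "age": 55, "ecog": 2},
        {"arm": "B", "age": None, "ecog": None},
        {"arm": "B", "age": 64, "ecog": 1},
    ]
    for arm, summary in missing_by_arm(new_dat, "arm", ["age", "ecog"]).items():
        print(arm, summary)
```

Comparing these proportions across arms addresses missingness balance; the distributions of the non-missing values would still need to be compared separately.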