Hello there,
I am imputing 17 candidate predictors, as well as 14 non-predictor variables across which model calibration will be assessed. Nine of the 14 non-predictor variables are just alternate derivations of the candidate predictors (e.g., a categorized continuous predictor) or variables used to derive the candidate predictors.
Using the mice package, I imputed all of these target variables with an imputation model that had the following as predictors: the 17 candidate predictors, the study outcome, and the survey cycle (as 2 candidate predictors were not asked in 2 of the cycles).
I realized I didn’t include the 14 non-predictor variables as predictors in the imputation model, despite them being target variables. How big an issue is this if these 14 variables are not involved in the actual modelling but are only a small part of model validation? After all, I don’t believe these 14 variables inform missingness in the 17 candidate predictors, especially since 9 of the 14 are just alternate derivations of the candidate predictors or variables used to derive them.
Would appreciate any feedback!
Thanks,
Raf
Since you already know that categorized versions of continuous predictors will not fit the true relationships (when they are not flat) well at all, including them in the analysis adds nothing but confusion.
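To make the point about categorized predictors concrete, here is a minimal simulation sketch. Everything in it is assumed for illustration: the linear data-generating model, the sample size, the median split, and the helper name `r_squared` are hypothetical and not taken from the original poster's data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Hypothetical truth: the outcome depends linearly on a continuous predictor.
x = rng.normal(0.0, 1.0, n)
y = 1.0 * x + rng.normal(0.0, 1.0, n)

def r_squared(design, y):
    """Fraction of outcome variance explained by an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), design])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

# Fit once with the continuous predictor, once with a median split of it.
r2_continuous = r_squared(x, y)
x_split = (x > np.median(x)).astype(float)
r2_categorized = r_squared(x_split, y)

print(f"R^2, continuous predictor:   {r2_continuous:.3f}")
print(f"R^2, median-split predictor: {r2_categorized:.3f}")
```

Under these assumptions the median-split version explains noticeably less of the outcome variance than the continuous one, because the step function cannot follow a relationship that is not flat within each category.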
Hello all — long-time reader, first-time poster. I’ve benefited greatly from the discussions on this forum over the years and would appreciate your insights on a missing data question related to imputation model construction.
I’m working with a dataset of approximately 2,000 patients undergoing non-cardiac surgery, where the goal is to evaluate associations between preoperative characteristics and postoperative cardiac complications (within 30 days). One key preoperative variable is renal function, measured via serum creatinine concentration. However, about 4% of patients are missing preoperative creatinine values.
Interestingly, a sizable subset of those with missing preoperative creatinine have postoperative creatinine values available, typically measured on postoperative day 0 or 1. Clinically, postoperative creatinine rarely drops significantly below preoperative values, so a normal postoperative creatinine level usually suggests a normal preoperative value. In contrast, a minority of patients do develop postoperative acute kidney injury (AKI), with a marked postoperative increase above baseline.
My question is:
Is it appropriate to include postoperative creatinine concentration in the imputation model for missing preoperative creatinine values?
Importantly, postoperative creatinine is not included in any of the planned primary regression models. However, postoperative AKI (reflected in elevated postoperative creatinine) could plausibly lie on the causal pathway between preoperative risk factors and postoperative cardiac events—raising concerns about conditioning on a mediator.
I would appreciate thoughts on whether including postoperative creatinine in the imputation model would be methodologically sound or risk introducing bias.
Thank you in advance.
Not only is it proper to include follow-up measurements in a model to impute missing baseline variables, it would be incorrect to not do so, resulting in attenuation of regression coefficients from under-imputing.
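As a rough illustration of why the follow-up measurement is worth including, the sketch below compares how well a missing baseline value is recovered with and without it. All numbers are assumptions made up for the example (effect sizes, noise levels, a larger sample than the roughly 2,000 patients described, and missingness completely at random), and `fit_predict` is a hypothetical helper. It uses a plain regression prediction rather than a full multiple-imputation procedure, so it shows only the information carried by the follow-up value, not the attenuation itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000  # assumed; larger than the study to keep simulation noise small

# Hypothetical data: postoperative creatinine tracks the preoperative value.
pre = rng.normal(1.0, 0.3, n)                    # preoperative creatinine
post = pre + rng.normal(0.0, 0.1, n)             # postoperative creatinine
logit = -2.0 + 1.5 * (pre - 1.0)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))  # cardiac complication

missing = rng.random(n) < 0.04                   # about 4% missing, assumed MCAR
obs = ~missing

def fit_predict(X_obs, target_obs, X_new):
    """Fit OLS with intercept on observed rows, predict for the new rows."""
    A = np.column_stack([np.ones(len(target_obs)), X_obs])
    coef, *_ = np.linalg.lstsq(A, target_obs, rcond=None)
    B = np.column_stack([np.ones(len(X_new)), X_new])
    return B @ coef

def rmse(pred):
    """Root mean squared error against the true (simulated) baseline values."""
    return np.sqrt(np.mean((pred - pre[missing]) ** 2))

# Imputation model using the outcome only.
pred_without = fit_predict(y[obs], pre[obs], y[missing])

# Imputation model using the outcome and the follow-up measurement.
X_with = np.column_stack([y, post])
pred_with = fit_predict(X_with[obs], pre[obs], X_with[missing])

rmse_without = rmse(pred_without)
rmse_with = rmse(pred_with)
print(f"RMSE without postoperative creatinine: {rmse_without:.3f}")
print(f"RMSE with postoperative creatinine:    {rmse_with:.3f}")
```

In this setup the predictions that use postoperative creatinine land much closer to the true baseline values. A real analysis would use proper multiple imputation (for example with mice) with random draws rather than conditional means.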
Thank you for this clear response, Prof. Harrell. It has clarified my analytical plan and confirmed the importance of including follow-up measurements in the imputation model. Much appreciated!
Hello Dr. Harrell,
Hope you are doing well. Thank you for your helpful responses as always. Much appreciated!
Just to confirm on another note: if I am making sex-specific models, would I have to include the stratifier (sex) as an imputation predictor during imputation, which occurs before stratification? Would my sex-specific models become invalid/incorrect if I do not?
Thanks,
Raf
Great question. I hope someone who reads this has knowledge about that. From what you wrote, sex is not officially a stratifier, since stratification implies a single model with certain effects relaxed (e.g., proportional hazards for sex). If you mean sex is a subsetting variable over which completely separate models are developed, then you want either separate imputations by sex or an imputation model that includes all the sex interactions.
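A small simulation sketch of what can go wrong when sex is left out of the imputation model and the data are subset by sex afterwards. The setup is entirely hypothetical: two variables whose relationship differs by sex, 40% of one variable missing completely at random, and a single stochastic regression imputation (the helper `impute`) standing in for a full multiple-imputation run.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

sex = rng.integers(0, 2, n)                  # 0 = female, 1 = male (assumed coding)
x1 = rng.normal(0.0, 1.0, n)                 # fully observed predictor
true_slope = np.where(sex == 1, 1.0, -0.5)   # relationship differs by sex
x2_full = true_slope * x1 + rng.normal(0.0, 1.0, n)

miss = rng.random(n) < 0.4                   # assumed missing completely at random
x2 = np.where(miss, np.nan, x2_full)

def impute(x1, x2, miss, rng):
    """Single stochastic regression imputation of x2 from x1."""
    obs = ~miss
    A = np.column_stack([np.ones(obs.sum()), x1[obs]])
    coef, *_ = np.linalg.lstsq(A, x2[obs], rcond=None)
    resid_sd = np.std(x2[obs] - A @ coef, ddof=2)
    out = x2.copy()
    noise = rng.normal(0.0, resid_sd, miss.sum())
    out[miss] = coef[0] + coef[1] * x1[miss] + noise
    return out

# Pooled imputation that ignores sex entirely.
x2_pooled = impute(x1, x2, miss, rng)

# Separate imputations within each sex.
x2_by_sex = x2.copy()
for s in (0, 1):
    idx = sex == s
    x2_by_sex[idx] = impute(x1[idx], x2[idx], miss[idx], rng)

# Subset to males after imputation and estimate the x2-on-x1 slope.
male = sex == 1
slope_pooled = np.polyfit(x1[male], x2_pooled[male], 1)[0]
slope_by_sex = np.polyfit(x1[male], x2_by_sex[male], 1)[0]
print(f"true male slope: 1.0, pooled: {slope_pooled:.2f}, by sex: {slope_by_sex:.2f}")
```

In this sketch the pooled imputation pulls the male slope toward the average relationship across both sexes, while imputing within sex preserves it. How much this matters in practice depends on how strongly the relationships among variables actually differ by sex and on the fraction of missing values.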
Yes, there will be a male model and a female model. The data are subset after imputation. What do you think?
Probably OK but you are estimating a lot of parameters in the outcome model by borrowing no information from the other sex.
Sounds good. So if I subset by sex after imputing without using sex (the subsetting variable) as an imputation predictor, I would be OK, despite the male model including no information from females and vice versa?
Would information from females matter for the male model after subsetting? After all, within-sex distributions would still be maintained, correct? Thanks.