I frequently get stuck on the recommendations to 1) multiply impute in the presence of missing data and 2) account for all uncertainty in the variable selection/model building steps via the bootstrap (or maybe cross-validation), because there is really no “gold-standard” way to do these two things in tandem.

I came across this paper in a recent lit review and it seems helpful, but I don’t know enough to critically evaluate it.

I’m aware of several other simulation studies that recommend approaches similar to the article above, such as stacking or grouping, but am curious whether anyone has a “favorite” reference on this topic?

Does anyone have a reference for best practice using propensity scores? This is new to me, but we do work with high-dimensional data, where adjusting for other variables would be highly desirable.

There was a thread on this a few months ago with some good references in the comments.

In a different forum, a practicing statistician indicated to me that bootstrapping first and then imputing was generally best. There are cases where comparable results can be obtained (with less computational cost) by imputing first and then bootstrapping, but care needs to be exercised to make sure the variance is not underestimated.
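To make the order described above concrete, here is a minimal sketch of "bootstrap, then impute" for a confidence interval. It is only an illustration under stated assumptions: the data are simulated, the helper names (`impute_once`, `slope`) are hypothetical, and a single stochastic regression imputation per resample stands in for whatever imputation method you would actually use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical): y depends on x; about 30% of x is missing at random.
n = 300
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
x_obs = np.where(rng.random(n) < 0.3, np.nan, x)

def impute_once(x_mis, y, rng):
    """Stochastic regression imputation of x from y (one completed data set)."""
    miss = np.isnan(x_mis)
    A = np.column_stack([np.ones((~miss).sum()), y[~miss]])
    coef, *_ = np.linalg.lstsq(A, x_mis[~miss], rcond=None)
    resid_sd = np.std(x_mis[~miss] - A @ coef, ddof=2)
    filled = x_mis.copy()
    # Predicted value plus random residual noise for each missing x
    filled[miss] = coef[0] + coef[1] * y[miss] + rng.normal(scale=resid_sd, size=miss.sum())
    return filled

def slope(x, y):
    """OLS slope of y on x."""
    return np.polyfit(x, y, 1)[0]

B = 200
boot_slopes = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)            # resample rows, missing values included
    xb = impute_once(x_obs[idx], y[idx], rng)   # impute inside the resample
    boot_slopes[b] = slope(xb, y[idx])

ci = np.percentile(boot_slopes, [2.5, 97.5])
print("percentile 95% CI for the slope:", ci)
```

Because the imputation model is refit inside every resample, the spread of `boot_slopes` reflects both sampling variability and imputation uncertainty, which is the point of putting the bootstrap on the outside.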

Perhaps I am over-complicating it or missing the point completely, but I think the references provided (though helpful!) are largely based on bootstrapping 95% CIs for inference, whereas I am interested in using the bootstrap for internal validation to estimate the optimism in model performance.

Intuitively, I am struggling to wrap my head around how bootstrapping first and then imputing would work in a prediction context, especially if one wants to somehow build in the uncertainty of model selection into the bootstrap process. I need to give this some more thought.
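For readers following along, the optimism bootstrap for internal validation can be sketched as below. This is a simplified illustration, not a recommended implementation: the data are simulated and complete, `fit` and `r2` are hypothetical helpers, and the model is plain least squares. In a real analysis, every data-driven step (imputation, variable selection, transformations) would presumably be repeated inside `fit` on each resample, which is exactly where the difficulty discussed in this thread arises.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 10 candidate predictors, only the first two matter.
n, p = 100, 10
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

def fit(X, y):
    """OLS with intercept; all modeling steps would be repeated here."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def r2(beta, X, y):
    """R^2 of a fixed coefficient vector evaluated on (X, y)."""
    pred = np.column_stack([np.ones(len(y)), X]) @ beta
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

# Apparent performance: model fit and evaluated on the same data
apparent = r2(fit(X, y), X, y)

B = 300
optimism = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    beta_b = fit(X[idx], y[idx])
    # Performance in the resample minus performance of the same fit on the original data
    optimism[b] = r2(beta_b, X[idx], y[idx]) - r2(beta_b, X, y)

corrected = apparent - optimism.mean()
print("apparent R^2:", apparent, "optimism-corrected R^2:", corrected)
```

The average difference between resample performance and original-sample performance estimates how much the apparent index overstates future performance.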

I understood from today’s lecture that one cannot look at the distribution of Y prior to analysis in order to determine an appropriate transformation for the model (e.g. log, sqrt). Can you clarify why? Does this use up “phantom degrees of freedom” if it is not informing my X?
In practice, how would you recommend implementing Gls without knowing the distribution of Y? If in reality Y was heavily skewed, wouldn’t the results be inaccurate? Am I misunderstanding something?
Thanks,

I have a question about items 3 & 4 in 4.12.1 Developing Predictive Models.

Assume we collect the data on X and Y at different times, or we collect several waves of X. Do we need to consider the potential calendar time issue in the multiple imputation process?

Actually, I did not understand the spirit of using X and Y to impute X and Y, because we avoid using the information from Y in the data reduction of X. Following this spirit, should we not also avoid using the information from Y to impute X?

Thank you so much, and I hope you have a great night!

We have discussed that categorizing/dichotomizing a continuous variable is not a good idea. I’m wondering what would be a good way to handle the age variable when we have de-identified data in which we don’t know the exact ages of people older than 90 years? Using age as an ordinal variable in this case does not seem like a good idea, as we would lose information. Is there a good way to handle this situation? Would using multiple imputation for the ages that we don’t know exactly be a good idea in this case?

Actually, my team is thinking about integrating model approximation into calibrate and validate by redoing model selection for each bootstrap resample, so any information about the problems you or your student ran into (or speculated about) would be valuable! Thanks!

Anything’s possible! As things stand now you would just impute the outcome then impute the X’s. Or use one of the other methods here.

At first glance the paper looks useful.

I assume that if you are working in high dimensions that your number of subjects must be large. Then you may not need propensity scores but could directly adjust for covariates.

It’s a tough question. You can’t look “too much”. This relates to model uncertainty, e.g., your final confidence bands don’t know about the earlier abandoned transformations. The Faraway paper covers this. This relates to my quote from the initial part of the course: using the data to guide model specification is almost as dangerous as not doing so.

It is required to use Y to impute X. Regarding calendar time, this would be an ideal variable to include in the imputation model, and is a necessary one in all likelihood.
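A small simulation may help show why leaving Y out of the imputation model is a problem. This is a minimal sketch under simple assumptions (simulated bivariate normal data, X missing completely at random, single stochastic imputation); the variable names are hypothetical and the setup is not taken from the course notes.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 5000
x = rng.normal(size=n)
y = x + rng.normal(size=n)      # true slope of y on x is 1
miss = rng.random(n) < 0.5      # half of x missing completely at random

def slope(x, y):
    """OLS slope of y on x."""
    return np.polyfit(x, y, 1)[0]

# (a) Impute x ignoring y: draw from the observed marginal distribution of x.
x_a = x.copy()
x_a[miss] = rng.normal(x[~miss].mean(), x[~miss].std(ddof=1), size=miss.sum())

# (b) Impute x using y: regress x on y among complete cases, then add residual noise.
b1, b0 = np.polyfit(y[~miss], x[~miss], 1)
resid_sd = np.std(x[~miss] - (b0 + b1 * y[~miss]), ddof=2)
x_b = x.copy()
x_b[miss] = b0 + b1 * y[miss] + rng.normal(scale=resid_sd, size=miss.sum())

slope_without_y = slope(x_a, y)
slope_with_y = slope(x_b, y)
print("slope, imputing without y:", slope_without_y)
print("slope, imputing with y:   ", slope_with_y)
```

Imputed values drawn without reference to Y are unrelated to Y by construction, so they pull the estimated X-Y association toward zero. Including Y in the imputation model preserves the relationship you are trying to estimate.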

A very good question. I’ve taken the easy way out and just left them at age 90. It would be a little better to impute ages above 90, but you would then have to make a parametric distributional assumption because you would be extrapolating.

Hi @f2harrell,
I couldn’t find the reference about propensity scores as a penalised method in your list of papers. You said you have a list of papers for observational studies. Can you please point me to the link you mentioned? Thank you!

Thanks, @f2harrell, for referring others to this discussion. I appreciate @PerPersvensson’s comments, suggesting that maybe there’s something else going on.

In the course, you mentioned that in a repeated-measures study, baseline Y should be a covariate. How would you proceed if there is daily dosing and samples are taken several times within a day on different days? E.g., you have Day1 predose, 4hr, 10hr, and 24hr and Day20 predose, 4hr, 10hr, and 24hr. Intuitively I would’ve let predose be on the RHS and allowed an interaction between hr (continuous spline) and Day (categorical). But if Day1 predose is an adjustment variable, how should we treat Day20 predose?

I don’t know of a paper that has laid this out. This is akin to the fact that you can write principal component analysis as a kind of penalized estimation. I wish I had more.

A great question. I don’t have experience with that “new baseline” design. It may be covered in @Stephen’s amazing Statistical Issues in Drug Development book or in Steve Piantadosi’s clinical trials book.

If I am modeling data across multiple clinical trials, would it make sense to use CV for validation, leaving out one trial at a time as the test set (in this context it wouldn’t be repeated CV), or would the bootstrap still be preferred? If CV is the preferred approach in this context, is there still a way to adjust for trial in this model? Should I assume a compound symmetry correlation matrix for each trial?
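For concreteness, the leave-one-trial-out scheme asked about here could be sketched as follows. This is only an illustration of the mechanics, not a recommendation over the bootstrap: the data are simulated, the number of trials and the helper `fit` are hypothetical, and the model is ordinary least squares without any trial term.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical pooled data: 6 trials, each with its own random intercept shift.
n_trials, n_per = 6, 80
trial = np.repeat(np.arange(n_trials), n_per)
x = rng.normal(size=trial.size)
y = 0.5 * rng.normal(size=n_trials)[trial] + 1.5 * x + rng.normal(size=trial.size)

def fit(x, y):
    """OLS of y on x with intercept."""
    A = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

mse_by_trial = {}
for t in range(n_trials):
    test = trial == t                       # hold out one whole trial
    beta = fit(x[~test], y[~test])          # fit on the remaining trials
    pred = beta[0] + beta[1] * x[test]
    mse_by_trial[t] = float(np.mean((y[test] - pred) ** 2))

print(mse_by_trial)
```

Note that a fixed effect for trial estimated on the training trials cannot be applied to a held-out trial, since that trial's level was never seen. That is one reason this sketch omits a trial term from the prediction model.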

Trial may be too large a unit for the bootstrap to work well. If you can treat trials as exchangeable (as done in most meta-analyses; same as compound symmetric) you can use random effects for trials.
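The link between random effects for trials and compound symmetry can be written out as follows. The notation is introduced here for illustration only: $i$ indexes trials, $j$ and $k$ index subjects within a trial, and $b_i$ is a random intercept for trial $i$.

$$
Y_{ij} = x_{ij}^{\top}\beta + b_i + \varepsilon_{ij}, \qquad b_i \sim N(0, \sigma_b^2), \quad \varepsilon_{ij} \sim N(0, \sigma^2), \quad b_i \text{ independent of } \varepsilon_{ij}
$$

$$
\operatorname{Var}(Y_{ij}) = \sigma_b^2 + \sigma^2, \qquad \operatorname{Cov}(Y_{ij}, Y_{ik}) = \sigma_b^2 \ (j \neq k), \qquad \operatorname{Corr}(Y_{ij}, Y_{ik}) = \rho = \frac{\sigma_b^2}{\sigma_b^2 + \sigma^2}
$$

Observations from different trials are uncorrelated, and every pair within a trial shares the same correlation $\rho$, which is the compound symmetric pattern. One difference worth noting is that the random intercept form implies $\rho \ge 0$.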

Thank you very much for your response. Is there a reason you suggest treating trial as a random effect instead of imposing a compound symmetry correlation matrix for trial via Gls? I was under the impression from the course that Gls is the preferred approach. Also, is there a function in the rms package which allows for random effects?