Good afternoon Professor Harrell and everyone! It’s awesome to be here. I’m currently making my way through RMS and Flexible Imputation to round out some material I wasn’t exposed to in undergraduate and graduate statistics.
Currently, I am trying to build a diagnostic model on case-control data. Specifically, I am having issues with the missingness structure, as well as some data that I believe is redundant and needlessly increases the dimensionality of my sample space. I’d like to build a few competing models after imputing and performing redundancy analysis. Our primary goal is a predictive model, so I have been following RMS Chapters 4 and 5 in this regard.
My thoughts on how to proceed without falling into the pitfalls that produce over-optimistic models are the two-step process below. I have been wrestling with the thought that including all steps within the same ‘double bootstrap’ loop would be better, but I feel like I need to account for the uncertainty in the MI + RDA step before any model validation and selection is done.
Redundancy Analysis + MI:
- Create M imputed datasets (using FCS/additive regression with PMM as an example)
- For each dataset, fit flexible parametric additive models (using redun or the like)
- ‘Pool the results’ of the RDA by looking at the relative frequency with which the i-th predictor is included in the selected set of predictors
- Establish a minimum threshold for those frequencies (here I’m a bit confused how to establish a threshold that will generalize well)
- Discard from the original data any variables that do not meet the threshold
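To make the pooling step concrete, here is a minimal sketch of the frequency arithmetic only. The predictor names, the selected sets, and the 0.5 threshold are all made up for illustration; in practice the M selected sets would come from running redun on each imputed dataset:

```python
from collections import Counter

# Hypothetical output of redundancy analysis on each of M = 4 imputed
# datasets: the set of predictors retained (not flagged redundant)
selected_sets = [
    {"age", "bp", "chol", "smoking"},
    {"age", "bp"},
    {"age", "chol"},
    {"age", "bp", "chol"},
]

M = len(selected_sets)
counts = Counter(p for s in selected_sets for p in s)

# Relative frequency with which each predictor survives the RDA
freq = {p: counts[p] / M for p in counts}

# Keep predictors selected in at least, say, 50% of the imputations
# (choosing this threshold is the open question in the post)
threshold = 0.5
kept = {p for p, f in freq.items() if f >= threshold}
```

Here "smoking" appears in only 1 of 4 selected sets (frequency 0.25) and is discarded, while the others survive.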
Model Validation and Selection:
- Start with the original (incomplete) data, but screen out the predictors discarded in the RDA step above
- For each model i = 1, 2, …, I:
  - For m = 1, 2, …, M:
    - Fit model i to imputed dataset m
    - Bootstrap-validate the fit as in Section 5.3.5 of the online text
  - Average the M apparent performance estimates over the imputations for model i to get an apparent performance for model i, S_{i,app}
- Get generalized performance estimates:
  - For imputed datasets m = 1, 2, …, M:
    - For resamples j = 1, 2, …, J:
      - For each model i = 1, 2, …, I:
        1. Fit model i to the j-th resample of dataset m
        2. Calculate the performance estimate S_{ijm,boot} on that resample
        3. Calculate the additional performance estimate S_{ijm,orig} on imputed dataset m itself
  - For each model i = 1, 2, …, I:
    - For each of the M imputed datasets that now have bootstrap estimates of performance, calculate the apparent optimism for model i: (1/J) * SUM_j (S_{ijm,boot} − S_{ijm,orig})
    - Average the apparent optimism for model i over all M imputed datasets to get O_{i}
Calculate the optimism-corrected performance estimates:
- S_{i,adj} = S_{i,app} − O_{i}
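A numeric sketch of the optimism arithmetic for a single model i, with M = 2 imputed datasets and J = 3 resamples. All performance values are placeholders, and the fitting/validation itself is omitted; this only shows how S_{ijm,boot}, S_{ijm,orig}, O_i, and S_{i,adj} combine:

```python
import statistics

# s_boot[m][j]: performance of the fit to resample j, evaluated on resample j
# s_orig[m][j]: performance of that same fit, evaluated on imputed dataset m
s_boot = [[0.50, 0.52, 0.48],
          [0.51, 0.49, 0.50]]
s_orig = [[0.44, 0.46, 0.45],
          [0.45, 0.43, 0.44]]

# Per-imputation optimism: (1/J) * SUM_j (S_{ijm,boot} - S_{ijm,orig})
optimism_by_m = [
    statistics.mean(b - o for b, o in zip(bm, om))
    for bm, om in zip(s_boot, s_orig)
]

# O_i: average the optimism over the M imputed datasets
o_i = statistics.mean(optimism_by_m)

# S_{i,adj} = S_{i,app} - O_i
s_i_app = 0.47  # apparent performance from the earlier step (placeholder)
s_i_adj = s_i_app - o_i
```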
Am I correct in my approach here? Please let me know how I can elaborate more if needed, and please forgive the markdown; I’d be happy to fix it if necessary.
For Reference: I have used this post here to guide my thinking.
Thank you so much!