RMS Describing, Resampling, Validating, and Simplifying the Model

Good afternoon Professor Harrell and everyone! It's awesome to be here. I'm currently making my way through RMS and Flexible Imputation to round out some material I wasn't exposed to in undergraduate and graduate statistics.

Currently, I am trying to build a diagnostic model on case-control data. Specifically, I am having issues with the missingness structure, as well as with some data that I believe is redundant and needlessly increases the dimensionality of my sample space. I'd like to build a few competing models after imputing and performing redundancy analysis. Our primary goal is a predictive model, so I have been following RMS Chapters 4 and 5 in this regard.

My thoughts on how to proceed without falling into some of the pitfalls that produce over-optimistic models are captured in the two-step process below. I have been wrestling with the thought that including all steps within the same 'double bootstrap' loop would be better, but I feel I need to account for uncertainty in the MI + redundancy-analysis step before any model validation and selection is done.

  1. Redundancy Analysis + MI

    1. Create M imputed datasets (using FCS/additive regression with PMM as an example)
    2. For each imputed dataset, fit flexible parametric additive models (using redun or the like)
    3. 'Pool the results' of the redundancy analysis by computing the relative frequency with which
      the i-th predictor is included in the selected set of predictors across the M datasets
    4. Establish a minimum threshold for the frequencies in step 3, below which a variable is discarded.
      Here I'm a bit confused about how to choose a threshold that will generalize well
    5. Discard from the original data any variables that do not meet the threshold
  2. Model Validation and Selection:

    1. Start with the original (incomplete) data, but screen out the predictors discarded in step
      1.5 above.
    2. For each model i = 1, 2, …, I:
      1. For m = 1, 2, …, M:
        1. Fit model i to imputed dataset m
        2. Bootstrap-validate the fit as in Section 5.3.5 of the online text
      2. Average the M apparent performance estimates over the imputations to get an apparent
        performance for model i, S_{iapp}
    3. Get Generalized Performance Estimates:
      1. For imputed datasets m = 1, 2, …, M

        • For resamples j = 1, 2, …, J
          • For each model i = 1, 2, …, I
            1. Fit model i to the j-th resample of dataset m
            2. Calculate the performance estimate S_{ijm_boot} on that resample
            3. Calculate the additional performance estimate S_{ijm_orig} by evaluating the same
              fit on the full imputed dataset m
      2. For each of the M imputed datasets that now have bootstrap estimates of performance:

        • Calculate the apparent optimism for model i, which is (forgive my lack of markdown
          knowledge) (1/J) * SUM_j (S_{ijm_boot} - S_{ijm_orig})
        • Average the apparent optimism for model i over all M imputed datasets, giving O_{i}
      3. Calculate the optimism-corrected versions of the apparent imputed performance estimates:

        • S_{i_adj} = S_{iapp} - O_{i}
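To make the pooling in steps 1.3–1.5 of my plan concrete, here is a minimal Python sketch of the thresholding idea. The function name `pooled_keep_set` and the variable names are hypothetical; in practice each selected set would come from `redun` output on one imputed dataset:

```python
from collections import Counter

def pooled_keep_set(selected_sets, all_vars, threshold=0.5):
    """Pool redundancy analysis across M imputed datasets.

    selected_sets : list of M sets, each holding the predictors that the
        redundancy analysis retained (the "selected set") for one
        imputed dataset.
    all_vars : all candidate predictor names.
    threshold : keep a predictor only if it appears in the selected set
        in at least this fraction of the M imputations.
    """
    m = len(selected_sets)
    counts = Counter(v for s in selected_sets for v in s)
    return {v for v in all_vars if counts[v] / m >= threshold}

# Hypothetical example with M = 4 imputations:
sets = [{"age", "bmi"}, {"age"}, {"age", "bmi"}, {"age", "wt"}]
keep = pooled_keep_set(sets, {"age", "bmi", "wt"}, threshold=0.5)
# 'age' selected in 4/4, 'bmi' in 2/4, 'wt' in 1/4 -> keep == {'age', 'bmi'}
```
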

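The arithmetic I intend for steps 2.3.2–2.3.3 can be sketched as follows (hypothetical names, with the performance estimates supplied as plain numbers rather than coming from actual model fits):

```python
def optimism_corrected(apparent_by_imp, boot, orig):
    """Optimism-corrected performance for a single model i.

    apparent_by_imp : M apparent performance estimates, one per imputed
        dataset (averaged to give S_{iapp}).
    boot : boot[m][j] = performance of the fit to resample j of imputed
        dataset m, evaluated on that resample (S_{ijm_boot}).
    orig : orig[m][j] = performance of that same fit evaluated on the
        full imputed dataset m (S_{ijm_orig}).
    Returns S_{i_adj} = S_{iapp} - O_i.
    """
    M = len(apparent_by_imp)
    s_app = sum(apparent_by_imp) / M               # S_{iapp}
    diffs = [b - o for bm, om in zip(boot, orig) for b, o in zip(bm, om)]
    optimism = sum(diffs) / len(diffs)             # O_i, averaged over j and m
    return s_app - optimism

# Hypothetical numbers: M = 2 imputations, J = 2 resamples each
s_adj = optimism_corrected(
    apparent_by_imp=[0.80, 0.82],
    boot=[[0.85, 0.83], [0.84, 0.86]],
    orig=[[0.78, 0.80], [0.79, 0.81]],
)
# apparent = 0.81, optimism = 0.05, so s_adj is 0.76
```
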
Am I correct in my approach here? Please let me know if I can elaborate further, and please forgive the markdown; I'd be happy to fix it if necessary.

For Reference: I have used this post here to guide my thinking.

Thank you so much!