We are currently working on deriving a clinical prediction rule for transient ischemic attack (TIA) patients in the emergency department. We are debating a couple of the steps and wanted to ask you to confirm our approach.
We are trying to follow modern methods, including multiple imputation, restricted cubic splines, and careful selection of predictors, and so are following RMS and the rms package. We have a good sample size of about 10,000 patients with about 900 events (a good number of events per parameter).
When we run the fastbw() function on a model with 29 potential predictors, suspected to be associated with the outcome based on clinical knowledge, it keeps 14 factors in the final model. These 14 factors are essentially the 14 most strongly associated predictors from an ANOVA plot (attached). Our question is, do we have to use all 14 predictors for our clinical prediction rule (a nomogram)? Based on our needs and ease of use (to integrate with an existing clinical decision rule), we want to select only 9 or 10 of the 14 predictors by looking at the ANOVA plot and choosing the 9 or 10 most strongly associated ones (e.g., predictors “a” to “i” in the attached plot) instead of all 14.
Is this a “no no”, or are we okay selecting a subset of the significant predictors for the clinical prediction rule? That is, is there something that says we must use all the predictors kept by fastbw()? In general, for deriving a parsimonious clinical prediction model, does one even need to run fastbw() at all? That is, can we just rely on the ANOVA plot, plot(anova(fullmodel)), to pick the most strongly associated predictors that we think would work best for the clinical decision rule?
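For concreteness, this is roughly the workflow we are following in R. It is only a sketch: the dataset, outcome, and predictor names below are placeholders, and our actual model has 29 candidate predictors rather than the few shown.

```r
## Sketch of our workflow (dataset, outcome and predictor names are placeholders)
library(rms)

dd <- datadist(tia); options(datadist = "dd")

# Full pre-specified model; continuous predictors as restricted cubic splines
full <- lrm(stroke90 ~ rcs(age, 4) + rcs(sbp, 4) + a + b + c,   # ... up to 29 candidates
            data = tia, x = TRUE, y = TRUE)

plot(anova(full))                      # ranks predictors by Wald chi-square minus d.f.
fastbw(full, rule = "p", sls = 0.05)   # backward elimination; keeps 14 of the 29 factors
```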
Another question: we are deriving the model on all the TIA patients we have – would it be unusual to try to externally validate (and possibly update) this model on a subset of patients from another study (those at medium risk of stroke according to another, already validated prediction model)? To me it would just be a different setting for the validation, but I wanted to confirm with you.
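For context, the external validation we have in mind would look roughly like the sketch below, with placeholder names: `final_fit` is the derived model and `external` is the medium-risk subset from the other study.

```r
## Sketch of the planned external validation (placeholder object names)
library(rms)

# Predicted 90-day stroke probabilities from the derivation model
p_ext <- predict(final_fit, newdata = external, type = "fitted")

# Discrimination and calibration of those predictions in the new setting
val.prob(p_ext, external$stroke90)
```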
For your main question: variable selection is problematic unless you severely constrain it to remove only a few variables, or you carefully use penalization. So I would not use free-flowing variable selection as you did unless (1) bootstrapping reveals remarkable stability in which variables are selected (this almost never happens) and (2) you use bootstrapping to validate the model, taking variable selection fully into account.
A better overall approach is to use unsupervised learning (data reduction). Sparse principal components is an excellent approach, or you can use variable clustering followed by regular principal components on the clusters to create one-number summaries of the clusters. Or you can use clinical judgement to form the clusters. These data reduction methods are covered in the 2nd edition of RMS.
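A rough sketch of the variable-clustering route, using Hmisc; the predictor names are placeholders, and cutting the dendrogram into clusters still requires judgement:

```r
## Data reduction: variable clustering, then principal components within clusters
library(Hmisc)

vc <- varclus(~ a + b + c + d + e + f, data = tia)  # squared Spearman correlations by default
plot(vc)                                            # inspect the dendrogram, choose clusters

# One-number summary for a cluster, e.g. if a, b, c cluster together:
pc1 <- princomp(~ a + b + c, data = tia, cor = TRUE)$scores[, 1]

# pc1 then enters the outcome model in place of a, b and c
```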
Thanks for this information. When you say “remarkable stability” by bootstrapping, do you mean how often the variables are selected across bootstrap resamples? We did 1000 bootstraps, and among the 10 variables we selected, the one selected least often was selected 888 times. We are validating with the bootstrap using the 10 variables we selected. We are also thinking of applying shrinkage based on the bootstrap validation, but we are not sure, as I’ve read somewhere that shrinkage doesn’t help much and may not be needed for a sample size as large as ours.
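In case it helps to see what we mean by shrinkage: we were considering using the optimism-corrected calibration slope from the bootstrap validation as a heuristic linear shrinkage factor, roughly as sketched below (we are not sure this step is warranted).

```r
## Sketch: heuristic shrinkage via the bootstrap-corrected calibration slope
## (final_fit must have been fit with x = TRUE, y = TRUE; B is illustrative)
v <- validate(final_fit, method = "boot", B = 1000)
shrink <- v["Slope", "index.corrected"]   # optimism-corrected calibration slope

# Multiply the non-intercept coefficients by the shrinkage factor; the intercept
# would then be re-estimated (e.g. by refitting with the shrunken linear
# predictor as an offset) so that overall calibration is preserved
b_shrunk <- shrink * coef(final_fit)[-1]
```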
I will definitely look into the 2nd edition of RMS for future projects.
It sounds as if you are doing a conditional validation. The bootstrap, in order to work properly and not be overly optimistic, needs to repeat any supervised learning steps (such as variable selection) afresh at each iteration.
888/1000 as a minimum selection frequency is unusual, but I suppose it is a byproduct of your large sample size. The bootstrap is not providing anything new; it is just reflecting the relative χ² contributed by each variable in the original analysis.
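To make the distinction concrete: an unconditional validation in rms repeats the backward elimination within every bootstrap resample, along the lines sketched below (B, rule, and sls are only illustrative, and the model must have been fit with x=TRUE, y=TRUE).

```r
## Unconditional validation: backward elimination is redone on each resample,
## so the optimism estimate includes the instability of the selection itself
v <- validate(full, method = "boot", B = 1000, bw = TRUE, rule = "p", sls = 0.05)
print(v)

# The matrix of factors retained in each resample is stored with the result
# (as the "kept" attribute, if memory serves); selection frequencies:
kept <- attr(v, "kept")
sort(colSums(kept), decreasing = TRUE)
```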
As I’m not a statistician, some of these methodologies are new to me. I might have misunderstood question/answer #18 in the following link (“Are automated variable selection methods justified if the only goal is to identify the strongest predictors? No”), and thought we could subselect some of the strongest predictors from the ANOVA plot / fastbw() output for our clinical decision rule.
The data do not possess sufficient information to reliably tell you which predictors are the most important. Feature selection is highly unstable unless the sample size is enormous and there is little collinearity. This is why I prefer data reduction (unsupervised learning) in most cases. The degrees-of-freedom issue relates to the number of opportunities you give a predictor in the modeling process. For example, if you looked at a scatterplot and that made you exclude a variable, you still gave that variable at least one opportunity. If you were considering a quadratic fit and ended up making it linear after looking at something, you spent 2 d.f. on that predictor.