How to get the unstable variable selection profile of stepwise and lasso during bootstrap?

I hope to get some R examples showing the frequency with which each candidate factor is selected by backward stepwise and the lasso during the bootstrap.

I know that with the backward method, the printed output of the rms::validate function can show "frequencies of numbers of factors retained". Is there also a way to show or calculate, from rms::validate, how often each individual candidate factor was selected?

I also know that boot.stepAIC() focuses on this, but it takes quite a long time and is most suitable when AIC is the selection criterion.

As for lasso, I haven’t found examples on this topic.

Hope to get some examples or links.

I’m not really sure if a package or separate function that does what you want exists. I happen to have tried to do something like this myself some time ago and ran into the same problem. I then wrote a very simple bootstrap function that performs the original fit and the bootstrap for it (relies on the glmnet package). I won’t post all the code here as it’s quite ugly and written very specifically for my own dataset and needs at that moment, but the part that you would need looked somewhat like this:

 # Requires as input a predictor matrix and an outcome vector (pred.mat and out.mat),
 # a lambda grid over which to use the glmnet functions, and the number of bootstrap replications B
 lasso_bootstrapper <- function(pred.mat, out.mat, lambda.grid, B = 500) {

      # Relies on the glmnet package
      library(glmnet)
      n.orig <- nrow(pred.mat)

      # Calculates lambda.min on the complete data (which I used in my analysis)
      lasso_lambda_min <- cv.glmnet(x = pred.mat, y = out.mat, lambda = lambda.grid, alpha = 1)$lambda.min

      # Performs the original fit on the complete data.
      # Note: I was working with a continuous outcome, hence family = "gaussian". Change this if necessary
      fit.orig <- glmnet(x = pred.mat, y = out.mat, family = "gaussian", lambda = lambda.grid, alpha = 1)
      # Extracts the coefficients from the fit on the full dataset
      coef.orig <- predict(fit.orig, s = lasso_lambda_min, type = "coefficients")
      # Data frame for storing results from the bootstrap:
      # one row per replication, one column per coefficient (including the intercept)
      boot.res <- data.frame(matrix(nrow = B, ncol = nrow(coef.orig)))

      for (i in 1:B) {
           # Draw row indices for resampling the original data with replacement
           boot.train <- sample(n.orig, replace = TRUE)
           # Calculate lambda.min for the bootstrap sample
           lasso_lambda_boot_min <- cv.glmnet(x = pred.mat[boot.train, ], y = out.mat[boot.train], lambda = lambda.grid, alpha = 1)$lambda.min

           # Fits the lasso model in the bootstrap sample and extracts coefficients
           fit.boot <- glmnet(x = pred.mat[boot.train, ], y = out.mat[boot.train], family = "gaussian", lambda = lambda.grid, alpha = 1)
           coef.boot <- predict(fit.boot, s = lasso_lambda_boot_min, type = "coefficients")

           # Stores coefficients from the fit in the bootstrap sample in boot.res
           boot.res[i, ] <- as.numeric(coef.boot)
      }

      # Returns boot.res as the output of the function
      boot.res
 }
This will result in a data frame holding all the coefficients from your bootstrap samples. If I recall correctly, a variable that was not selected gets a coefficient of exactly zero, so you can quickly summarize the selection frequencies using apply.
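For instance, summarizing such a boot.res could look like this. The toy values below are made up purely to illustrate the apply call (in the real analysis boot.res comes out of the bootstrap function above):

```r
# Toy stand-in for boot.res: one row per bootstrap replication,
# one column per coefficient (first column is the intercept)
boot.res <- data.frame(
  `(Intercept)` = c(1.2, 0.9, 1.1, 1.0),
  x1 = c(0.8, 0.7, 0.9, 0.6),
  x2 = c(0, 0.1, 0, 0),
  check.names = FALSE
)

# A lasso coefficient of exactly zero means the variable was dropped,
# so the selection frequency is just the share of non-zero estimates
sel.freq <- apply(boot.res[, -1], 2, function(b) mean(b != 0))
sel.freq
# x1 = 1.00, x2 = 0.25
```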

In reply to @f2harrell below: we used it in a context where we first had a well-specified basic model and then tested whether we could find any improvement when adding in (a large number of) predictors from an -omics platform. The eventual decision about their usefulness was based on actual increments in predictive performance, in large part inspired by your blog post. The added performance was (almost) nothing, and this was just one small analysis to get a better idea of what was going on in the background.


Bootstrap variable selection frequency should only be used to show what a bad idea variable selection is, i.e., how unstable it is. The selection frequency is so tightly correlated with the original p-values that it is not telling you anything useful as far as formulating a final model goes.


Yes, I agree. I think the validate function already shows how unstable the variable selection is. It's just that when I explained this, I was asked about the frequency with which each candidate variable was retained.

Thank you for your detailed example @scboone. :smile:
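For future readers: if I read the rms documentation correctly, validate() with bw=TRUE already records which factors were retained in each resample, as a logical "kept" attribute on the returned object, so the per-factor frequencies fall out of colMeans(). A minimal sketch (the simulated data, B = 100, and variable names are just for illustration):

```r
library(rms)

set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- x1 + 0.5 * x2 + rnorm(n)

# x = TRUE, y = TRUE are needed so validate() can resample the data
fit <- ols(y ~ x1 + x2 + x3, x = TRUE, y = TRUE)

# bw = TRUE runs fast backward elimination in every bootstrap resample
val <- validate(fit, B = 100, bw = TRUE)

# "kept" is a logical matrix: one row per resample, one column per
# candidate factor, TRUE if the factor survived backward elimination
freq <- colMeans(attr(val, "kept"))
freq  # per-factor selection frequency
```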

That frequency is almost irrelevant. What's important is: how often was the exact same model selected?
