Choosing penalty parameters for elastic net

penalization
prediction

#1

I’m trying to develop a classification model with >70 variables and a training set of only 500 (case prevalence 10%). I’ve used cross-validation before to choose the penalty (lambda) for lasso, similar to what’s described here. But it often selected too few variables. It was suggested to me that elastic net is worth trying.
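(For context, the cross-validated lasso step I used looked roughly like the sketch below; this is scikit-learn with simulated stand-in data in place of my variables, not my actual code.)

```python
# Rough sketch only: cross-validated choice of the lasso penalty (lambda)
# for a binary outcome in scikit-learn.  X and y are simulated stand-ins
# for the real data (500 subjects, ~70 candidate variables, ~10% cases).
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 70))
y = rng.binomial(1, 0.1, size=500)

# L1-penalized logistic regression; Cs is a grid of 1/lambda values and
# the value minimizing 10-fold cross-validated log loss is kept.
lasso_cv = LogisticRegressionCV(
    Cs=25, penalty="l1", solver="saga",
    scoring="neg_log_loss", cv=10, max_iter=5000,
).fit(X, y)

n_selected = int(np.sum(lasso_cv.coef_.ravel() != 0))
print("chosen C (1/lambda):", lasso_cv.C_[0])
print("variables with nonzero coefficients:", n_selected)
```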

But now there are two tuning parameters [alpha (the lasso/ridge mixture) and lambda]. How do I choose both at the same time? Are there any recommended steps?

For the first time (!) @f2harrell’s BBR did not have the answer for me.
Thank you


#2

Like lasso, elastic net is not used for classification but rather for estimation, e.g., risk estimation.

Your effective sample size is about 50, so the number of candidate features that you can reliably analyze is in the range of 1-10. With >70 candidate features the analysis is doomed. For example, if you bootstrap the entire process to check the stability of selected features, you’ll find there is none.
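To make that concrete, here is a rough sketch of such a stability check (simulated data and scikit-learn; the settings are illustrative assumptions, not recommendations): rerun the entire cross-validated selection on bootstrap resamples and tabulate how often each feature is selected.

```python
# Sketch of a bootstrap stability check: repeat the *entire* cross-validated
# lasso selection on bootstrap resamples and record how often each candidate
# feature ends up with a nonzero coefficient.  Data are simulated stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

def selected_features(X, y):
    """Run the full cross-validated lasso once and return the indices of
    features with nonzero coefficients."""
    fit = LogisticRegressionCV(
        Cs=15, penalty="l1", solver="saga",
        scoring="neg_log_loss", cv=5, max_iter=5000,
    ).fit(X, y)
    return np.flatnonzero(fit.coef_.ravel() != 0)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 70))
y = rng.binomial(1, 0.1, size=500)

n_boot = 50
counts = np.zeros(X.shape[1])
for _ in range(n_boot):
    idx = rng.integers(0, len(y), size=len(y))      # bootstrap resample
    counts[selected_features(X[idx], y[idx])] += 1

# Stable selection would show proportions near 1 for the same features;
# with ~50 events and 70 candidates, expect them to be low and scattered.
print(np.round(np.sort(counts / n_boot)[::-1][:10], 2))
```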

As you’ve found, when it is possible to use elastic net (say if your sample size were much, much larger), it is tricky to choose two penalty parameters simultaneously. Many people take the easy shortcut of fixing the ratio of the two penalties, and the software invites you to do this. I know of no theory that backs up that procedure.
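For illustration only, a full two-dimensional cross-validated search over both the mixing parameter and the penalty strength, rather than fixing their ratio, might look like the following sketch (scikit-learn, simulated data, arbitrary grids):

```python
# Sketch of tuning both elastic net parameters at once by cross-validation:
# a 2-D grid over the lasso/ridge mixing parameter (alpha in glmnet terms,
# l1_ratio here) and the penalty strength (lambda, here parameterized as 1/C).
# Simulated data; the larger n reflects the "much larger sample" caveat above.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 70))
y = rng.binomial(1, 0.1, size=2000)

enet_cv = LogisticRegressionCV(
    Cs=20,                                       # grid of 1/lambda values
    l1_ratios=[0.1, 0.3, 0.5, 0.7, 0.9, 1.0],    # grid of lasso/ridge mixes
    penalty="elasticnet", solver="saga",
    scoring="neg_log_loss", cv=10, max_iter=5000,
).fit(X, y)

# Both parameters are chosen jointly from the cross-validated log loss,
# instead of fixing the mix and searching only over lambda.
print("chosen l1_ratio (mix):", enet_cv.l1_ratio_[0])
print("chosen C (1/lambda):", enet_cv.C_[0])
```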

I suggest using severe data reduction (e.g., sparse principal components) followed by ordinary modeling of the 1-5 summary scores coming out of that.
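As a minimal sketch of that workflow (simulated data, with scikit-learn's SparsePCA standing in for the data-reduction step and the number of components chosen arbitrarily):

```python
# Sketch: severe unsupervised data reduction with sparse principal components,
# then an ordinary (unpenalized) logistic model on a few summary scores.
# Simulated stand-in data; 4 components is an arbitrary illustrative choice.
import numpy as np
from sklearn.decomposition import SparsePCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 70))
y = rng.binomial(1, 0.1, size=500)

# The reduction step uses X only -- y never enters it (unsupervised).
Xs = StandardScaler().fit_transform(X)
spca = SparsePCA(n_components=4, alpha=1.0, random_state=0).fit(Xs)
scores = spca.transform(Xs)                  # a few summary scores per subject

# Ordinary logistic regression on the summary scores (penalty=None requires
# a recent scikit-learn; older versions used penalty="none").
model = LogisticRegression(penalty=None, max_iter=1000).fit(scores, y)
print("estimated risks, first 5 subjects:", model.predict_proba(scores)[:5, 1].round(3))
```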


#3

Thank you! I had previously used lasso to predict each individual’s probability of being a “case”, and then classify them using a probability threshold derived from a desired specificity.

For example, if you bootstrap the entire process to check the stability of selected features, you’ll find there is none.

I came across this exact problem, so instead I surrogate-labeled the much larger original dataset (n>15k, from which the aforementioned 500 were randomly selected) to first reduce the number of variables, before applying the reduced variable set to the 500 gold-standard training set. So there were two cycles of feature selection (I am unsure whether this approach is valid).

Many people take the easy shortcut of fixing the ratio of the two penalties, and the software invites you to do this. I know of no theory that backs up that procedure.

The sparsity of the literature on this made me question using elastic net too. I was surprised that such a well-established, widely used, and frequently taught method had such a glaring “gap” in its procedure, and thought that I was missing something.


#4

This is at odds with what the model does. It gives a direct estimate of a probability. Don’t bring sensitivity, specificity, or thresholds into the procedure.

Such a reduction of the number of variables must use only unsupervised learning. Using Y to drive the variable selection, when this selection is “outside” of the penalized modeling, will result in severe bias.