Choosing penalty parameters for elastic net

bobmcpop · November 11, 2018, 1:57pm

I’m trying to develop a classification model with >70 variables and a training set of only 500 observations (case prevalence 10%). I’ve used cross-validation before to choose the penalty (lambda) for lasso, similar to what’s described here. But often too few variables were selected. It was suggested to me that elastic net is worth trying.

But now there are two tuning parameters [alpha (mixture of lasso/ridge) and lambda]. How do I choose both at the same time? Are there any recommended steps?
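For reference, the two parameters can be written out as follows. This assumes the glmnet-style parametrization, which matches the description of alpha as a lasso/ridge mixture; the thread does not say which software is being used, and \ell(\beta) here stands for the log-likelihood of whatever model is being fitted.

```latex
\hat{\beta} = \arg\min_{\beta}\;
  -\frac{1}{n}\,\ell(\beta)
  + \lambda\left[\alpha\,\lVert\beta\rVert_1
  + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right],
\qquad 0 \le \alpha \le 1,\quad \lambda \ge 0 .
% alpha = 1 gives the lasso, alpha = 0 gives ridge.
% Equivalent two-penalty form:
%   lambda_1 = lambda * alpha          (L1 penalty)
%   lambda_2 = lambda * (1 - alpha)/2  (squared L2 penalty)
% so holding alpha fixed holds the ratio lambda_1 / lambda_2 = 2*alpha / (1 - alpha) fixed.
```

Under this form, tuning lambda alone for a fixed alpha moves both penalties together along a single ray in the (lambda_1, lambda_2) plane.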

For the first time (!) @f2harrell's BBR did not have the answer for me.
Thank you

f2harrell · November 11, 2018, 4:09pm

Like lasso, elastic net is not used for classification but rather for estimation, e.g., risk estimation.

Your effective sample size is about 50 (the number of cases: 10% of 500), so the number of candidate features that you can reliably analyze is in the range of 1-10. With > 70 candidate features the analysis is doomed. For example, if you bootstrap the entire process to check the stability of selected features, you’ll find there is none.
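A minimal sketch of the bootstrap stability check described above, using only the Python standard library. Several things here are assumptions made for illustration and are not from the thread: the data are simulated, the outcome is continuous with a squared-error lasso standing in for a penalized logistic model, and lambda is held fixed, whereas bootstrapping "the entire process" would repeat the cross-validated choice of lambda inside every resample. The function names are hypothetical.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def centre(X, y):
    """Centre every column of X and the outcome y at zero."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    return Xc, [v - y_mean for v in y]

def lasso(X, y, lam, n_sweeps=50):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1 (centred data)."""
    n, p = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    col_ss = [sum(v * v for v in c) / n for c in cols]
    beta = [0.0] * p
    resid = list(y)
    for _ in range(n_sweeps):
        for j in range(p):
            c = cols[j]
            # Covariance of column j with the partial residual (own term added back)
            rho = sum(ci * ri for ci, ri in zip(c, resid)) / n + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam) / col_ss[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid = [ri - delta * ci for ri, ci in zip(resid, c)]
                beta[j] = new
    return beta

def selection_frequencies(X, y, lam, n_boot=20, seed=1):
    """Fraction of bootstrap resamples in which each feature gets a nonzero coefficient."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        Xb, yb = centre([X[i] for i in idx], [y[i] for i in idx])
        beta = lasso(Xb, yb, lam)
        for j in range(p):
            counts[j] += beta[j] != 0.0
    return [c / n_boot for c in counts]

# Simulated data (assumption): 15 correlated features, only the first two carry signal.
rng = random.Random(0)
n, p = 60, 15
X, y = [], []
for _ in range(n):
    shared = rng.gauss(0, 1)
    row = [0.7 * shared + rng.gauss(0, 1) for _ in range(p)]
    X.append(row)
    y.append(0.5 * row[0] + 0.5 * row[1] + rng.gauss(0, 1))

freq = selection_frequencies(X, y, lam=0.15)
for j, f in enumerate(freq):
    print(f"x{j}: selected in {f:.0%} of bootstrap resamples")
```

Features whose selection frequency sits well away from both 0% and 100% are the ones whose inclusion depends on the particular sample drawn.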

As you’ve found, when it is possible to use elastic net (say if your sample size were much, much larger), it is tricky to simultaneously choose 2 penalty parameters. Many people take the easy shortcut that the ratio of the two penalties is fixed, and the software invites you to do this. I know of no theory that backs up that procedure.
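To make the two-parameter problem concrete, here is a rough sketch of a joint grid search over (alpha, lambda) by K-fold cross-validation, in plain Python. It is not a recommended procedure, only an illustration of what choosing both at once involves. Assumptions: simulated data, squared-error loss instead of a logistic likelihood, the glmnet-style penalty lam * (alpha * L1 + (1 - alpha)/2 * L2^2), and an arbitrary small grid. All function names are hypothetical.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def elastic_net(X, y, lam, alpha, n_sweeps=30):
    """Coordinate descent for
    (1/2n)||y - Xb||^2 + lam * (alpha * ||b||_1 + (1 - alpha)/2 * ||b||_2^2),
    assuming centred X and y (no intercept)."""
    n, p = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    col_ss = [sum(v * v for v in c) / n for c in cols]
    beta = [0.0] * p
    resid = list(y)
    for _ in range(n_sweeps):
        for j in range(p):
            c = cols[j]
            rho = sum(ci * ri for ci, ri in zip(c, resid)) / n + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam * alpha) / (col_ss[j] + lam * (1 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                resid = [ri - delta * ci for ri, ci in zip(resid, c)]
                beta[j] = new
    return beta

def cv_error(X, y, lam, alpha, folds):
    """Mean squared prediction error over the held-out folds."""
    p = len(X[0])
    total = 0.0
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        # Centre using training-fold means only, then reuse them for prediction
        x_mean = [sum(X[i][j] for i in train) / len(train) for j in range(p)]
        y_mean = sum(y[i] for i in train) / len(train)
        Xc = [[X[i][j] - x_mean[j] for j in range(p)] for i in train]
        yc = [y[i] - y_mean for i in train]
        beta = elastic_net(Xc, yc, lam, alpha)
        for i in test:
            pred = y_mean + sum(b * (X[i][j] - x_mean[j]) for j, b in enumerate(beta))
            total += (y[i] - pred) ** 2
    return total / len(y)

# Simulated data (assumption): 8 correlated features, two with signal.
rng = random.Random(0)
n, p, k = 60, 8, 5
X, y = [], []
for _ in range(n):
    shared = rng.gauss(0, 1)
    row = [0.7 * shared + rng.gauss(0, 1) for _ in range(p)]
    X.append(row)
    y.append(0.5 * row[0] + 0.5 * row[1] + rng.gauss(0, 1))

# One fixed fold assignment, shared by every grid point so the errors are comparable
order = list(range(n))
rng.shuffle(order)
folds = [order[f::k] for f in range(k)]

alphas = [0.2, 0.5, 1.0]
lambdas = [0.01, 0.05, 0.2, 0.5]
surface = {(a, l): cv_error(X, y, l, a, folds) for a in alphas for l in lambdas}
best = min(surface, key=surface.get)

for (a, l), err in sorted(surface.items(), key=lambda kv: kv[1]):
    print(f"alpha={a:<4} lambda={l:<5} CV MSE={err:.4f}")
print("grid minimum at (alpha, lambda) =", best)
```

Printing the whole surface rather than only the minimum is deliberate: it lets you see how many (alpha, lambda) pairs are close to the best one, which is the practical difficulty under discussion.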

I suggest using severe data reduction (e.g., sparse principal components) followed by ordinary modeling of the 1-5 summary scores coming out of that.
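A small sketch of outcome-blind data reduction, in plain Python. Note the substitution: this computes an ordinary first principal component by power iteration, not sparse principal components, simply because it fits in a few lines. The data are simulated and the function names are hypothetical. The point it illustrates is that the outcome Y is never touched while the summary score is built.

```python
import math
import random

def standardise(X):
    """Scale every column to mean 0 and standard deviation 1."""
    n, p = len(X), len(X[0])
    cols = []
    for j in range(p):
        col = [row[j] for row in X]
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        cols.append([(v - mean) / sd for v in col])
    return [[cols[j][i] for j in range(p)] for i in range(n)]

def first_component(Z, n_iter=200):
    """Leading eigenvector of the correlation matrix (power iteration) and its scores."""
    n, p = len(Z), len(Z[0])
    R = [[sum(Z[i][a] * Z[i][b] for i in range(n)) / (n - 1) for b in range(p)]
         for a in range(p)]
    v = [1.0 / math.sqrt(p)] * p  # equal-weight start; fine for positively correlated features
    for _ in range(n_iter):
        w = [sum(R[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(Z[i][j] * v[j] for j in range(p)) for i in range(n)]
    return v, scores

# Simulated predictors only (assumption); no outcome variable appears anywhere here.
rng = random.Random(0)
n, p = 100, 10
X = []
for _ in range(n):
    shared = rng.gauss(0, 1)
    X.append([0.7 * shared + rng.gauss(0, 1) for _ in range(p)])

Z = standardise(X)
loadings, scores = first_component(Z)
score_var = sum(s * s for s in scores) / (n - 1)  # scores have mean 0 by construction
print("loadings:", [round(l, 3) for l in loadings])
print(f"variance of the summary score: {score_var:.2f} (out of a total of {p})")
```

The resulting `scores` column (and a few further components, if wanted) would then enter an ordinary unpenalized model as predictors in place of the original features.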

bobmcpop · November 11, 2018, 6:54pm

Thank you! I had previously used lasso to predict each individual’s probability of being a “case”, and then classify them using a probability threshold derived from a desired specificity.

f2harrell: “For example, if you bootstrap the entire process to check the stability of selected features, you’ll find there is none.”

I came across this exact problem, so I instead surrogate-labeled the much larger original dataset (n>15k, from which the aforementioned 500 were randomly selected) to first reduce the number of variables, before applying them to the 500-observation gold-standard training set. So there were two cycles of feature selection (I am unsure about the validity of this approach).

f2harrell: “Many people take the easy shortcut that the ratio of the two penalties is fixed, and the software invites you to do this. I know of no theory that backs up that procedure.”

The scarcity of literature on this made me question using elastic net too. I was surprised that such a well-established, well-used and frequently taught method had such a glaring “gap” in its procedure, and thought that I was missing something.

f2harrell · November 11, 2018, 9:32pm

This is at odds with what the model does. It gives a direct estimate of a probability. Don’t bring sensitivity, specificity, or thresholds into the procedure.

Such a reduction of the number of variables must use only unsupervised learning. Using Y to drive the variable selection, when this selection is “outside” of the penalized modeling, will result in severe bias.

MarkvandeWiel · September 6, 2022, 9:06am

Late reply, but if you’re still interested: here’s a blog post that explains why indeed it is tricky to tune both hyperparameters of the elastic net simultaneously.

f2harrell · September 6, 2022, 11:08am

Very nice article. Sorry you were voted down on stats.stackexchange.com for pointing to your own article. I’ve never been voted down for doing that.

A few random thoughts:

The 3-d plot showing how hard it is to pick a combination of two penalty parameters is the kind of display we should see more often.
There is some recent work showing sample sizes needed to estimate a single penalty parameter (wish I had a reference). It would be worthwhile to estimate the sample size for adequately estimating two penalty parameters.
A good solution is probably to use ridge regression with Bayesian model projection: Projection predictive variable selection – A review and recommendations for the practicing statistician.
I don’t recommend using lasso.

scboone · September 6, 2022, 3:53pm

Perhaps not quite the one you were looking for, but this one gives a nice illustration of the effect of sample size on the variability of performance estimates from shrinkage methods too:
https://journals.sagepub.com/doi/abs/10.1177/0962280220921415

Why specifically would you not recommend lasso? Because it shrinks some estimates exactly to 0, which can lead to arbitrary variable selection?

f2harrell · September 6, 2022, 4:58pm

Yes; the variables selected by lasso are entirely unreliable unless you have an enormous sample size and the predictors are uncorrelated with each other. More here.

MarkvandeWiel · September 6, 2022, 7:09pm

Thanks for the nice words and suggestion. Indeed (Bayesian) ridge + posterior selection can be a very good alternative. Some use stability selection (by Meinshausen and Bühlmann) as a bootstrap-based technique to improve variable selection with lasso.

As for sample size: increasing sample size will of course help in terms of more stable regression coefficients and variable selection. However, increasing sample size does not overcome the identifiability issue of the two penalty parameters. This seems, unfortunately, rather intrinsic.

MarkvandeWiel · September 6, 2022, 7:42pm

And about the down-voting: apparently the rule is that when referring to one’s own blog, one should be very clear about that. That’s what I did now. Also they prefer to have at least part of the argumentation in the reply. Not unreasonable, so I added that as well.