I have a fairly large number of independent variables (IVs) and a binary response. I want to reduce the number of variables, mainly because many of the IVs are highly collinear. In my field I’ve seen people use regular LASSO for this, but from what I can read, regular LASSO does not handle multicollinearity well. I’ve since read about sparse LASSO, for example from this package:
Out of the box it seems to perform well, but I have noticed that when I use cross-validation (CV) to choose the optimal parameters, the result differs slightly depending on the seed I use. Sometimes a certain number of variables is selected, and sometimes one or two extra are added, again depending on the seed.
So how do I “adjust” for this? Should I run, for example, 100 different random seeds, see which number of selected variables comes up most often, and then constrain the final CV run to allow no more than that number of non-zero variables (this can be done in this package)?
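
To make the idea concrete, here is a rough sketch of the repeated-seed procedure I have in mind. It is not the package mentioned above: as a stand-in it uses a plain L1-penalized logistic regression fitted by proximal gradient in NumPy. The data are synthetic, and the helper names (`fit_l1_logistic`, `cv_select`), the penalty grid, the 20 seeds and the 0.8 frequency cut-off are all arbitrary assumptions chosen for illustration.

```python
import numpy as np

def fit_l1_logistic(X, y, lam, n_iter=200):
    """L1-penalized logistic regression via proximal gradient (illustrative stand-in)."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    # Step size = 1 / (upper bound on the Lipschitz constant of the mean logistic loss).
    step = 4.0 * n / (np.linalg.norm(X, 2) ** 2 + n)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-np.clip(X @ w + b, -30, 30)))
        w = w - step * (X.T @ (prob - y) / n)
        # Soft-thresholding sets small coefficients exactly to zero.
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)
        b -= step * np.mean(prob - y)
    return w, b

def cv_select(X, y, lambdas, seed, n_folds=5):
    """Pick the penalty by K-fold CV (folds depend on the seed); return selected mask."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % n_folds  # random, balanced fold labels
    losses = []
    for lam in lambdas:
        loss = 0.0
        for k in range(n_folds):
            train, test = folds != k, folds == k
            w, b = fit_l1_logistic(X[train], y[train], lam)
            prob = 1.0 / (1.0 + np.exp(-np.clip(X[test] @ w + b, -30, 30)))
            prob = np.clip(prob, 1e-9, 1 - 1e-9)
            # Held-out negative log-likelihood.
            loss -= np.sum(y[test] * np.log(prob) + (1 - y[test]) * np.log(1 - prob))
        losses.append(loss)
    best_lam = lambdas[int(np.argmin(losses))]
    w, _ = fit_l1_logistic(X, y, best_lam)  # refit on all data
    return w != 0

# Synthetic data (assumption): 3 groups of 4 strongly correlated predictors.
rng = np.random.default_rng(0)
n, p = 200, 12
latent = rng.normal(size=(n, 3))
X = np.repeat(latent, 4, axis=1) + 0.5 * rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
logit = 1.5 * X[:, 0] - 1.5 * X[:, 4]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

lambdas = np.array([0.2, 0.1, 0.05, 0.02, 0.01])  # arbitrary penalty grid
n_seeds = 20                                      # 100 in the question; fewer here for speed

# One row per seed, one column per variable: True where the variable was kept.
selected = np.array([cv_select(X, y, lambdas, seed) for seed in range(n_seeds)])

counts = selected.sum(axis=1)                  # number of variables kept per seed
modal_count = int(np.bincount(counts).argmax())  # most common model size
frequency = selected.mean(axis=0)              # how often each variable was kept
stable = np.flatnonzero(frequency >= 0.8)      # 0.8 is an arbitrary cut-off

print("variables kept per seed:", counts)
print("most common model size:", modal_count)
print("selection frequency:", np.round(frequency, 2))
print("variables kept in >= 80% of runs:", stable)
```

`modal_count` is the number I would use as the cap on non-zero variables in the final CV run. The per-variable `frequency` is an alternative way to read the same runs: instead of capping the model size, keep only the variables that are selected in most of the repetitions.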