# Validating elastic-net model with binary outcome

Dear all,

I have data on 140 subjects (70 cases and 70 controls) and 200 features (biomarkers). I do not have independent data for external validation.

My aim is to build a prediction model and to internally evaluate its performance through the bootstrap, following the approach described in *Regression Modeling Strategies* by F. E. Harrell (2nd edition). In the book's example on page 259 there are only two predictors (age and sex), so standard logistic regression is used.

The very first step of the proposed validation procedure is to fit a logistic model on the original sample and then to compute the quantities reported in the table on page 260 of the book (e.g. intercept, calibration slope, Somers' D, etc.). The intercept and calibration slope are obtained by fitting a second logistic model to the original sample with a single covariate: the estimated linear predictor from the previously fitted age-and-sex model. By construction, this refit must give an intercept of 0 and a slope of 1.
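That identity is easy to verify numerically. Below is a minimal pure-numpy sketch (simulated age/sex data standing in for the book's dataset, and a hand-rolled Newton-Raphson fit standing in for a logistic regression routine): fit the model by maximum likelihood, then refit on its own linear predictor:

```python
# Apparent-calibration check for an ordinary (unpenalized) logistic fit.
# The data below are simulated stand-ins, not the book's dataset.
import numpy as np

def fit_logistic(X, y, n_iter=50):
    """Maximum-likelihood logistic regression via Newton-Raphson.
    X must already contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - p)                    # score vector
        H = X.T @ (X * (p * (1 - p))[:, None])  # observed information
        beta += np.linalg.solve(H, grad)
    return beta

rng = np.random.default_rng(0)
n = 140
age = rng.normal(50, 10, n)
sex = rng.integers(0, 2, n).astype(float)
y = rng.binomial(1, 1 / (1 + np.exp(-(-0.04 * (age - 50) + 0.8 * sex))))

X = np.column_stack([np.ones(n), age, sex])
beta = fit_logistic(X, y)
lp = X @ beta                                   # estimated linear predictor

# Refit a logistic model on the linear predictor alone (apparent calibration):
gamma = fit_logistic(np.column_stack([np.ones(n), lp]), y)
print(gamma)  # intercept ~ 0, slope ~ 1
```

The result is exact rather than approximate: the linear predictor is a linear combination of the original design columns, so the score equations of the first fit already make (0, 1) the maximum-likelihood solution of the refit.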

Now suppose we repeat this same simple step using elastic net in place of standard logistic regression. I fit the elastic-net model to the original data (for simplicity I set alpha = 0.5 and choose lambda through ten-fold cross-validation): `cv.glmnet(x, y, alpha = 0.5, family = "binomial")`.
Then I compute the linear predictor and, as described above, fit a standard logistic model on the original data with the linear predictor as the only covariate. In this case the intercept is different from 0 and the slope is different from 1.
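The same discrepancy is reproducible outside glmnet. In this pure-numpy sketch (hypothetical simulated data, with a ridge penalty standing in for elastic net), the penalized linear predictor is deliberately shrunken, and the unpenalized refit expands it again, so the slope lands above 1:

```python
# Penalized fit -> shrunken ("too narrow") linear predictor -> refit slope > 1.
# Simulated data; ridge stands in for elastic net.
import numpy as np

def fit_logistic(X, y, lam=0.0, n_iter=50):
    """Logistic regression by Newton-Raphson, with an optional ridge penalty
    lam on every coefficient except the first (intercept) column of X."""
    pen = np.full(X.shape[1], lam)
    pen[0] = 0.0
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - p) - pen * beta       # penalized score
        H = X.T @ (X * (p * (1 - p))[:, None]) + np.diag(pen)
        beta += np.linalg.solve(H, grad)
    return beta

rng = np.random.default_rng(0)
n, p = 140, 20
Z = rng.normal(size=(n, p))
y = rng.binomial(1, 1 / (1 + np.exp(-(Z[:, 0] - Z[:, 1]))))
X = np.column_stack([np.ones(n), Z])

beta_pen = fit_logistic(X, y, lam=10.0)         # heavily shrunken fit
lp = X @ beta_pen

# Unpenalized logistic refit of y on the shrunken linear predictor:
gamma = fit_logistic(np.column_stack([np.ones(n), lp]), y)
print(gamma)  # slope above 1 on the *same* data
```

In fact the slope here must exceed 1: at (0, 1) the penalized score equations leave a strictly positive score in the slope direction, so the unpenalized maximum lies at a slope greater than 1 whenever any penalized coefficient is nonzero.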

My questions are:

1. Am I doing something wrong?
2. With elastic net, should the intercept and slope obtained as described above on the original data also be 0 and 1?
3. Can the performance measures reported in the book, especially those evaluating calibration, still be used with penalized/variable-selection methods such as elastic net?

Instead of putting the linear predictor from elastic net into an ordinary logistic regression, I think it may be necessary to use 100 repeats of 10-fold cross-validation of the whole procedure (not the usual bootstrap, because the bootstrap is too optimistic in the extreme case where N < p). This will also expose the instability of elastic-net feature selection.
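That resampling skeleton might look like the following in Python with scikit-learn (hypothetical simulated data; an L1-penalized logistic model stands in for `cv.glmnet`, and only 10 repeats are run here to keep it quick). Counting how often each feature survives across folds makes the selection instability visible:

```python
# Repeated stratified K-fold CV with per-fold refitting; selection frequency
# across folds exposes the instability of penalized feature selection.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

rng = np.random.default_rng(1)
n, p = 140, 50
X = rng.normal(size=(n, p))                  # hypothetical biomarker matrix
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] - X[:, 1]))))

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)
selected = np.zeros(p)
for train, _test in cv.split(X, y):
    fit = LogisticRegression(penalty="l1", C=0.5, solver="liblinear")
    fit.fit(X[train], y[train])
    selected += (fit.coef_.ravel() != 0)     # which features survived this fold

freq = selected / cv.get_n_splits()          # per-feature selection frequency
print(np.round(np.sort(freq)[::-1][:5], 2)) # top 5 most-selected features
```

Any performance measure (prediction error, AUC, calibration slope, …) would be computed on the held-out fold inside the same loop, so that feature selection and lambda tuning are repeated honestly on every training split.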

See if the following explains some of what you are seeing. When you fit an ordinary logistic model, the linear predictor will be “too wide” because of overfitting, and the slope of the linear predictor on a new dataset will be less than 1 (the calibration shrinkage coefficient). When you use penalization (ridge, lasso, elastic net, …) there is purposeful underfitting, and the slope on a new sample will be greater than 1.
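The overfitting half of that statement can be checked directly with a small simulation (pure numpy, hypothetical data): fit an unpenalized logistic model with many predictors on one sample, then estimate the calibration slope of its linear predictor on a large fresh sample:

```python
# Overfit unpenalized fit -> "too wide" linear predictor -> slope < 1 on new data.
# Simulated data; Newton-Raphson logistic fit, lightly stabilized.
import numpy as np

def fit_logistic(X, y, n_iter=25):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
        grad = X.T @ (y - p)
        H = X.T @ (X * (p * (1 - p))[:, None]) + 1e-6 * np.eye(X.shape[1])
        beta += np.linalg.solve(H, grad)
    return beta

rng = np.random.default_rng(2)
n, p = 1140, 20
Z = rng.normal(size=(n, p))
y = rng.binomial(1, 1 / (1 + np.exp(-(Z[:, 0] - Z[:, 1]))))
X = np.column_stack([np.ones(n), Z])

train, test = slice(0, 140), slice(140, None)   # 140 to fit, 1000 held out
beta = fit_logistic(X[train], y[train])         # overfit: 21 parameters, n=140
lp_test = X[test] @ beta

# Calibration refit on the fresh sample:
gamma = fit_logistic(np.column_stack([np.ones(1000), lp_test]), y[test])
print(gamma)  # calibration slope below 1: the fitted model was too wide
```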

Setting the bootstrap aside for the moment, I have fitted 3 models and validated them through stratified 10-fold cross-validation (implemented from scratch): lasso, elastic net with alpha = 0.5, and elastic net with alpha = 0.7.
As a reminder, my dataset has 131 subjects, so each test fold contains between 12 and 14 subjects.
Here are my results:
| | lasso | elastic (α = 0.5) | elastic2 (α = 0.7) |
|---|---|---|---|
| prederror | 0.2719780 | 0.2510073 | 0.2730769 |
| auc | 0.7926304 | 0.7858844 | 0.7933107 |
| intercept | 1.8525283 | 3.2858013 | 2.0691497 |
| slope | 19.6097969 | 35.5361460 | 23.6887184 |
| scaledbrier | 0.1791736 | 0.1676837 | 0.1584228 |
(here "intercept" stands for calibration-in-the-large, and "slope" for the calibration slope)

Note: all performance measures in the table above are means of the 10 fold-wise values obtained through cross-validation.
For the ninth test fold I obtain awful estimates of the intercept and slope.
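The instability with such small folds is easy to reproduce. In the following pure-numpy sketch (hypothetical data), the calibration slope is re-estimated on simulated test folds of 13 subjects whose probabilities are perfectly calibrated by construction, and the fold-wise estimates are still wildly dispersed; near-separation in a tiny fold can send a slope into the hundreds:

```python
# Fold-wise calibration slope with ~13 subjects per fold, under perfect
# calibration (true slope = 1). Simulated data; stabilized Newton-Raphson.
import numpy as np

def fit_logistic(X, y, n_iter=15):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
        grad = X.T @ (y - p)
        H = X.T @ (X * (p * (1 - p))[:, None]) + 1e-6 * np.eye(X.shape[1])
        beta += np.linalg.solve(H, grad)
    return beta

rng = np.random.default_rng(3)
slopes = []
for _ in range(100):                     # 100 pseudo test folds of 13 subjects
    lp = rng.normal(0.0, 1.0, 13)        # perfectly calibrated linear predictor
    y = rng.binomial(1, 1 / (1 + np.exp(-lp)))
    gamma = fit_logistic(np.column_stack([np.ones(13), lp]), y)
    slopes.append(gamma[1])

slopes = np.array(slopes)
print("mean:", round(slopes.mean(), 2), "median:", round(np.median(slopes), 2))
```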

If I instead take the median across folds for the intercept and slope, the results are:

| | lasso | elastic (α = 0.5) | elastic2 (α = 0.7) |
|---|---|---|---|
| intercept | 0.04097432 | 0.1758518 | 0.09343803 |
| slope | 1.40932309 | 1.5151586 | 1.58604863 |

Are these results expected given the small number of subjects in the test folds, or is something patently wrong?

Thanks,

Elisabetta