I have data on 140 subjects (70 cases and 70 controls) and 200 features (biomarkers). I do not have independent data for external validation.
My aim is to build a prediction model and to evaluate its performance internally with the bootstrap, following the approach described in Regression Modeling Strategies by FE Harrell, 2nd edition. The book example on page 259 has only two predictors (age and sex), so standard logistic regression is used.
The very first step of the proposed validation procedure is to fit a logistic model on the original sample and then to compute quantities such as those reported in the table on page 260 of the book (e.g. calibration intercept and slope, Somers' D, etc.). The intercept and calibration slope are obtained by fitting another logistic model to the original sample, with the estimated linear predictor from the previously fitted model (the one including age and sex) as the only covariate. This necessarily gives an intercept of 0 and a slope of 1, because the first model was fitted by maximum likelihood on the same data.
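To make this step concrete, here is a minimal sketch in base R. The data are simulated stand-ins (not the book's dataset), and the names age, sex, y, lp and cal are chosen only for illustration.

```r
# Simulated stand-in data (hypothetical; NOT the dataset used in the book)
set.seed(1)
n   <- 200
age <- rnorm(n, mean = 50, sd = 10)
sex <- rbinom(n, size = 1, prob = 0.5)
y   <- rbinom(n, size = 1, prob = plogis(-5 + 0.1 * age + 0.5 * sex))

# Step 1: fit the standard logistic model on the original sample
fit <- glm(y ~ age + sex, family = binomial)

# Step 2: estimated linear predictor on the same sample
lp <- predict(fit, type = "link")

# Step 3: refit a logistic model with the linear predictor as the only covariate
cal <- glm(y ~ lp, family = binomial)

# Apparent calibration intercept and slope: 0 and 1 up to numerical tolerance
coef(cal)
```

The refit cannot improve on the maximum-likelihood fit it was derived from, which is why the apparent intercept and slope come out as 0 and 1 here.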
Now suppose I repeat this same simple step using elastic-net in place of standard logistic regression. I fit the elastic-net model to the original data (for simplicity I set alpha = 0.5 and choose lambda through ten-fold cross-validation): cv.glmnet(x, y, alpha = 0.5, family = "binomial").
Then I compute the linear predictor and fit a standard logistic model on the original data with the linear predictor as the only covariate, as described above. In this case the intercept is different from 0 and the slope is different from 1.
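A sketch of what I am doing, on simulated data with the same dimensions as mine (140 subjects, 200 features). The simulated x, y and beta are made up for illustration, and the use of lambda.min (rather than lambda.1se) is an assumption made for this example.

```r
library(glmnet)

# Simulated stand-in data (hypothetical; same dimensions as my real data)
set.seed(1)
n <- 140
p <- 200
x    <- matrix(rnorm(n * p), nrow = n, ncol = p)
beta <- c(rep(1, 5), rep(0, p - 5))           # only 5 informative features
y    <- rbinom(n, size = 1, prob = plogis(drop(x %*% beta)))

# Elastic-net with alpha = 0.5; lambda chosen by ten-fold CV (the default nfolds)
cvfit <- cv.glmnet(x, y, alpha = 0.5, family = "binomial")

# Linear predictor on the original data at the CV-selected lambda
# (s = "lambda.min" is an assumption of this sketch)
lp <- as.numeric(predict(cvfit, newx = x, s = "lambda.min", type = "link"))

# Standard (unpenalized) logistic refit on the linear predictor
cal <- glm(y ~ lp, family = binomial)

# Apparent calibration intercept and slope; generally not 0 and 1 here
coef(cal)
```

The difference from the first step is that the elastic-net coefficients are penalized estimates rather than maximum-likelihood estimates, so the unpenalized refit is free to rescale the linear predictor.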
My questions are:
- Am I doing something wrong?
- With elastic-net, should the intercept and slope obtained as described above on the original data also be 0 and 1?
- Can the performance measures reported in the book, especially those evaluating calibration, still be used with variable selection algorithms such as elastic-net?