Appropriate to do lasso, then standard glm using variables with non-zero coef. from lasso?

Hanis_Kadir · July 2, 2024, 11:18am

My dataset comprises 1300 patient records, featuring 25 variables (expanded to 28 after accommodating categorical variables with multiple levels) carefully selected for their clinical relevance and significance. The outcome variable is dichotomous. To identify the most informative variables and reduce dimensionality, I intend to apply Lasso Logistic Regression. Recognizing that penalized regression does not provide standard errors due to biased estimates, I thought of a two-stage approach:

1. Perform Lasso Logistic Regression to identify variables with non-zero coefficients.
2. Utilize the selected variables to fit a standard Logistic Regression model, and obtain the estimated odds ratios and confidence intervals.
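
To make the two stages concrete, here is a minimal sketch in plain Python on simulated data. Everything in it is an assumption for illustration rather than part of the real analysis: the data (200 rows, 6 standardized predictors, two true signals), the fixed penalty `lam=0.05` (in practice it would be tuned, e.g. by cross-validation), and the small proximal-gradient fitter `fit_logistic`, which stands in for a proper package.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lam=0.0, steps=300, lr=1.0):
    """Logistic regression by proximal gradient descent.

    lam is the L1 penalty on the slopes (the intercept is not penalized);
    lam=0 gives the ordinary unpenalized maximum-likelihood fit.
    Returns (intercept, list_of_slopes).
    """
    n, p = len(X), len(X[0])
    b0, b = 0.0, [0.0] * p
    for _ in range(steps):
        g0, g = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            r = sigmoid(b0 + sum(bj * xj for bj, xj in zip(b, xi))) - yi
            g0 += r
            for j in range(p):
                g[j] += r * xi[j]
        b0 -= lr * g0 / n
        for j in range(p):
            v = b[j] - lr * g[j] / n
            # Soft-thresholding is the proximal step for the L1 penalty
            b[j] = math.copysign(max(abs(v) - lr * lam, 0.0), v)
    return b0, b


# Simulated data (hypothetical): only x0 and x1 carry signal
random.seed(1)
n, p = 200, 6
true_beta = [1.5, -1.0, 0.0, 0.0, 0.0, 0.0]
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [
    1 if random.random() < sigmoid(sum(t * x for t, x in zip(true_beta, row))) else 0
    for row in X
]

# Stage 1: lasso, keep the variables with non-zero coefficients
_, lasso_coef = fit_logistic(X, y, lam=0.05)
selected = [j for j, c in enumerate(lasso_coef) if c != 0.0]

# Stage 2: unpenalized refit on the selected variables only
X_sel = [[row[j] for j in selected] for row in X]
_, refit_slopes = fit_logistic(X_sel, y, lam=0.0)
refit_coef = dict(zip(selected, refit_slopes))

for j in selected:
    print(f"x{j}: lasso {lasso_coef[j]:+.3f}   refit {refit_coef[j]:+.3f}")
```

Comparing the two columns shows how much of the stage 1 shrinkage the stage 2 refit removes. Any odds ratios and confidence intervals computed from stage 2 take no account of the selection done in stage 1.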

Is this appropriate?

f2harrell · July 2, 2024, 11:34am

Though you will get disagreement from statisticians touting “relaxed lasso”, this in my view is highly problematic. The reason to use lasso is to get shrinkage that penalizes you for not knowing which variables to model. To ignore that shrinkage creates a disaster when you assess the absolute predictive accuracy of the model with a nonparametric calibration curve.

This kind of problem will be best dealt with using Bayesian regression with carefully chosen priors using prior knowledge while incorporating shrinkage. The great advantage of Bayes is that you get a plain old posterior distribution when you’re finished. There is no interpretation or inference problem caused by shrinkage as there is in frequentist statistics.

Also beware of the accuracy of the variables selected by lasso: the selected set can change substantially from one resample of the same data to the next. It’s often better to use a data reduction technique followed by ordinary regression on the reduced dimensions.
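
One way to check how trustworthy the selection is: refit the lasso on bootstrap resamples and count how often each variable is kept. The sketch below does this in plain Python on simulated data. The data, the fixed penalty `lam=0.05`, the 20 resamples, and the helper `lasso_selected` are all assumptions for illustration. A real analysis would use an established package, re-tune the penalty within each resample, and use many more resamples.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def lasso_selected(X, y, lam, steps=150, lr=1.0):
    """Indices with non-zero slopes in an L1-penalized logistic fit.

    Fitted by proximal gradient descent; the intercept is not penalized.
    """
    n, p = len(X), len(X[0])
    b0, b = 0.0, [0.0] * p
    for _ in range(steps):
        g0, g = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            r = sigmoid(b0 + sum(bj * xj for bj, xj in zip(b, xi))) - yi
            g0 += r
            for j in range(p):
                g[j] += r * xi[j]
        b0 -= lr * g0 / n
        for j in range(p):
            v = b[j] - lr * g[j] / n
            # Soft-thresholding is the proximal step for the L1 penalty
            b[j] = math.copysign(max(abs(v) - lr * lam, 0.0), v)
    return [j for j in range(p) if b[j] != 0.0]


# Simulated data (hypothetical): one strong signal, one weak, three noise
random.seed(2)
n, p, n_boot = 150, 5, 20
true_beta = [1.5, 0.3, 0.0, 0.0, 0.0]
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [
    1 if random.random() < sigmoid(sum(t * x for t, x in zip(true_beta, row))) else 0
    for row in X
]

# Refit the lasso on each bootstrap resample and tally the selections
counts = [0] * p
for _ in range(n_boot):
    idx = [random.randrange(n) for _ in range(n)]
    Xb = [X[i] for i in idx]
    yb = [y[i] for i in idx]
    for j in lasso_selected(Xb, yb, lam=0.05):
        counts[j] += 1

freq = [c / n_boot for c in counts]
for j, f in enumerate(freq):
    print(f"x{j}: selected in {f:.0%} of resamples (true beta {true_beta[j]})")
```

Selection frequencies well away from 0% and 100% mean the data cannot say whether that variable belongs in the model, so a single lasso fit should not be read as having found "the" right variables.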