Publications explaining why automated variable selection procedures produce biased results

I am sometimes consulted by my peers (MDs) regarding statistical and methodological advice for their research projects. I always suggest they consult a real statistician. For logistical reasons, that advice is not very helpful. I often end up helping in some way. I’m in an unfortunate transition phase in my applied statistical skills, in which I have read enough to know that some approaches consistently produce biased results, but I am unable to provide a convincing argument against using them.

Could the community provide links to papers which address common statistical errors and systematically explain why they produce biased results? This would be very helpful in general. Specifically, I am repeatedly confronted with researchers who would like to use automated variable selection procedures and have trouble convincing them that it is a bad idea.


Since this is not specifically health related and is a general statistical issue, I suggest you post this question on You’ll see that this issue has been extensively covered on that site. Also look at the stepwise faq. The jury is no longer out on this one.


I very much sympathize with Elias. Such analyses continue to be very commonly published in the medical literature. Given newly published examples from respected colleagues, it is indeed difficult to provide convincing arguments otherwise.

Arguments invoking technical concerns, i.e. biased R-squared values, are particularly ineffective.

Has anyone in the group found intuitive arguments or apt analogies that are more convincing to non-technical audiences?

For example, I recently attended a journal club where the following paper was discussed:

Shenoi RP, Allahabadi S, Rubalcava DM, Camp EA. The Pediatric Submersion Score Predicts Children at Low Risk for Injury Following Submersions. Acad Emerg Med [Internet] 2017;24(12):1491–500. Available from:

Shared full-text:

An excerpt from the methods section follows:

Variables were considered for further analysis if the p-value was ≤ 0.25, the percentage missing were < 30% and if the information was readily available to the physician. Selected variables were then inputted into a binary regression model, using a backward-step approach; only those factors that had a p-value < 0.05 were considered in generating the submersion score. To assess how well the predictors fit the model, the Hosmer-Lemeshow, chisquare, and p-value were calculated. Additionally, to further reduce overfitting the model, a penalized regression (LASSO) was also conducted and compared to the stepwise analysis. Final predictors for the submersion score were based on both model types.


People seem to be persuaded by guidelines more than anything else (maybe because they distill much of the stats literature into concise guidance), so you could quote the EMA guideline on adjustment for baseline covariates: “Methods that select covariates by choosing those that are most strongly associated with the primary outcome (often called ‘variable selection methods’) should be avoided.” That’s pretty unequivocal, although they are discussing RCTs.

Edit: in the paper you reference, they are developing a risk score, which is quite different.


How can one methods section make more mistakes than that one? Free of statistical principles.


That’s unfortunately what I am dealing with, minus performing LASSO and comparing the two methodologies. Part of the reason I am not making much headway using statistical arguments is that the population I am collaborating with does not have much of a statistical background. That is what triggered this post, i.e. the search for practical publications that use case studies to show why certain common statistical errors lead to the wrong result. The argument would use domain-specific intuition to explain statistical concepts.

Simple simulations as in my book and course notes are perhaps best. Some links are here.

I know this is an old topic, but I reviewed a paper with this issue today and compiled a few references in my comments to the authors, some of which are relevant here.

Smith G. Step away from stepwise. Journal of Big Data 2018.

Heinze G, Dunkler D. Five myths about variable selection. Transplant International 2017; 30(1): 6-10.

van Smeden M, Moons KG, de Groot JA, Collins GS, Altman DG, Eijkemans MJ, Reitsma JB. Sample size for binary logistic prediction models: beyond events per variable criteria. Stat Methods Med Res 2018 (epub).

Riley RD, Snell H, Ensor J, Burke DL, Harrell FE, Moons KG, Collins GS. Minimum sample size for developing a multivariable prediction model: PART II – binary and time-to-event outcomes. Stat Med 2019; 38(7): 1276-1296.


The single passage that I would choose is something to the effect of “Standard statistical tests assume a single test of a pre-specified model and are not appropriate when a sequence of steps is used to choose the explanatory variables. The standard errors of the coefficient estimates are underestimated, which makes the confidence intervals too narrow, the t statistics too high, and the p-values too low – which leads to overfitting and creates a false confidence in the final model.” (from the Smith article) followed by something like “Models built using stepwise selection (especially in small datasets) have an increased risk of capitalizing on chance features of the data, which will often fail when testing on a different dataset than the one used to build the model.”
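
For a non-technical audience, a small simulation may make the quoted passage concrete. The sketch below is only an illustration under stated assumptions, not a reproduction of any analysis from the cited papers: it uses pure-noise data (every true coefficient is zero), a continuous outcome with ordinary least squares rather than logistic regression, univariable p-value screening as a simplified stand-in for stepwise selection, and a normal approximation for the p-values. All function names, sample sizes, and thresholds are hypothetical choices made for the example.

```python
import math
import numpy as np


def ols_fit(X, y):
    """Least-squares fit with an intercept; returns coefficients, naive SEs, and R^2."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - A.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, se, r2


def two_sided_p(z):
    """Two-sided p-value using a normal approximation to the t statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def simulate_once(rng, n=100, n_candidates=30, alpha=0.05):
    # Outcome is generated independently of every candidate predictor.
    X = rng.standard_normal((n, n_candidates))
    y = rng.standard_normal(n)

    # Step 1: keep predictors that pass a univariable p-value screen.
    keep = []
    for j in range(n_candidates):
        beta, se, _ = ols_fit(X[:, [j]], y)
        if two_sided_p(beta[1] / se[1]) < alpha:
            keep.append(j)
    if not keep:
        return None  # nothing selected in this replicate

    # Step 2: refit on the selected predictors and read off the naive p-values,
    # as if the model had been specified in advance.
    beta, se, r2_train = ols_fit(X[:, keep], y)
    n_significant = sum(two_sided_p(b / s) < alpha for b, s in zip(beta[1:], se[1:]))

    # Step 3: apply the fitted model to fresh data from the same (null) process.
    X_new = rng.standard_normal((n, n_candidates))
    y_new = rng.standard_normal(n)
    pred = beta[0] + X_new[:, keep] @ beta[1:]
    r2_test = 1.0 - np.sum((y_new - pred) ** 2) / np.sum((y_new - y_new.mean()) ** 2)
    return len(keep), n_significant, r2_train, r2_test


rng = np.random.default_rng(2024)
runs = [simulate_once(rng) for _ in range(300)]
runs = [r for r in runs if r is not None]

n_selected = sum(r[0] for r in runs)
share_significant = sum(r[1] for r in runs) / n_selected
mean_r2_train = float(np.mean([r[2] for r in runs]))
mean_r2_test = float(np.mean([r[3] for r in runs]))

print(f"replicates with at least one selected noise variable: {len(runs)} of 300")
print(f"share of selected noise variables 'significant' after refit: {share_significant:.2f}")
print(f"mean apparent R^2 (development data): {mean_r2_train:.3f}")
print(f"mean R^2 on new data: {mean_r2_test:.3f}")
```

Because no predictor has any real relationship with the outcome, any variable that survives selection does so by chance, yet the refitted model reports it as significant and shows a positive apparent R-squared that does not hold up on new data. Swapping in a clinician's own sample size and number of candidate variables may be more persuasive than the defaults used here.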