Publications explaining why automated variable selection procedures produce biased results

Elias_Eythorsson · October 6, 2018, 6:54pm

My peers (MDs) sometimes ask me for statistical and methodological advice on their research projects. I always suggest they consult a real statistician. For logistical reasons, that advice is not very helpful, and I often end up helping in some way. I’m in an unfortunate transition phase in my applied statistical skills: I have read enough to know that some approaches consistently produce biased results, but I am unable to provide a convincing argument against using them.

Could the datamethods.org community provide links to papers which address common statistical errors and systematically explain why they produce biased results? This would be very helpful in general. Specifically, I am repeatedly confronted with researchers who would like to use automated variable selection procedures and have trouble convincing them that it is a bad idea.

f2harrell · October 6, 2018, 8:56pm

Since this is not specifically health related and is a general statistical issue, I suggest you post this question on stats.stackexchange.com. You’ll see that this issue has been extensively covered on that site. Also look at the stepwise faq. The jury is no longer out on this one.

Dale_Steele · October 7, 2018, 12:29pm

I very much sympathize with Elias. Such analyses continue to be very commonly published in the medical literature. Given newly published examples from respected colleagues, it is indeed difficult to provide convincing arguments otherwise.

Arguments invoking technical concerns, i.e. biased R-squared values, are particularly ineffective.

Has anyone in the group found intuitive arguments or apt analogies that are more convincing to non-technical audiences?

For example, I recently attended a journal club where the following paper was discussed:

Shenoi RP, Allahabadi S, Rubalcava DM, Camp EA. The Pediatric Submersion Score Predicts Children at Low Risk for Injury Following Submersions. Acad Emerg Med [Internet] 2017;24(12):1491–500. Available from: http://dx.doi.org/10.1111/acem.13278

Shared full-text:
https://paperpile.com/shared/Kyf8O4

Excerpt from the methods section follows:

Variables were considered for further analysis if the p-value was ≤ 0.25, the percentage missing were < 30% and if the information was readily available to the physician. Selected variables were then inputted into a binary regression model, using a backward-step approach; only those factors that had a p-value < 0.05 were considered in generating the submersion score. To assess how well the predictors fit the model, the Hosmer-Lemeshow, chisquare, and p-value were calculated. Additionally, to further reduce overfitting the model, a penalized regression (LASSO) was also conducted and compared to the stepwise analysis. Final predictors for the submersion score were based on both model types.

PaulBrownPhD · October 7, 2018, 9:25pm

People seem to be persuaded by guidelines more than anything else (maybe because they distill much of the stats literature into concise guidance), so you could quote the EMA guideline on adjustment for baseline covariates: “Methods that select covariates by choosing those that are most strongly associated with the primary outcome (often called ‘variable selection methods’) should be avoided.” That’s pretty unequivocal, although they are discussing RCTs.

Edit: the paper you reference is developing a risk score, which is quite different.

f2harrell · October 8, 2018, 5:59pm

How can one methods section make more mistakes than that one? Free of statistical principles.

Elias_Eythorsson · October 8, 2018, 6:20pm

That’s unfortunately what I am dealing with, minus performing LASSO and comparing the two methodologies. Part of the reason I am not making much headway using statistical arguments is that the population I am collaborating with does not have much of a statistical background. That is what triggered this post, i.e. the search for practical publications that use case studies to show why certain common statistical errors lead to wrong results. The argument would use domain-specific intuition to explain statistical concepts.

f2harrell · October 8, 2018, 7:40pm

Simple simulations as in my book and course notes are perhaps best. Some links are here.
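A minimal sketch of that kind of simulation follows. It is not taken from the book or course notes; the function names and the settings (n = 100 subjects, k = 20 candidate predictors) are illustrative assumptions. It screens pure-noise predictors one at a time at p < 0.05, which is simpler than full backward elimination but shows the same core problem: with enough candidates, some will look "significant" even though none is related to the outcome. The p-values use Fisher's z approximation so that only the standard library is needed.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def p_value(r, n):
    """Two-sided p-value for H0: rho = 0 (Fisher z approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_false_hits(n, k, alpha, rng):
    """Screen k noise predictors against a noise outcome; count p < alpha."""
    y = [rng.gauss(0, 1) for _ in range(n)]
    hits = 0
    for _ in range(k):
        x = [rng.gauss(0, 1) for _ in range(n)]  # unrelated to y by design
        if p_value(pearson_r(x, y), n) < alpha:
            hits += 1
    return hits


def simulate(n=100, k=20, alpha=0.05, reps=300, seed=1):
    """Return (share of studies with >= 1 false hit, mean false hits)."""
    rng = random.Random(seed)
    hits = [count_false_hits(n, k, alpha, rng) for _ in range(reps)]
    return sum(h > 0 for h in hits) / reps, sum(hits) / reps


if __name__ == "__main__":
    share_any, mean_hits = simulate()
    print(f"studies with at least one 'significant' noise variable: {share_any:.0%}")
    print(f"mean number of 'significant' noise variables per study: {mean_hits:.2f}")
```

With 20 independent tests at the 0.05 level, the chance of at least one false hit is about 1 − 0.95^20 ≈ 64%, so well over half of the simulated "studies" should report a predictor that does not exist.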

ADAlthousePhD · April 24, 2019, 7:09pm

I know this is an old topic, but I reviewed a paper with this issue today and compiled a few references in my comments to the authors, some of which are relevant here.

Smith G. Step away from stepwise. Journal of Big Data 2018.

Heinze G, Dunkler D. Five myths about variable selection. Transplant International 2017; 30(1): 6-10.

Multivariable regression models are often used in transplantation research to identify or to confirm baseline variables which have an independent association, causally or only evidenced by statistical correlation, with transplantation outcome. Although sound theory is lacking, variable selection is a popular statistical method which seemingly reduces the complexity of such models. However, in fact, variable selection often complicates analysis as it invalidates common tools of statistical inference such as P-values and confidence intervals. This is a particular problem in transplantation research where sample sizes are often only small to moderate. Furthermore, variable selection requires computer-intensive stability investigations and a particularly cautious interpretation of results. We discuss how five common misconceptions often lead to inappropriate application of variable selection. We emphasize that variable selection and all problems related with it can often be avoided by the use of expert knowledge.

van Smeden M, Moons KG, de Groot JA, Collins GS, Altman DG, Eijkemans MJ, Reitsma JB. Sample size for binary logistic prediction models: beyond events per variable criteria. Stat Methods Med Res 2018 (epub).

Binary logistic regression is one of the most frequently applied statistical approaches for developing clinical prediction models. Developers of such models often rely on an Events Per Variable criterion (EPV), notably EPV ≥10, to determine the minimal sample size required and the maximum number of candidate predictors that can be examined. We present an extensive simulation study in which we studied the influence of EPV, events fraction, number of candidate predictors, the correlations and distributions of candidate predictor variables, area under the ROC curve, and predictor effects on out-of-sample predictive performance of prediction models. The out-of-sample performance (calibration, discrimination and probability prediction error) of developed prediction models was studied before and after regression shrinkage and variable selection. The results indicate that EPV does not have a strong relation with metrics of predictive performance, and is not an appropriate criterion for (binary) prediction model development studies. We show that out-of-sample predictive performance can better be approximated by considering the number of predictors, the total sample size and the events fraction. We propose that the development of new sample size criteria for prediction models should be based on these three parameters, and provide suggestions for improving sample size determination.

Riley RD, Snell KI, Ensor J, Burke DL, Harrell FE, Moons KG, Collins GS. Minimum sample size for developing a multivariable prediction model: PART II – binary and time-to-event outcomes. Stat Med 2019; 38(7): 1276-1296.

When designing a study to develop a new prediction model with binary or time-to-event outcomes, researchers should ensure their sample size is adequate in terms of the number of participants (n) and outcome events (E) relative to the number of predictor parameters (p) considered for inclusion. We propose that the minimum values of n and E (and subsequently the minimum number of events per predictor parameter, EPP) should be calculated to meet the following three criteria: (i) small optimism in predictor effect estimates as defined by a global shrinkage factor of ≥0.9, (ii) small absolute difference of ≤ 0.05 in the model's apparent and adjusted Nagelkerke's R², and (iii) precise estimation of the overall risk in the population. Criteria (i) and (ii) aim to reduce overfitting conditional on a chosen p, and require prespecification of the model's anticipated Cox-Snell R², which we show can be obtained from previous studies. The values of n and E that meet all three criteria provides the minimum sample size required for model development. Upon application of our approach, a new diagnostic model for Chagas disease requires an EPP of at least 4.8 and a new prognostic model for recurrent venous thromboembolism requires an EPP of at least 23. This reinforces why rules of thumb (eg, 10 EPP) should be avoided. Researchers might additionally ensure the sample size gives precise estimates of key predictor effects; this is especially important when key categorical predictors have few events in some categories, as this may substantially increase the numbers required.
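To make criterion (i) concrete, here is a rough sketch. It assumes the closed-form expression for that criterion is n = p / ((S − 1) · ln(1 − R²_CS / S)), where S is the target shrinkage factor and R²_CS the anticipated Cox-Snell R²; that is my reading of the paper, so check it against the original before relying on it. The function names and the input values are hypothetical, and criteria (ii) and (iii) are not covered.

```python
import math


def min_n_for_shrinkage(p, r2_cs, shrinkage=0.9):
    """Minimum n for criterion (i): expected global shrinkage >= `shrinkage`.

    p      : number of candidate predictor parameters
    r2_cs  : anticipated Cox-Snell R-squared of the model
    Assumed formula: n = p / ((S - 1) * ln(1 - r2_cs / S)).
    """
    if not 0 < shrinkage < 1:
        raise ValueError("shrinkage must be between 0 and 1")
    if not 0 < r2_cs < shrinkage:
        raise ValueError("r2_cs must be between 0 and the shrinkage factor")
    n = p / ((shrinkage - 1) * math.log(1 - r2_cs / shrinkage))
    return math.ceil(n)


def events_per_parameter(n, outcome_fraction, p):
    """Events per predictor parameter implied by n and the outcome fraction."""
    return n * outcome_fraction / p


if __name__ == "__main__":
    # Hypothetical inputs, chosen only for illustration.
    p, r2_cs, outcome_fraction = 10, 0.10, 0.20
    n = min_n_for_shrinkage(p, r2_cs)
    print(f"minimum n for criterion (i): {n}")
    print(f"implied EPP: {events_per_parameter(n, outcome_fraction, p):.1f}")
```

The required n rises quickly as the anticipated R² falls, which is one way to see why a fixed rule such as 10 events per parameter cannot be right across settings.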

ADAlthousePhD · April 24, 2019, 7:11pm

The single passage that I would choose is something to the effect of “Standard statistical tests assume a single test of a pre-specified model and are not appropriate when a sequence of steps is used to choose the explanatory variables. The standard errors of the coefficient estimates are underestimated, which makes the confidence intervals too narrow, the t statistics too high, and the p-values too low – which leads to overfitting and creates a false confidence in the final model.” (from the Smith article) followed by something like “Models built using stepwise selection (especially in small datasets) have an increased risk of capitalizing on chance features of the data, which will often fail when testing on a different dataset than the one used to build the model.”
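For a non-technical audience, a small simulation of "capitalizing on chance" may land better than the passage alone. The sketch below is my own illustration, not from the Smith article: every candidate predictor is pure noise, the one with the strongest correlation in the development data is "selected", and the same variable is then checked in fresh data. The names and settings (n = 100, k = 20 candidates) are assumptions.

```python
import math
import random


def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def one_study(n, k, rng):
    """Pick the best of k noise predictors, then re-test it on new data.

    Returns (apparent |r| in development data,
             r in validation data, signed in the 'expected' direction).
    """
    y_dev = [rng.gauss(0, 1) for _ in range(n)]
    y_val = [rng.gauss(0, 1) for _ in range(n)]
    best_dev, best_val = 0.0, 0.0
    for _ in range(k):
        x_dev = [rng.gauss(0, 1) for _ in range(n)]
        x_val = [rng.gauss(0, 1) for _ in range(n)]
        r_dev = pearson_r(x_dev, y_dev)
        if abs(r_dev) > abs(best_dev):  # selection uses development data only
            best_dev = r_dev
            best_val = pearson_r(x_val, y_val)
    sign = 1.0 if best_dev >= 0 else -1.0
    return abs(best_dev), sign * best_val


def simulate(n=100, k=20, reps=200, seed=2):
    """Average apparent and validated correlation over many studies."""
    rng = random.Random(seed)
    results = [one_study(n, k, rng) for _ in range(reps)]
    mean_apparent = sum(a for a, _ in results) / reps
    mean_validated = sum(v for _, v in results) / reps
    return mean_apparent, mean_validated


if __name__ == "__main__":
    apparent, validated = simulate()
    print(f"mean correlation of the selected variable, development data: {apparent:.2f}")
    print(f"mean correlation of the same variable, new data: {validated:.2f}")
```

The selected variable looks clearly associated with the outcome in the data used to choose it, while in new data the association averages out to roughly zero, because the selection step itself manufactured the apparent effect.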