Publications explaining why automated variable selection procedures produce biased results


#1

I am sometimes consulted by my peers (MDs) regarding statistical and methodological advice for their research projects. I always suggest they consult a real statistician. For logistical reasons, that advice is not very helpful. I often end up helping in some way. I’m in an unfortunate transition phase in my applied statistical skills, in which I have read enough to know that some approaches consistently produce biased results, but I am unable to provide a convincing argument against using them.

Could the datamethods.org community provide links to papers which address common statistical errors and systematically explain why they produce biased results? This would be very helpful in general. Specifically, I am repeatedly confronted with researchers who would like to use automated variable selection procedures and have trouble convincing them that it is a bad idea.


#2

Since this is not specifically health related and is a general statistical issue, I suggest you post this question on stats.stackexchange.com. You’ll see that this issue has been extensively covered on that site. Also look at the stepwise FAQ. The jury is no longer out on this one.


#3

I very much sympathize with Elias. Such analyses continue to be very commonly published in the medical literature. Given newly published examples from respected colleagues, it is indeed difficult to provide convincing arguments otherwise.

Arguments invoking technical concerns, e.g. biased R-squared values, are particularly ineffective.

Has anyone in the group found intuitive arguments or apt analogies that are more convincing to non-technical audiences?

For example, I recently attended a journal club where the following paper was discussed:

Shenoi RP, Allahabadi S, Rubalcava DM, Camp EA. The Pediatric Submersion Score Predicts Children at Low Risk for Injury Following Submersions. Acad Emerg Med [Internet] 2017;24(12):1491–500. Available from: http://dx.doi.org/10.1111/acem.13278

Shared full-text:
https://paperpile.com/shared/Kyf8O4

An excerpt from the methods section follows:

Variables were considered for further analysis if the p-value was ≤ 0.25, the percentage missing were < 30% and if the information was readily available to the physician. Selected variables were then inputted into a binary regression model, using a backward-step approach; only those factors that had a p-value < 0.05 were considered in generating the submersion score. To assess how well the predictors fit the model, the Hosmer-Lemeshow, chi-square, and p-value were calculated. Additionally, to further reduce overfitting the model, a penalized regression (LASSO) was also conducted and compared to the stepwise analysis. Final predictors for the submersion score were based on both model types.
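A quick way to see what is wrong with this kind of pipeline is to run it on data where the outcome is unrelated to every candidate variable. The sketch below is illustrative only: it uses OLS as a simple stand-in for the paper's logistic model, and all sample sizes and thresholds are arbitrary choices mimicking the excerpt's two-stage screen (univariable p ≤ 0.25, then keep p < 0.05 in a joint model).

```python
# Illustrative simulation (not the paper's data): apply a two-stage
# screen -- univariable p <= 0.25, then retain p < 0.05 in a joint
# model -- to pure noise, and count how often at least one spurious
# "predictor" survives. OLS is used as a simple stand-in for the
# paper's logistic regression.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, runs = 150, 20, 200
runs_with_survivor = 0

for _ in range(runs):
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)          # outcome unrelated to every predictor

    # Stage 1: univariable screen at p <= 0.25
    keep = [j for j in range(p)
            if stats.pearsonr(X[:, j], y)[1] <= 0.25]
    if not keep:
        continue

    # Stage 2: joint OLS fit; retain predictors with p < 0.05
    Xk = np.column_stack([np.ones(n), X[:, keep]])
    beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
    resid = y - Xk @ beta
    df = n - Xk.shape[1]
    sigma2 = resid @ resid / df
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xk.T @ Xk)))
    t = beta / se
    pvals = 2 * stats.t.sf(np.abs(t[1:]), df)   # skip the intercept
    if (pvals < 0.05).any():
        runs_with_survivor += 1

print(f"{runs_with_survivor}/{runs} noise-only datasets yielded "
      f"at least one 'significant' predictor")
```

In a typical run, a majority of the noise-only datasets produce at least one "significant" predictor, because the lenient first-stage screen enriches the candidate set for variables with accidentally small p-values.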


#4

People seem to be persuaded by guidelines more than anything else (maybe because they distill much of the stats literature into concise guidance), so you could quote the EMA guideline on adjustment for baseline covariates: “Methods that select covariates by choosing those that are most strongly associated with the primary outcome (often called ‘variable selection methods’) should be avoided.” That’s pretty unequivocal, although they are discussing RCTs.

Edit: in the paper you reference they are developing a risk score, which is quite different.


#5

How could one methods section make more mistakes than that one? It is free of statistical principles.


#6

That’s unfortunately what I am dealing with, minus performing LASSO and comparing the two methodologies. Part of the reason I am not making much headway using statistical arguments is that the population I am collaborating with does not have much of a statistical background. That is what triggered this post - i.e. the search for practical publications that show, using case studies, why certain common statistical errors will produce the wrong result. The argument would use domain-specific intuition to explain statistical concepts.


#7

Simple simulations as in my book and course notes are perhaps best. Some links are here.
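For instance, a minimal simulation of the "keep it only if p < 0.05" filter (a generic sketch, not taken from any referenced book or notes; the effect size and sample size are arbitrary) shows that the coefficients of variables that survive selection are biased away from zero:

```python
# Minimal simulation of selection-induced coefficient bias: a weak
# true effect is repeatedly estimated, but the estimate is kept only
# when p < 0.05. The average of the kept slopes is inflated relative
# to the true slope. Illustrative sketch with arbitrary parameters.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, beta_true, runs = 100, 0.15, 2000   # weak true effect
kept = []

for _ in range(runs):
    x = rng.normal(size=n)
    y = beta_true * x + rng.normal(size=n)
    res = stats.linregress(x, y)       # univariable OLS slope + p-value
    if res.pvalue < 0.05:              # "selected" by significance
        kept.append(res.slope)

print(f"true slope:           {beta_true}")
print(f"mean slope when kept: {np.mean(kept):.3f}  "
      f"(kept {len(kept)}/{runs})")
```

The kept slopes average well above the true value: conditioning on statistical significance truncates away the small estimates, which is exactly what automated selection does to every coefficient in the final model. This kind of two-line takeaway (true value vs. average selected value) tends to land with non-statistical audiences better than any formula.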