Challenges to variable selection

Variable/predictor/feature selection methods are notorious for instability and for poor overlap between the “selected” features and the “real” ones. They also have substantial sample size requirements if the set of selected features is to be stable and mostly correct. As shown in this presentation, the lasso has too low a probability of selecting the “right” features and too high a probability of selecting the “wrong” features, even in the ideal case where predictors are uncorrelated and the true unknown regression coefficients have a Laplace (double exponential) distribution centered at zero.
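To make that claim concrete, here is a small simulation sketch under roughly those ideal conditions. It uses the fact that for an exactly orthogonal design the lasso solution is soft-thresholded OLS, so the selected set is just the coordinates whose OLS estimate exceeds the penalty. The penalty value and the cutoff used to call a true coefficient “real” (|beta_j| > 0.5) are arbitrary illustrative choices, not part of the original claim:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 50
lam = 0.5  # lasso penalty; trades off missed "real" vs. selected "wrong" features

# True coefficients: Laplace (double exponential) centered at zero, as in the ideal case
beta = rng.laplace(0.0, 1.0, size=p)
# Call a feature "real" if its true |beta| exceeds 0.5 -- an arbitrary cutoff
real = np.abs(beta) > 0.5

# Exactly orthogonal design: QR gives orthonormal columns, rescaled so X'X = n*I
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
X = Q * np.sqrt(n)
y = X @ beta + rng.standard_normal(n)

# With X'X = n*I, per-coordinate OLS is exact, and the lasso keeps
# exactly the coordinates with |ols_j| > lam (soft-thresholding)
ols = X.T @ y / n
selected = np.abs(ols) > lam

sens = (selected & real).sum() / real.sum()            # P(select | real)
false_sel = (selected & ~real).sum() / (~real).sum()   # P(select | not real)
print(f"selected {selected.sum()} of {p}; "
      f"sensitivity {sens:.2f}, false selection rate {false_sel:.2f}")
```

Even in this best-case setting, neither the sensitivity nor the false selection rate is ideal, and both move in opposite directions as the penalty changes.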

A key problem with feature selection methods is that they do not inform the analysis about the set of features for which we don’t know enough to either select or reject them. The following approximate Bayesian procedure may help in this regard, and may expose whether the sample size is adequate for the task. It is a simplified univariable screening approach that is most appropriate when the predictors are uncorrelated with each other.

  • Assume the candidate features are all scaled similarly and have linear effects on the outcome
  • Assume independent prior distributions for the candidate feature regression coefficients. The priors can be skeptical of large effects and may even place a spike of probability mass at zero effect.
  • Assume for the moment that we are only interested in selecting features that have a positive association with Y (if not, replace the posterior probability calculation with, for example, the probability that |\beta_j| > 0.2)
  • Fit a model with each feature as the only predictor, and repeat for all candidate predictors
  • For each predictor j compute the posterior probability (PP) that \beta_j > 0
  • Consider a predictor to be selected when PP > 0.9 and rejected when PP < 0.1
  • Count the number of candidate features for which PP is between 0.1 and 0.9
  • If the fraction of candidates falling into the uncertainty interval exceeds some number such as 0.25, conclude that the sample size is inadequate for the task
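The steps above can be sketched in code. For a closed-form posterior this sketch uses a normal skeptical prior and a known noise standard deviation in place of the more general priors described above; those simplifications, and the specific prior scale, are my assumptions, not part of the procedure:

```python
import numpy as np
from scipy.stats import norm

def screen(X, y, prior_sd=0.5, sigma=1.0, lo=0.1, hi=0.9):
    """Univariable Bayesian screen: one simple regression per standardized
    feature, skeptical prior beta_j ~ N(0, prior_sd^2), known noise sd sigma
    (simplifying assumptions; the post allows Laplace or spike-and-slab priors)."""
    y = y - y.mean()
    pp = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        x = X[:, j]
        x = (x - x.mean()) / x.std()          # scale all features similarly
        # Conjugate normal posterior for the slope in y = beta_j * x + e
        prec = (x @ x) / sigma**2 + 1.0 / prior_sd**2
        mean = (x @ y) / sigma**2 / prec
        pp[j] = norm.sf(0.0, loc=mean, scale=prec ** -0.5)  # P(beta_j > 0 | data)
    selected = pp > hi
    rejected = pp < lo
    uncertain = ~selected & ~rejected          # PP in [lo, hi]: we don't know enough
    return pp, selected, rejected, uncertain

# Example with one real positive predictor among 30 candidates;
# a large uncertain fraction (say > 0.25) flags an inadequate sample size
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 30))
y = X[:, 0] * 0.8 + rng.standard_normal(50)
pp, sel, rej, unc = screen(X, y)
print(f"selected {sel.sum()}, rejected {rej.sum()}, uncertain fraction {unc.mean():.2f}")
```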

I’m interested in comments on the usefulness/reasonableness of this procedure. To my mind it is more useful than computing the false discovery probability: it quantifies what we don’t know.


I’m curious what advantages this procedure might have over fitting a multivariable model including all predictors and using a shrinkage prior such as the horseshoe?
Secondly, is there a risk that confounding/masking in univariable models may lead to us selecting the wrong predictors?

Fitting a multivariable model with a smart choice of penalty will be better, and the Bayesian posterior probabilities can still be computed; I was presenting an oversimplified approach. On your second question, the point is that feature selection methods select the wrong features a large portion of the time even under ideal conditions. When you have collinearity, confounding, etc., things are worse. I’m trying to quantify when to give up on feature selection, i.e., when the information content in the data is insufficient for the task.


If the features are of similar “types” (as in omics data), what do you think of a slightly modified approach where it’s not quite a multivariable model but essentially a set of univariate models that pools the coefficient estimates:

X_j represents each feature. Y is the same outcome (n x 1). p is # of features, n is # of samples.

  1. Repeat the Y vector p times in the columns to form an n x p matrix Y

  2. Set a global prior on a common location parameter: \mu ~ Laplace(0, scale)

  3. Set individual priors B_j ~ Laplace(\mu, scale2)

  4. Intercept prior: b0 ~ Normal(0, sd)

  5. Fit the model Y = b0 + X * B^T, where * denotes element-wise (broadcasted) multiplication rather than matrix multiplication: B is p x 1, so B^T is 1 x p and broadcasts across the rows of the n x p matrix X, giving column j of Y the model b0 + B_j X_j

Essentially this would result in p univariate models, but the B_j would be pooled, whereas the method proposed in the OP uses fully separate models. Wouldn’t this pooled approach make more efficient use of the information in the data, while retaining a similar interpretation to the univariate models? Also, the partial pooling would induce some correlation among the estimated B_j’s, which seems more realistic.
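A rough sketch of this partial-pooling idea (not the exact model above): substituting normal priors for the Laplace priors gives everything a closed form, so an empirical-Bayes version can shrink each separately estimated univariate slope toward a precision-weighted global mean. The prior scales `tau` and `sigma` are assumed known here for simplicity; both substitutions are my assumptions:

```python
import numpy as np

def pooled_univariate(X, y, tau=0.3, sigma=1.0):
    """Partial pooling across p univariate slopes: B_j ~ N(mu, tau^2),
    with mu estimated by a precision-weighted mean of the separate OLS
    slopes (an empirical-Bayes simplification of the hierarchical model)."""
    n, p = X.shape
    Xs = (X - X.mean(0)) / X.std(0)
    yc = y - y.mean()
    # Per-feature univariate OLS slope and its sampling variance
    b = Xs.T @ yc / n
    v = np.full(p, sigma**2 / n)
    # Global location: precision-weighted average of the separate estimates
    w = 1.0 / (v + tau**2)
    mu = (w * b).sum() / w.sum()
    # Shrink each slope toward mu; noisier slopes (larger v_j) shrink more
    shrink = tau**2 / (tau**2 + v)
    B = mu + shrink * (b - mu)
    return B, mu

# Example on synthetic data with Laplace-distributed true coefficients
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 20))
beta = rng.laplace(0.0, 0.5, size=20)
y = X @ beta + rng.standard_normal(100)
B, mu = pooled_univariate(X, y)
print(f"global mean {mu:.3f}; pooled slopes range "
      f"[{B.min():.2f}, {B.max():.2f}]")
```

The shrinkage factor tau^2 / (tau^2 + v_j) is what creates the pooling: as the hierarchical scale tau grows, the estimates approach the fully separate univariate fits, and as it shrinks, they collapse toward the common mean.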

Joint modeling is usually better, and the same probabilities can be computed as I stated for the repeated-univariate case. For joint models I suggest using standard notation and putting horseshoe or other reasonable priors on the \beta_j.