Challenges to variable selection

Variable/predictor/feature selection methods are notorious for instability and for poor overlap between “selected” features and “real” features. They also have large sample size requirements in order to be stable and mostly correct about the set of selected features. As shown in this presentation, the lasso has too low a probability of selecting the “right” features and too high a probability of selecting the “wrong” features even in the ideal case where predictors are uncorrelated and the true unknown regression coefficients have a Laplace (double exponential) distribution centered at zero.

A key problem with feature selection methods is that they say nothing about the set of features for which we don’t know enough to either select or reject them. The following approximate Bayesian procedure may help in this regard and expose whether the sample size is adequate for the task. It is a simplified univariable screening approach that would be most appropriate when predictors are uncorrelated with each other; a code sketch follows the list.

  • Assume the candidate features are all scaled similarly and have linear effects on the outcome
  • Assume independent prior distributions for the candidate feature regression coefficients. The priors can be skeptical of large effects and may even place a spike of probability mass at zero effect.
  • Assume for the moment that we are only interested in selecting features that have a positive association with Y (if not, replace the posterior probability calculation with, for example, the probability that |\beta_j| > 0.2)
  • Fit a model with each feature as the only predictor, and repeat for all candidate predictors
  • For each predictor j compute the posterior probability (PP) that \beta_j > 0
  • Consider a predictor to be selected when PP > 0.9 and rejected when PP < 0.1
  • Count the number of candidate features for which PP is between 0.1 and 0.9
  • If the fraction of candidates falling into the uncertainty interval exceeds some number such as 0.25, conclude that the sample size is inadequate for the task
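
Below is a minimal sketch of this screening procedure. It assumes a continuous outcome, a Gaussian likelihood, a skeptical Normal(0, τ²) prior (no spike at zero), and a closed-form normal approximation to each posterior instead of MCMC; the thresholds follow the list above and the default values are only illustrative.

```python
# Minimal sketch of the univariable screening procedure above.
# Assumptions not stated in the post: continuous outcome, Gaussian likelihood,
# skeptical Normal(0, tau^2) prior (no spike at zero), and a closed-form
# normal approximation to the posterior instead of MCMC.
import numpy as np
from scipy import stats

def screen_features(X, y, tau=0.25, select=0.9, reject=0.1, max_uncertain=0.25):
    """X: n x p matrix of similarly scaled candidate features; y: outcome vector."""
    n, p = X.shape
    y = y - y.mean()
    pp = np.empty(p)
    for j in range(p):
        x = X[:, j] - X[:, j].mean()
        sxx = np.sum(x ** 2)
        beta_hat = np.sum(x * y) / sxx                   # univariable OLS slope
        resid = y - beta_hat * x
        v_hat = (np.sum(resid ** 2) / (n - 2)) / sxx     # sampling variance of the slope
        # Normal approximation to the posterior under a Normal(0, tau^2) prior
        post_var = 1.0 / (1.0 / v_hat + 1.0 / tau ** 2)
        post_mean = post_var * beta_hat / v_hat
        pp[j] = stats.norm.sf(0.0, loc=post_mean, scale=np.sqrt(post_var))  # P(beta_j > 0)
    selected = pp > select
    rejected = pp < reject
    frac_uncertain = np.mean(~selected & ~rejected)
    adequate_n = frac_uncertain <= max_uncertain
    return pp, selected, rejected, frac_uncertain, adequate_n

# Example usage on synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=200)
pp, sel, rej, frac, ok = screen_features(X, y)
print(f"selected {sel.sum()}, rejected {rej.sum()}, fraction uncertain {frac:.2f}")
```

A full Bayesian fit via MCMC would be the more faithful implementation; the closed-form approximation is used only to keep the sketch short and dependency-free.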

I’m interested in comments about the usefulness/reasonableness of this procedure. In my mind this is more useful than computing the false discovery probability. It quantifies what we don’t know.


I’m curious what advantages this procedure might have over fitting a multivariable model including all predictors and using a shrinkage prior such as the horseshoe?
Secondly, is there a risk that confounding/masking in univariable models may lead to us selecting the wrong predictors?


Fitting a multivariable model with a smart choice of penalty will be better, and the Bayesian posterior probabilities can still be computed; I was presenting an oversimplified approach. On your second question, the point is that feature selection methods select the wrong features a large portion of the time even under ideal conditions. When you have collinearity, confounding, etc., things are worse. I’m trying to quantify when to give up on feature selection, i.e., when the information content in the data is insufficient for the task.


If the features are of similar “types” (as in omics data), what do you think of a slightly modified approach where it’s not quite a multivariable model but essentially a set of univariate models that pools the estimates of the coefficients:

X_j represents each feature. Y is the same outcome (n x 1). p is # of features, n is # of samples.

  1. Repeat the Y vector p times in the columns to form an n x p matrix Y

  2. Set a global prior for all the B ~ Laplace(0,scale)

  3. Set individual priors B_j ~ Laplace(B, scale2)

  4. Intercept prior: b0 ~ Normal(0,sd)

  5. Fit the model Y = b0 + (B_j^T) * X, where * is element-wise multiplication (not matrix multiplication). B_j is p x 1, so B_j^T is 1 x p, and X is n x p.

Essentially this would result in p univariate models, but the B_j would be pooled, whereas the method proposed in the OP fits fully separate models. Wouldn’t this pooled approach make more efficient use of the information in the data, while retaining an interpretation similar to a univariate model? Also, the partial pooling would induce some correlation between the estimated B_j’s, which seems more realistic.
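
A rough PyMC sketch of this pooled formulation is below. The Gaussian likelihood, the half-normal prior on the residual SD, and all scale values are assumptions added for illustration, and the synthetic data stand in for a real feature matrix.

```python
# Rough PyMC sketch of the pooled formulation above. The Gaussian likelihood,
# half-normal prior on sigma, and all scale values are assumptions for
# illustration; the synthetic data stand in for a real n x p feature matrix.
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
n, p = 90, 40
X = rng.normal(size=(n, p))
y = X[:, :5] @ rng.normal(size=5) + rng.normal(size=n)
Y = np.tile(y[:, None], (1, p))                    # step 1: repeat y into an n x p matrix

with pm.Model() as pooled_univariate:
    B = pm.Laplace("B", mu=0.0, b=1.0)             # step 2: global (pooled) prior location
    B_j = pm.Laplace("B_j", mu=B, b=0.5, shape=p)  # step 3: per-feature coefficients, partially pooled
    b0 = pm.Normal("b0", mu=0.0, sigma=1.0)        # step 4: shared intercept
    sigma = pm.HalfNormal("sigma", sigma=1.0)      # residual SD (assumed; not in the post)
    mu = b0 + B_j * X                              # step 5: element-wise, so column j uses only feature j
    pm.Normal("Y_obs", mu=mu, sigma=sigma, observed=Y)
    idata = pm.sample()

# Posterior probability that each B_j > 0, as in the procedure from the OP
pp = (idata.posterior["B_j"] > 0).mean(dim=("chain", "draw")).values
```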

Joint modeling is usually better, and the same probabilities can be computed as I stated for the repeated univariate case. For joint models I suggest using standard notation and putting horseshoe or other reasonable priors on the \beta s.
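
To make that concrete, here is a rough PyMC sketch of a joint model with a simple (non-regularized) horseshoe prior, i.e., half-Cauchy global and local scales on Normal coefficients; the synthetic data, hyperparameters, and Gaussian likelihood are illustrative assumptions, not the only reasonable choices.

```python
# Rough PyMC sketch of a joint model with a simple (non-regularized) horseshoe
# prior: half-Cauchy global and local scales on Normal coefficients. Data,
# hyperparameters, and the Gaussian likelihood are illustrative assumptions.
import numpy as np
import pymc as pm

rng = np.random.default_rng(1)
n, p = 100, 30
X = rng.normal(size=(n, p))
y = X[:, :3] @ np.array([1.0, -0.8, 0.6]) + rng.normal(size=n)

with pm.Model() as joint_horseshoe:
    tau = pm.HalfCauchy("tau", beta=1.0)             # global shrinkage
    lam = pm.HalfCauchy("lam", beta=1.0, shape=p)    # local, per-coefficient shrinkage
    beta = pm.Normal("beta", mu=0.0, sigma=tau * lam, shape=p)
    b0 = pm.Normal("b0", mu=0.0, sigma=2.0)
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    pm.Normal("y_obs", mu=b0 + pm.math.dot(X, beta), sigma=sigma, observed=y)
    idata = pm.sample()

# The same selection/rejection/uncertainty probabilities as before, now from the joint posterior
pp = (idata.posterior["beta"] > 0).mean(dim=("chain", "draw")).values
```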

I’ve been enjoying reading about feature selection in high-dimensional settings and I am curious to learn as much as possible. I have some questions I’d like to clarify, and I hope the members of this community can help.

Context:

The general sentiment on (great) websites such as this and Cross Validated is that expecting to get reliable answers about feature importance, or to perform feature selection, when p >> n and n is small (say, less than 100) is a hopeless task. Professor Harrell’s recent article on the topic gave me the same impression.

Yet, reading literature in fields such as metabolomics – where p >> n, n is often small, and feature selection is a common objective – one gets the impression that it can be done (e.g., reading reviews such as this), and articles exist that seem to have done it. Admittedly, many of these do not validate with repeated sub-sampling and therefore could be sensitive to changes in the data, although you do see it sometimes (such as here, where features were ranked based on 100x repeats of cross-validation, and here with repeated bootstrapping).

I’ve also been experimenting with a feature selection process on a real metabolomics dataset (p = 900 and n = 90), and although my test-set prediction accuracy is (unsurprisingly) very low, certain features are repeatedly selected across the different models, linear and nonlinear, in my procedure.

Finally, I can run simulations in Python (e.g., with sklearn’s make_regression) and use the lasso both to (successfully) select the most important features and to preserve the R-squared score relative to standard linear regression, despite increasing the number of irrelevant variables. This suggests it is possible to identify a small subset of predictors from a larger set, so does this simply mean that such simple simulations do not translate to problems involving complex real biological data?
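
For concreteness, here is a minimal sketch of that kind of simulation; the parameter values are purely illustrative, and note that make_regression produces uncorrelated predictors by default, which is the ideal case for the lasso.

```python
# Minimal sketch of the simulation described above; parameter values are
# illustrative. Note that make_regression produces uncorrelated predictors by
# default, which is the ideal case for the lasso.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y, true_coef = make_regression(
    n_samples=90, n_features=900, n_informative=10,
    noise=10.0, coef=True, random_state=0)

lasso = LassoCV(cv=5, max_iter=10_000).fit(X, y)
selected = np.flatnonzero(lasso.coef_ != 0)
informative = np.flatnonzero(true_coef != 0)

print("selected:", selected.size,
      "| truly informative recovered:", np.intersect1d(selected, informative).size,
      "| false selections:", np.setdiff1d(selected, informative).size)
```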

Questions:

  1. How can I reconcile this apparent mismatch between the hopelessness of the task and the fact that it apparently has been done and written about in highly cited papers?

  2. I presume one answer to 1) would be that these studies would not stand had repeated subsampling or external validation been done. But then what about instances where this has been done and the results remain similar? Does this not suggest it is not a hopeless task, and that it can be done?

  3. With regard to the third paragraph and my own experience, I see how trusting variables selected from models with poor predictive performance could be very shaky. However, I can also see how the variables could genuinely be important and had n been greater, predictive performance could also have been better. Does this latter interpretation have any validity?

  4. The question at the end of the Context section (whether such simple simulations translate to real biological data) would be my last question.

Thanks!

Without getting into all aspects of your important post, I’ll make a few observations.

  • I think your assessment of performance isn’t quite stringent enough. We need a feature selection method to simultaneously select truly important features and not select truly unimportant features.
  • The fact that omics uses feature selection ad nauseam is not correlated with it actually working. The vast majority of research based on p >> n with n not large is delusional. Almost none of the researchers even know there is such a thing as a false non-discovery probability.
  • The lasso has very little chance of selecting the “right” variables; it is about prediction and parsimony, and most researchers do not recognize the instability of the set of variables selected by the lasso.
  • A very enlightening experiment would be to formulate a reasonable Bayesian prior for the distribution of biomarker effects (e.g., log odds ratios) and, for 10,000 candidate features, compute for each one the following posterior probabilities (OR = odds ratio); a stylized sketch follows this list.
    • P(OR > 1.2 or < 1/1.2) (probability of non-trivial effect)
    • P(1/1.2 < OR < 1.2) (probability of trivial or no effect)
    • Feature selection can work if, say, 9,500 features have either the first probability > 0.9 or the second > 0.9
    • Instead you’ll find the vast majority of features are in the gray zone where you don’t know if the feature is useful or not. In many cases you will know as much after collecting data as you did before collecting data.
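
Here is a stylized sketch of that experiment; the Laplace prior on true effects, the common posterior standard deviation, and the normal posterior approximation are all made-up ingredients chosen only to show how large the gray zone can be.

```python
# Stylized sketch of the experiment above. The Laplace prior scale, the
# posterior SD, and the normal posterior approximation are made-up numbers
# chosen only to illustrate the gray zone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p = 10_000
log_thr = np.log(1.2)

true_logor = rng.laplace(loc=0.0, scale=0.1, size=p)        # assumed true log odds ratios
post_sd = 0.25                                              # assumed posterior SD at a modest n
post_mean = true_logor + rng.normal(0.0, post_sd, size=p)   # stylized posterior centers

# P(OR > 1.2 or OR < 1/1.2) and its complement P(1/1.2 < OR < 1.2)
p_nontrivial = stats.norm.sf(log_thr, post_mean, post_sd) + stats.norm.cdf(-log_thr, post_mean, post_sd)
p_trivial = 1.0 - p_nontrivial

resolved = (p_nontrivial > 0.9) | (p_trivial > 0.9)
print(f"features resolved either way: {resolved.sum()} of {p}; gray zone: {(~resolved).sum()}")
```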

Bayesian posterior probabilities, unlike p-values and false discovery probabilities, fully expose the difficulty of the task because they cannot be abused by an “absence of evidence is not evidence of absence” error.


Thanks a lot, Professor Harrell; much help as always.

Regarding your first point: agreed. Regarding the fourth: I’d love to give this a go; I’m currently learning about Bayesian approaches (Gelman and McElreath), so it might be a nice exercise.

Regarding your second and third points: also agreed (I see a lot of bad statistics and ML even in top journals); however, I am referring more to the cases in which it does appear to work. These are the situations I’m most interested in reconciling with the view that it is “delusional”.

What about when they repeat CV x100 and feature selection appears stable, and the features appear to be important to the outcome (as here where selected features seem causal for the phenotypic outcome under investigation)?

What about when they externally validate and regression performance remains adequate and the same features are selected?

Are the success stories purely data dredging or “luck” (in the case of robust validation methodology)?

I’d love to understand this.

Finally, I’d much appreciate an answer on Question 3). This seems of crucial importance in the area of variable selection yet I’m really not sure about the answer.

Thanks again for all the help.

Oh, additionally, do you have any literature off-hand that proves or supports the notion that variable selection when p >> n and n is not large is doomed (or at least unlikely to be fruitful)? If I can understand why this is the case, I can think about different solutions to the original problem, or about different problems entirely that are worth tackling.