Bayesian predictive projection for variable selection

There have been recent advances in the field of (Bayesian) Projection Predictive Inference, especially through the projpred R package.

The main goal of this method is to select “the smallest submodel which makes predictions similar enough to those of the reference model”. The reference model is the “best-performing (in terms of its predictive performance) model we have at our disposal, and is one we would be happy to use as-is”. Thus, one could see this method as an alternative approach to variable selection.
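To make the “smallest submodel with similar enough predictions” criterion concrete, here is a toy sketch (my own illustration; the function name and the one-standard-error threshold are made up, not projpred's exact defaults):

```python
# Toy sketch of the "smallest adequate submodel" decision rule
# (illustrative only; not projpred's implementation).

def smallest_adequate(elpd_sub, se_sub, elpd_ref, n_se=1.0):
    """elpd_sub[k]: estimated elpd of the best submodel with k+1 terms;
    se_sub[k]: its standard error. Return the smallest number of terms
    whose elpd is within n_se standard errors of the reference model's."""
    for k, (elpd, se) in enumerate(zip(elpd_sub, se_sub)):
        if elpd >= elpd_ref - n_se * se:
            return k + 1
    return len(elpd_sub)  # no submodel on the path was close enough

# elpd climbs toward the reference model's value (-500) as terms are added:
size = smallest_adequate(
    elpd_sub=[-560.0, -520.0, -505.0, -502.0, -501.0],
    se_sub=[8.0, 6.0, 5.0, 5.0, 5.0],
    elpd_ref=-500.0,
)
print(size)  # -> 3: three terms already predict "similarly enough"
```

As far as I understand, projpred implements a more refined version of this kind of heuristic (e.g. in `suggest_size()`), but the principle is the same.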

How could one apply this method in the realm of clinical prediction model development?

I found an interesting applied example in the literature. They also provide interesting explanations about the method in their supplementary material.

1 Like

I’m glad you started this discussion Arthur, because I think that Bayesian predictive projection is currently the best available approach for feature selection. So it should have a major use in developing clinical prediction models, in cases where unsupervised learning (data reduction) is not favored.

1 Like

I noticed you mention the “full model” in RMS Section 5.5: “A model that contains all prespecified terms will usually be the one that predicts the most accurately on new data.”

I wonder how one should build the full model for projection predictive inference given a fixed database. I think my approach would be along these lines:

  • Given the available variables, prespecify the ones that would be most predictive based on the literature and expert opinion.
  • Build a Bayesian regression model with these variables using spike-and-slab priors.
  • Evaluate, for this potential reference model, (1) the posterior’s sensitivity to the prior and likelihood, (2) posterior predictive checks, and (3) cross-validation and the influence of the data on the posterior, as suggested by Aki.

What would be your framework in this scenario?

I have to check how to integrate missing data and nonlinear terms into the building process of the reference model.

I wouldn’t use spike-and-slab priors, and you may be able to separate the initial full modeling from the later projection step. If projection were not being done, I would use normal priors on inter-quartile-range effects, as the rmsb package’s blrm function makes these easy to implement for splined variables (and linear ones). And we need to get skilled in joint Bayesian modeling to account for missing covariates, and get away from imputation. But multiple imputation also works with Bayes; it just takes a while to run.
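For a linear term, the idea of a normal prior on an inter-quartile-range effect can be sketched numerically (my own illustration of the scaling idea, not rmsb's implementation; splined terms would need the contrast computed from the spline basis):

```python
import numpy as np

# Choose the coefficient's prior sd so that moving x from its 1st to its
# 3rd quartile has prior sd `sd_effect` on the linear predictor.
# (Sketch of the idea only; not how rmsb::blrm implements it.)

def iqr_prior_sd(x, sd_effect=2.5):
    q1, q3 = np.percentile(x, [25, 75])
    return sd_effect / (q3 - q1)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(iqr_prior_sd(x))  # -> 1.25, since the IQR of x is 2
```

The appeal of this parameterization is that the prior is stated on a clinically interpretable scale (the effect of a typical covariate change) rather than on raw coefficients.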

Interesting. And what if projection were being done? Horseshoe priors? Bayesian lasso?

I found this review insightful.

Regarding Bayesian projection this has more than I know on the subject: Projection predictive variable selection – A review and recommendations for the practicing statistician

1 Like

Aki is always doing awesome work.

I liked his pragmatic approach of using horseshoe priors and LOO-CV to limit overfitting.

I will evaluate how to integrate missing data into this workflow. I am also interested in how to deploy a CPM based on predictive projection.

1 Like

It seems multiple imputation is not currently supported by projpred:

“Conclusion is that projpred currently does not support multiple imputation, so you would have to come up with your own solution for this (as @avehtari said: in principle, this is possible)” – January 2023

In the same post, Aki suggested:

You can use it with multiple imputation by repeating the projection and variable selection for each imputed data set and combine the results in the end (this is the usual multiple imputation approach)

I wonder how one would “combine the results in the end”.

There is an open issue about this on GitHub, but no progress so far.

1 Like

That’s a very interesting issue even with standard (volatile) stepwise variable selection, since each imputed dataset can give rise to a different set of “winning” variables. When they vary too much, it’s a signal not to do variable selection.

1 Like

Maybe a stacked approach with weighted regression, similar to stacked LASSO?

If the selected variables are all the same with different imputed datasets, then that’s it. If the selected variables are different with different imputed datasets, I would use majority voting and report the variation due to missing data uncertainty.
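The majority-voting idea can be sketched like this (my own illustration, not a projpred feature; the variable names and the 50% threshold are arbitrary):

```python
from collections import Counter

# Combine variable selections across M imputed datasets: keep variables
# selected in more than half of the imputations, and report selection
# frequencies as a measure of missing-data uncertainty.

def majority_vote(selections, threshold=0.5):
    m = len(selections)
    freq = Counter(v for sel in selections for v in sel)
    chosen = sorted(v for v, c in freq.items() if c / m > threshold)
    return chosen, {v: c / m for v, c in freq.items()}

selections = [  # one set of selected variables per imputed dataset
    {"age", "sbp", "chol"},
    {"age", "sbp"},
    {"age", "sbp", "smoke"},
    {"age", "chol"},
    {"age", "sbp", "chol"},
]
chosen, freqs = majority_vote(selections)
print(chosen)          # -> ['age', 'chol', 'sbp']
print(freqs["smoke"])  # selected in 1 of 5 imputations -> 0.2
```

Reporting the full frequency table, not just the winners, is what conveys the variation due to missing data.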

Changing projpred to support the output of brm_multiple() is a big task, and unfortunately we have limited resources. I’d be happy to learn more about cases where the simpler approach would not be sufficient; then we could first experiment with how much difference it would make if the search in projpred supported brm_multiple() output.

For the priors, we have recently been using R2D2-type priors more often than horseshoe priors, especially with normal data models; see, e.g., https://users.aalto.fi/~ave/casestudies/VariableSelection/student.html

1 Like

Aki, it’s great having you answer here. Thanks! Is there any chance that your projection method could fit into a fully Bayesian missing-data modeling framework so that imputation could be avoided altogether?

1 Like

Projection is based on minimizing the KL divergence from the reference model’s predictive distribution to the constrained model’s predictive distribution, for each reference posterior draw separately. For many data model distributions, this is equivalent to, or can be approximated by, optimizing the constrained model’s parameters given the mean of the reference model’s prediction for each reference posterior draw. In the case of joint missing-data imputation, each reference model posterior draw also includes a draw from the missing-data distribution. The optimization approach to minimizing the KL divergence does not work well for these latent data parameters. We could approximate by keeping the latent data parameters fixed and optimizing only the other parameters, but then this would be the same as using the multiple imputation approach, which is a big task to add, as discussed in a GitHub issue.
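For a Gaussian data model, the draw-by-draw projection described above reduces to an ordinary least-squares fit of the reference model’s mean predictions onto the submodel’s design matrix, one posterior draw at a time. A minimal numpy sketch with made-up data (this illustrates the Gaussian special case only; it is not projpred code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))             # full design matrix
beta_draws = rng.normal(size=(200, p))  # stand-in posterior draws of the reference model

X_sub = X[:, :2]  # submodel keeps only the first two covariates

proj_draws = np.empty((beta_draws.shape[0], X_sub.shape[1]))
for s, beta in enumerate(beta_draws):
    mu_ref = X @ beta  # reference model's mean prediction for draw s
    # Submodel coefficients whose predictions best mimic mu_ref:
    # for the Gaussian case, minimizing KL over the mean parameters
    # is exactly this least-squares problem.
    proj_draws[s], *_ = np.linalg.lstsq(X_sub, mu_ref, rcond=None)

# proj_draws is a posterior for the submodel that targets the reference
# model's predictions instead of being refit to the raw data.
print(proj_draws.shape)  # -> (200, 2)
```

This also shows why latent missing-data parameters are awkward: the per-draw optimization above only makes sense for parameters whose effect on the predictive mean can be optimized freely, not for draws of the missing values themselves.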

In theory that KL projection could be done for any model so that, instead of doing the projection draw by draw, we would directly optimize to find the full projected distribution, but that task is so complicated that we don’t know how to do it in reasonable computation time.

2 Likes