Bayesian predictive projection for variable selection

There have been recent advances in the field of (Bayesian) Projection Predictive Inference, especially through the projpred R package.

The main goal of this method is to select “the smallest submodel which makes predictions similar enough to those of the reference model”. The reference model is the “best-performing (in terms of its predictive performance) model we have at our disposal, and is one we would be happy to use as-is”. Thus, one could see this method as an alternative approach to variable selection.
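To make the “smallest submodel with similar enough predictions” criterion concrete, here is a toy sketch (my own illustration; the function name and the one-standard-error threshold are made up, not projpred's exact defaults):

```python
# Toy sketch of the "smallest adequate submodel" decision rule
# (illustrative only; not projpred's implementation).

def smallest_adequate(elpd_sub, se_sub, elpd_ref, n_se=1.0):
    """elpd_sub[k]: estimated elpd of the best submodel with k+1 terms;
    se_sub[k]: its standard error. Return the smallest number of terms
    whose elpd is within n_se standard errors of the reference model's."""
    for k, (elpd, se) in enumerate(zip(elpd_sub, se_sub)):
        if elpd >= elpd_ref - n_se * se:
            return k + 1
    return len(elpd_sub)  # no submodel on the path was close enough

# elpd climbs toward the reference model's value (-500) as terms are added:
size = smallest_adequate(
    elpd_sub=[-560.0, -520.0, -505.0, -502.0, -501.0],
    se_sub=[8.0, 6.0, 5.0, 5.0, 5.0],
    elpd_ref=-500.0,
)
print(size)  # -> 3: three terms already predict "similarly enough"
```

As far as I understand, projpred implements a more refined version of this kind of heuristic (e.g. in `suggest_size()`), but the principle is the same.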

How could one apply this method in the realm of clinical prediction model development?

I found an interesting applied example in the literature. They also provide interesting explanations about the method in their supplementary material.

1 Like

I’m glad you started this discussion Arthur, because I think that Bayesian predictive projection is currently the best available approach for feature selection. So it should have a major use in developing clinical prediction models, in cases where unsupervised learning (data reduction) is not favored.

1 Like

I noticed you mention the “full model” in RMS Section 5.5: “A model that contains all prespecified terms will usually be the one that predicts the most accurately on new data.”

I wonder how one should build the full model for projection predictive inference given a fixed database. I think my approach would be along these lines:

  • Given the available variables, prespecify the ones that would be most predictive based on the literature and expert opinion.
  • Build a Bayesian regression model with these variables using spike-and-slab priors.
  • Evaluate, for this potential reference model, (1) the posterior’s sensitivity to the prior and likelihood, (2) posterior predictive checks, and (3) cross-validation and the influence of the data on the posterior, as suggested by Aki.

What would be your framework in this scenario?

I have to check how to integrate missing data and nonlinear terms into the building process of the reference model.

I wouldn’t use spike-and-slab priors, and you may be able to separate the initial full modeling from the later projection step. If projection were not being done, I would use normal priors on inter-quartile-range effects, as the rmsb package’s blrm function makes these easy to implement for splined variables (and linear ones). And we need to get skilled in joint Bayesian modeling to account for missing covariates, and get away from imputation. But multiple imputation also works with Bayes; it just takes a while to run.
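For a linear term, the idea of a normal prior on an inter-quartile-range effect can be sketched numerically (my own illustration of the scaling idea, not rmsb's implementation; splined terms would need the contrast computed from the spline basis):

```python
import numpy as np

# Choose the coefficient's prior sd so that moving x from its 1st to its
# 3rd quartile has prior sd `sd_effect` on the linear predictor.
# (Sketch of the idea only; not how rmsb::blrm implements it.)

def iqr_prior_sd(x, sd_effect=2.5):
    q1, q3 = np.percentile(x, [25, 75])
    return sd_effect / (q3 - q1)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(iqr_prior_sd(x))  # -> 1.25, since the IQR of x is 2
```

The appeal of this parameterization is that the prior is stated on a clinically interpretable scale (the effect of a typical covariate change) rather than on raw coefficients.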

Interesting. And what if projection were being done? Horseshoe priors? Bayesian lasso?

I found this review insightful.

Regarding Bayesian projection this has more than I know on the subject: Projection predictive variable selection – A review and recommendations for the practicing statistician

1 Like

Aki is always doing awesome work.

I liked his pragmatic approach of using horseshoe priors and LOO-CV to limit overfitting.

I will evaluate how to integrate missing data into this workflow. I am also interested in how to deploy a CPM based on predictive projection.

1 Like

It seems multiple imputation is not currently supported by projpred:

“Conclusion is that projpred currently does not support multiple imputation, so you would have to come up with your own solution for this (as @avehtari said: in principle, this is possible)” – January 2023

In the same post, Aki suggested:

You can use it with multiple imputation by repeating the projection and variable selection for each imputed data set and combine the results in the end (this is the usual multiple imputation approach)

I wonder how one would “combine the results in the end”.

There is an open issue about this on GitHub, but no progress so far.

1 Like

That’s a very interesting issue even with standard (volatile) stepwise variable selection, since each imputed dataset can give rise to a different set of “winning” variables. When they vary too much, it’s a signal not to do variable selection.

1 Like

Maybe a stacked approach with weighted regression, similar to stacked LASSO?

If the selected variables are all the same with different imputed datasets, then that’s it. If the selected variables are different with different imputed datasets, I would use majority voting and report the variation due to missing data uncertainty.
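The majority-voting idea can be sketched like this (my own illustration, not a projpred feature; the variable names and the 50% threshold are arbitrary):

```python
from collections import Counter

# Combine variable selections across M imputed datasets: keep variables
# selected in more than half of the imputations, and report selection
# frequencies as a measure of missing-data uncertainty.

def majority_vote(selections, threshold=0.5):
    m = len(selections)
    freq = Counter(v for sel in selections for v in sel)
    chosen = sorted(v for v, c in freq.items() if c / m > threshold)
    return chosen, {v: c / m for v, c in freq.items()}

selections = [  # one set of selected variables per imputed dataset
    {"age", "sbp", "chol"},
    {"age", "sbp"},
    {"age", "sbp", "smoke"},
    {"age", "chol"},
    {"age", "sbp", "chol"},
]
chosen, freqs = majority_vote(selections)
print(chosen)          # -> ['age', 'chol', 'sbp']
print(freqs["smoke"])  # selected in 1 of 5 imputations -> 0.2
```

Reporting the full frequency table, not just the winners, is what conveys the variation due to missing data.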

Changing projpred to support the output of brm_multiple() is a big task, and unfortunately we have limited resources. I’d be happy to learn more about cases where the simpler approach would not be sufficient; then we could first experiment with how much difference it would make if the search in projpred supported brm_multiple() output.

For the priors, we have recently been using R2D2-type priors more often than horseshoe priors, especially with normal data models; see, e.g., https://users.aalto.fi/~ave/casestudies/VariableSelection/student.html

1 Like

Aki, it’s great having you answer here. Thanks! Is there any chance that your projection method could fit into a fully Bayesian missing-data modeling framework so that imputation could be avoided altogether?

1 Like

Projection is based on minimizing the KL divergence from the reference model’s predictive distribution to the constrained model’s predictive distribution, for each reference posterior draw separately. For many data model distributions, this is equivalent to, or can be approximated by, optimizing the constrained model’s parameters given the mean of the reference model’s prediction for each reference posterior draw. In the case of joint missing-data imputation, each reference model posterior draw also includes a draw from the missing-data distribution. The optimization approach to minimizing the KL divergence does not work well for these latent data parameters. We could approximate by keeping the latent data parameters fixed and optimizing only the other parameters, but then this would be the same as using the multiple imputation approach, which is a big task to add, as discussed in a GitHub issue.
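For a Gaussian data model, the draw-by-draw projection described above reduces to an ordinary least-squares fit of the reference model’s mean predictions onto the submodel’s design matrix, one posterior draw at a time. A minimal numpy sketch with made-up data (this illustrates the Gaussian special case only; it is not projpred code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))             # full design matrix
beta_draws = rng.normal(size=(200, p))  # stand-in posterior draws of the reference model

X_sub = X[:, :2]  # submodel keeps only the first two covariates

proj_draws = np.empty((beta_draws.shape[0], X_sub.shape[1]))
for s, beta in enumerate(beta_draws):
    mu_ref = X @ beta  # reference model's mean prediction for draw s
    # Submodel coefficients whose predictions best mimic mu_ref:
    # for the Gaussian case, minimizing KL over the mean parameters
    # is exactly this least-squares problem.
    proj_draws[s], *_ = np.linalg.lstsq(X_sub, mu_ref, rcond=None)

# proj_draws is a posterior for the submodel that targets the reference
# model's predictions instead of being refit to the raw data.
print(proj_draws.shape)  # -> (200, 2)
```

This also shows why latent missing-data parameters are awkward: the per-draw optimization above only makes sense for parameters whose effect on the predictive mean can be optimized freely, not for draws of the missing values themselves.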

In theory that KL projection could be done for any model so that, instead of doing the projection draw by draw, we would directly optimize to find the full projected distribution, but that task is so complicated that we don’t know how to do it in reasonable computation time.

2 Likes