I’ve been reading about evaluating the predictive performance of models with internal validation techniques, but the discussions usually don’t mention Bayesian models. I assume this is because predictions from Bayesian models can be probabilistic rather than just point predictions, which is worth its own discussion. If we focus only on point predictions, does it make sense to apply the same techniques (split-sample, CV, bootstrap, etc.) to evaluate the predictive performance of Bayesian models? Is the full posterior only necessary for probabilistic predictions and for checking model fit?

In my case, I have a Bayesian additive model and I want to compare its point predictions to those of a neural network (NN). My goal is to see whether the predictions from the neural network are substantially better than those of the more interpretable model. To make this comparison, I plan to evaluate the squared error loss of the Bayesian model’s MAP predictions and the NN’s predictions on bootstrap samples of the data. Does this seem appropriate?
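To make the plan above concrete, here is a minimal sketch of the paired-bootstrap comparison in Python with numpy. The function name, argument names, and the 2000-replicate default are illustrative placeholders, not from any package; the key design point is that both models are scored on the *same* resampled observations, so the comparison stays paired.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_loss_diff(y, pred_bayes, pred_nn, n_boot=2000):
    """Bootstrap the difference in mean squared error between two sets of
    point predictions (e.g. MAP predictions vs. NN predictions) on the
    same data. Returns a 95% percentile interval for mse_nn - mse_bayes;
    an interval entirely below zero suggests the NN's loss is lower."""
    y = np.asarray(y, float)
    pred_bayes = np.asarray(pred_bayes, float)
    pred_nn = np.asarray(pred_nn, float)
    n = len(y)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        # Resample observations with replacement; both models are
        # evaluated on the identical bootstrap sample (paired design).
        idx = rng.integers(0, n, size=n)
        mse_bayes = np.mean((y[idx] - pred_bayes[idx]) ** 2)
        mse_nn = np.mean((y[idx] - pred_nn[idx]) ** 2)
        diffs[b] = mse_nn - mse_bayes
    return np.percentile(diffs, [2.5, 97.5])
```

Resampling the paired losses jointly (rather than bootstrapping each model separately) keeps the between-model correlation, which usually tightens the interval on the difference considerably.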

This is a great topic and you’ll see some discussion here. With the Stan system you can use the loo package to get leave-one-out statistics efficiently, and these can be quite helpful. The rmsb package can plot the posterior distribution of various model performance metrics such as D_{xy}, the c-index, and the Brier score. I need to get a better understanding of when we should use only Bayesian point estimates (e.g., posterior mean \beta coefficients) in computing model performance measures.
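The idea of a posterior distribution for a performance metric can be sketched in a few lines of numpy. This is only an illustration of the concept, not the loo or rmsb API: compute the metric (here the Brier score) once per posterior draw, and the resulting array *is* the metric’s posterior distribution, summarizable with quantiles.

```python
import numpy as np

def posterior_brier(prob_draws, y):
    """Brier score evaluated separately at each posterior draw.

    prob_draws: (n_draws, n_obs) array of predicted event probabilities,
    one row per posterior draw; y: 0/1 outcomes of length n_obs.
    Returns a length-n_draws array, i.e. the posterior distribution of
    the Brier score."""
    prob_draws = np.asarray(prob_draws, float)
    y = np.asarray(y, float)
    # Mean squared distance between probabilities and outcomes, per draw
    return np.mean((prob_draws - y) ** 2, axis=1)
```

A 95% credible interval for the Brier score is then just `np.percentile(posterior_brier(prob_draws, y), [2.5, 97.5])`, which conveys uncertainty that a single point-estimate Brier score hides.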

First thing that comes to mind: if the model is intended to be used with point estimates only (e.g., a simple prediction tool), then it would make sense to evaluate using whichever posterior summary the tool actually uses. If you’re trying to evaluate something that uses the whole posterior (e.g., evaluating utility at each MC draw), then an evaluation that uses the whole posterior would make the most sense. This seems too simple, though, so I must be missing something, but I am interested to hear what others think.
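The two evaluation targets described above can be contrasted in a short sketch. All names here are illustrative; the point is that the utility of a prediction made from a single posterior summary (here the posterior mean of the coefficients) is generally not the same as the distribution of utility evaluated at each MC draw, since utility is usually nonlinear.

```python
import numpy as np

def utility_point_vs_posterior(beta_draws, x, utility):
    """Compare (1) utility of the prediction from the posterior-mean
    coefficients with (2) utility evaluated at each posterior draw.

    beta_draws: (n_draws, p) MC draws of coefficients; x: length-p
    covariate vector; utility: callable mapping a prediction to a
    utility. Returns (point_utility, per_draw_utilities)."""
    beta_draws = np.asarray(beta_draws, float)
    x = np.asarray(x, float)
    # (1) collapse the posterior first, then evaluate utility once
    point_pred = x @ np.mean(beta_draws, axis=0)
    # (2) evaluate utility at every draw, keeping the whole distribution
    per_draw = np.array([utility(x @ b) for b in beta_draws])
    return utility(point_pred), per_draw
```

With a concave utility the two can disagree sharply (a Jensen’s-inequality gap): the utility of the summarized prediction can look fine while the per-draw utilities are uniformly worse, which is exactly why the evaluation should match how the tool will actually be used.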

Bayes wants you to provide whole distributions (as opposed to point estimates). But we need some research on how, and whether, to present such honest summaries to patients/consumers.