Good ways to verify coverage of CIs for probabilities in variational Bayes binary classification?

In short: a Bayesian binary predictor can provide a posterior on the predicted probability, yielding an interval prediction (e.g. this patient has a 3-22% risk) instead of just a point prediction (this patient has a 4% risk). Can such an interval be grounded and verified against actual data?

Background

Consider binary prediction: x\in \mathbb{R}, y\in \{0,1\}. We would like to predict y from observed x, and have a sense of uncertainty. Plenty has been written on the subject; see e.g. A survey of uncertainty in deep neural networks | Artificial Intelligence Review for a somewhat recent review.

For many reasons, we want an (approximate) Bayesian solution, so that we can put informative priors on things and integrate with other models in a Bayesian way. To this end, a deep learning model produces two outputs: a(x) and b(x). This gives an approximate posterior on the probability, \theta(x) \sim Beta(a(x),b(x)), and finally the predictor is \hat{y} \sim Bernoulli(\theta(x)). We can train the model using the ELBO as in standard variational inference, and we can provide a prior Beta(a_0(x), b_0(x)), giving us some regularisation and interpretation.
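For this Beta/Bernoulli head the per-example ELBO has a closed form; a minimal sketch, assuming a Beta(a_0, b_0) prior and using scipy (all function and parameter names here are illustrative, not from the original model):

```python
# Per-example ELBO for a Beta variational posterior over the Bernoulli probability.
# ELBO = E_q[log p(y | theta)] - KL( Beta(a, b) || Beta(a0, b0) )
import numpy as np
from scipy.special import betaln, digamma

def beta_kl(a, b, a0, b0):
    """Closed-form KL( Beta(a, b) || Beta(a0, b0) )."""
    return (betaln(a0, b0) - betaln(a, b)
            + (a - a0) * digamma(a)
            + (b - b0) * digamma(b)
            + (a0 - a + b0 - b) * digamma(a + b))

def elbo(y, a, b, a0=1.0, b0=1.0):
    """Expected Bernoulli log-likelihood under q(theta) = Beta(a, b), minus KL to the prior.

    Uses E_q[log theta] = digamma(a) - digamma(a+b) and
         E_q[log(1-theta)] = digamma(b) - digamma(a+b).
    """
    e_log_t = digamma(a) - digamma(a + b)
    e_log_1mt = digamma(b) - digamma(a + b)
    exp_loglik = y * e_log_t + (1 - y) * e_log_1mt
    return exp_loglik - beta_kl(a, b, a0, b0)
```

In a deep-learning setting a(x) and b(x) would come from the network (e.g. via a softplus to keep them positive) and this quantity would be averaged over a minibatch and maximized.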

This approach is presented in the review above as a ‘single deterministic method’ and has appeared in the literature, generalized to categorical labels with Dirichlet/Categorical in place of Beta/Bernoulli. Previous work suggests that the distribution of \theta(x) can be used to assess model certainty: if the total concentration a(x)+b(x) is small, there is supposedly more model uncertainty. We can present a 95% credible interval to illustrate this for a model user.
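The 95% credible interval is just two quantiles of the fitted Beta; a quick sketch with made-up concentrations, showing that lower total concentration gives a wider interval at the same mean:

```python
# Equal-tailed credible interval from the fitted Beta posterior over theta(x).
from scipy.stats import beta

def credible_interval(a, b, level=0.95):
    """Equal-tailed credible interval for theta ~ Beta(a, b)."""
    tail = (1.0 - level) / 2.0
    return beta.ppf(tail, a, b), beta.ppf(1.0 - tail, a, b)

# Same posterior mean (0.2), different total concentration:
lo, hi = credible_interval(2.0, 8.0)      # concentration 10 -> wide interval
lo2, hi2 = credible_interval(20.0, 80.0)  # concentration 100 -> narrower interval
```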

The question

If I tell a patient “Your risk lies in the credible interval 3-22%”, how can I validate that this interval has a grounding in actual model performance? Would the patient-specific risk really fall in this interval at some rate? Can I monitor my model for underconfidence/overconfidence? Does the interval correspond to anything that can be empirically tested?

It seems to me there is no way to do this. Is there some literature or other point I have missed?


Reminds me of a famous quote: "Doctors say that Nordberg has a 50/50 chance of living, though there's only a 10 percent..." - The Naked Gun: From the Files of Police Squad!


:joy: An excellent quote.

Sounds like maybe some extension of probability calibration curves may help?

First, I will just say that I believe it’s not possible to test the coverage for a specific patient, because that patient is a single realization from the underlying stochastic process of the world, and you can’t get any statistics about them. So the only way to have multiple realizations of the same patient is if you simulate data and compare your fitted posterior distribution with the simulation distribution.
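The simulation route can be made concrete in a conjugate toy setting: if the true per-"patient" risk is drawn from the prior and the posterior is exact, 95% posterior intervals should cover the truth at a 95% rate across replications. A sketch (all numbers illustrative):

```python
# Simulation-based coverage check in a conjugate Beta-Binomial toy model.
# theta ~ Beta(a0, b0); k | theta ~ Binomial(n, theta); posterior is Beta(a0+k, b0+n-k).
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
a0, b0, n, reps = 2.0, 2.0, 20, 5000

covered = 0
for _ in range(reps):
    theta = rng.beta(a0, b0)                  # true risk for this replication
    k = rng.binomial(n, theta)                # observed outcomes
    lo = beta.ppf(0.025, a0 + k, b0 + n - k)  # exact 95% posterior interval
    hi = beta.ppf(0.975, a0 + k, b0 + n - k)
    covered += (lo <= theta <= hi)

coverage = covered / reps  # should be close to 0.95
```

With an approximate (variational) posterior instead of the exact one, the same experiment would reveal under- or overconfidence as coverage drifting away from 95%.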

But, since I figure you’re looking to evaluate your model on your actual real data, what I think might be helpful is the literature around probability calibration; see van Calster et al., for example.
There are very coarse calibration checks, which I’m assuming you already know how to test for (for example, take a test set / out-of-bag set and check whether the observed event rate matches the average of your predictions).
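That coarse check ("calibration in the large") might look like the following sketch, with simulated stand-ins for real test-set outputs:

```python
# Coarse calibration check: average predicted risk vs observed event rate on a test set.
import numpy as np

rng = np.random.default_rng(1)
a = rng.uniform(1.0, 5.0, size=2000)   # fitted Beta concentrations on a test set (simulated)
b = rng.uniform(1.0, 5.0, size=2000)
p_hat = a / (a + b)                    # posterior mean risk per patient
y = rng.binomial(1, p_hat)             # outcomes, simulated here as perfectly calibrated

gap = y.mean() - p_hat.mean()          # near 0 for a calibrated model
```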
So I believe you might want to focus on finer-detailed calibration curves, but the tricky part is how to extend them from point estimates (means) to credible intervals.
So it won’t give you the coverage for a specific patient, but you can say something like “for all patients whose credible interval was in range x, x’% of them actually had the outcome”, and thereby group similar patients (similarly likely to have the outcome) together.
However, because I can’t come up with a method to group intervals together right now, one approach would be to test the calibration of the edges of the credible interval, somewhat similar to assessing the calibration of quantile regression. If both ends and the center of your interval are properly calibrated, that would be reassuring.
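A heuristic version of the binning-plus-edges idea might look like this sketch (simulated, perfectly calibrated data; it checks observed bin rates against bin-averaged interval edges, not formal per-patient coverage):

```python
# Heuristic: bin patients by posterior mean risk, then check whether each bin's
# observed event rate falls between the bin-averaged credible-interval edges.
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(2)
a = rng.uniform(1.0, 10.0, size=5000)       # simulated fitted concentrations
b = rng.uniform(1.0, 10.0, size=5000)
p_hat = a / (a + b)
y = rng.binomial(1, p_hat)                  # outcomes, simulated as perfectly calibrated
lo = beta.ppf(0.025, a, b)                  # per-patient 95% interval edges
hi = beta.ppf(0.975, a, b)

edges = np.quantile(p_hat, np.linspace(0, 1, 6))  # 5 equal-count bins by predicted risk
idx = np.digitize(p_hat, edges[1:-1])

in_band = []
for j in range(5):
    m = idx == j
    rate = y[m].mean()                      # observed event rate in this bin
    in_band.append(lo[m].mean() <= rate <= hi[m].mean())
```

This groups "similarly risky" patients as suggested above; a bin whose observed rate falls outside its averaged interval edges would flag miscalibration in that risk range.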
