Prediction with uncertain independent variables

robinblythe · February 5, 2025, 8:45am

I am supervising a student on a research project related to their thesis where the data has an interesting quirk.

They are attempting to derive an equation to predict national spending on a certain disease on a country-by-country basis. However, the independent variables they have collected from different sources have been provided as 3 columns pertaining to the best guess, lower bound, and upper bound. There are some 10 variables in their dataset with this uncertainty.

The student planned to just ignore the upper and lower bounds and use the best guess estimate, but I think this is actually a good opportunity to figure out how to incorporate that uncertainty in the modelling stage.

How should this be handled? Intuitively, I would immediately think about a Bayesian approach, but the student needs to graduate in 6 months and this may be too complex. My suggestion was to treat this like a model stability bootstrap - fit a distribution for each variable, then repeat the process of taking a sample and fitting a model to each sampled variable. We could also try pooling the regressions, similar to how MICE handles multiply imputed data.
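
A minimal sketch of the resample-and-pool idea described above, under stated assumptions: each uncertain variable is treated as normally distributed around its best guess with a known standard deviation, there is only one predictor, and the fits are pooled with Rubin's rules as in multiple imputation. The function names (`ols_slope`, `pool_rubin`, `resample_and_pool`) are hypothetical and exist only for illustration.

```python
import random
from statistics import mean, variance


def ols_slope(x, y):
    """Return (slope, squared standard error) of a simple linear regression."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return slope, rss / (n - 2) / sxx


def pool_rubin(estimates, variances):
    """Pool M estimates with Rubin's rules: returns (estimate, total variance)."""
    m = len(estimates)
    within = mean(variances)        # average within-fit variance
    between = variance(estimates)   # between-fit variance (n - 1 denominator)
    return mean(estimates), within + (1 + 1 / m) * between


def resample_and_pool(best, sd, y, m=200, seed=1):
    """Draw the predictor m times from Normal(best, sd), refit, and pool.

    Assumption: normal uncertainty with known sd for every observation.
    """
    rng = random.Random(seed)
    fits = [
        ols_slope([rng.gauss(b, s) for b, s in zip(best, sd)], y)
        for _ in range(m)
    ]
    estimates, variances = zip(*fits)
    return pool_rubin(list(estimates), list(variances))
```

Note that a procedure like this widens the interval to reflect the extra variability, but it does not by itself correct the bias that measurement error induces in the slope, because the draws ignore the outcome.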

Thanks in advance!

davidcnorrismd · February 5, 2025, 11:35am

Why suppose Bayes is hard? Sounds like something that could be done in JAGS quite straightforwardly. Consider also that a Bayesian model, being so much more coherent, will be super easy to write up!

Also, squandering an opportunity to learn Bayes … before it’s too late! … would be tragic.

f2harrell · February 5, 2025, 12:39pm

The bootstrap may possibly give you a good worst-case estimate, and an imputation method is probably better than some other methods, but at its heart this is a measurement error problem that should probably be modeled using a Bayesian approach. McElreath’s book and videos have an excellent example of an elegant Bayesian solution to a simple measurement error problem. You may get more help on the Stan discourse site.

ehudk · February 5, 2025, 2:49pm

I’m not sure if this answers your question, but there are off-the-shelf solutions like R’s errors-in-variables regression (eivreg), which handles noisy predictors and is frequentist, if you wish to avoid a bespoke Bayesian solution.

Ahmed_Sayed · February 5, 2025, 7:15pm

If you are willing to assume that the uncertainty inherent in the measurements can be reasonably well approximated by a normal distribution, it’s actually reasonably straightforward to do this using the brms package in R. The me() (or mi()) functions allow for measurement error (https://rdrr.io/cran/brms/man/me.html and https://rdrr.io/cran/brms/man/mi.html). It would not require learning Stan, and much of the syntax used in frequentist implementations can be easily carried over. The links contain implementation examples that might be helpful.

The upper and lower bounds in this case (again, assuming a normal approximates the measurement error reasonably well) can be used to deduce the standard deviation that these functions require.
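
As a rough illustration of that deduction, here is a minimal sketch. It assumes the bounds are a symmetric central interval from a normal distribution, and it assumes a 95% level, which the thread does not state; check what level each data source actually reports. The function name `sd_from_bounds` is hypothetical.

```python
from statistics import NormalDist


def sd_from_bounds(lower, upper, level=0.95):
    """Standard deviation implied by a symmetric normal interval.

    Assumption: [lower, upper] is a central interval at the given level
    (95% by default), so its width equals 2 * z * sd.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for level=0.95
    return (upper - lower) / (2 * z)
```

For example, `sd_from_bounds(8.0, 12.0)` treats 8 to 12 as a 95% interval and returns roughly 1.02. If the best guess is far from the midpoint of the bounds, the normal approximation is doubtful and this shortcut should not be trusted.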

js592 · February 5, 2025, 8:58pm

I was in a similar spot a while back and found that measurement error/predictor uncertainty models are very straightforward when working in Stan. See e.g. https://mc-stan.org/docs/stan-users-guide/measurement-error.html#bayesian-measurement-error-model but note that instead of the real tau you would pass a vector of known standard deviations (e.g. vector[N] tau) derived from the upper and lower bounds as data. If your upper and lower bounds are not symmetric you might need something like a skew normal or some other distribution where you can algebraically match the mean and two tail quantiles.
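
To make the structure of such a model concrete, here is a hedged sketch of its log joint density written in plain Python rather than Stan. It assumes a single predictor, normal measurement error with a known per-observation standard deviation `tau[i]`, a normal prior on the latent true values, and flat priors on everything else. All names are hypothetical, and a real analysis would use Stan or brms to sample from this density rather than evaluate it by hand.

```python
import math


def normal_logpdf(value, mean, sd):
    """Log density of Normal(mean, sd) evaluated at value."""
    return (-0.5 * math.log(2 * math.pi) - math.log(sd)
            - 0.5 * ((value - mean) / sd) ** 2)


def log_joint(x_true, alpha, beta, sigma, mu_x, sigma_x, x_meas, tau, y):
    """Log joint density of a simple measurement error regression.

    x_true: latent true predictor values (parameters to be sampled)
    x_meas: reported best guesses (data)
    tau:    known measurement sds, one per observation (data)
    """
    total = 0.0
    for xt, xm, t, yi in zip(x_true, x_meas, tau, y):
        total += normal_logpdf(xt, mu_x, sigma_x)             # prior on true value
        total += normal_logpdf(xm, xt, t)                     # measurement model
        total += normal_logpdf(yi, alpha + beta * xt, sigma)  # outcome model
    return total
```

The key design point is that the regression uses the latent `x_true`, not the reported `x_meas`, so observations with a large `tau` constrain their true value only weakly.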

JiaqiLi · February 6, 2025, 6:43am

Totally agree! I didn’t get a good Bayesian statistics education when I was a student. I’m struggling to learn now. It’s really bad.

robinblythe · February 11, 2025, 12:11pm

Thanks for the suggestions.
@ehudk this is an interesting option. I haven’t actually come across eivreg before, will take a look.

To the points that @davidcnorrismd, @f2harrell, @Ahmed_Sayed, @js592 and @JiaqiLi raised, I would normally agree - in fact I did suggest to both the student and their primary supervisor that a Bayesian approach (using brms in this case) would be best. This was met with uncomfortable silence, followed by a change of subject.

Maybe I’ll put some sample code in the student’s github folder to see if they want to try it.

davidcnorrismd · February 14, 2025, 10:21am