Prediction and confounding and multi collinearity

My concept is that in prediction modelling confounding is not a major issue as in explanatory modelling. And multicollinearity is also not major issue unless very highly correlated and may not convege. However I can not find any references to cite when folks ask and opine otherwise. I would like to hear from the experts in this group regarding these two in the context of prediction modelling

In the case of predictions, we care about the absolute deviation of our model output (predictions) from observed values and of course, this extends to new data not included in the modeling process.

Generally, we are less concerned with understanding relationships or interpreting coefficients. For this reason, multicollinearity isn’t a great concern because multicollinearity doesn’t necessarily create worse predictions. In some cases, if the collinearity of the dataset is so different from the true level of collinearity, it is my understanding this could impact predictions; this is not common to my knowledge. Overall, the possibility of multicollinearity to obscure coefficient magnitudes and directions is of little concern when we want predicted values. It’s important to note that multicollinearity does not introduce bias (i.e. does not cause expected value of estimator to differ from the parameter of interest).

Regarding confounding, some similar arguments apply in that the end goal of a pure prediction model is output that matches reality closely. Confounding, as opposed to multicollinearity, can create ommited variable bias meaning the expected value of the estimator(s) can differ from the parameter of interest. However, you could have a model that performs well despite confounding (which may pose more of an interpretive problem for coefficients, which again, are not the primary objective for a prediction model).

It mainly comes down to understanding the purpose of the model and what each of multicollinearity and confounding could do to the model. In examining the objective and potential consequences of these scenarios, it can be shown to what extent they will impact your goal.

Simulations may be helpful to show your peers/audience to illustrate the point in each case.

I’ll follow along for some other input as I don’t have particular references to offer.

1 Like

Confounding is in theory not a problem for pure prediction models, but the issue comes because this is only the case if the data you intend to use the model on in production (note, I say production to emphasize this is different from the test or validation sets used during model development itself before it actually gets put in the clinic or used) comes from the same distribution as that of the training set

The issue is, often times in real life this isn’t the case. For robustness to the issue of data drift, confounding can sometimes be important to address in prediction as well to make sure the model is not learning spurious associations that are sensitive to this problem. Its not going to fully solve the issue, but it helps.

With a given model though confounding is only possible to address fully wrt 1 variable (the exposure). The other variables’ associations may still be confounded because their adjustment sets in a DAG could be different.