Hi everyone,
I am working on a project involving a so-called multi-step model and have a question I could use your input on.
For simplicity, assume the multi-step model is a collection of two simple linear regression models, M1 (for step 1) and M2 (for step 2), of the form:
M1: log(Y) = beta_0 + beta_1*log(X) + error
M2: log(Z) = delta_0 + delta_1*log(Y) + error
In other words, Y is both an outcome (in step 1) and a predictor (in step 2). The sample sizes are different for the two models.
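For concreteness, here is a minimal sketch of the setup as I picture it, in Python with statsmodels; all variable names, sample sizes, and coefficient values are made up purely for illustration, and the two datasets are deliberately of different sizes:

```python
# Made-up stand-in data for the two-step setup (all names, sizes, and
# coefficient values here are invented purely for illustration).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Dataset 1 (size n1): observed (X, Y) pairs used to fit M1
n1 = 30
logX = rng.normal(2.0, 0.5, n1)
logY1 = 0.5 + 0.8 * logX + rng.normal(0.0, 0.3, n1)
m1 = sm.OLS(logY1, sm.add_constant(logX)).fit()    # log(Y) ~ log(X)

# Dataset 2 (size n2, a different sample): observed (Y, Z) pairs used to fit M2
n2 = 45
logY2 = rng.normal(2.1, 0.6, n2)
logZ = -0.2 + 1.1 * logY2 + rng.normal(0.0, 0.4, n2)
m2 = sm.OLS(logZ, sm.add_constant(logY2)).fit()    # log(Z) ~ log(Y)

print(m1.params)   # (beta_0_hat, beta_1_hat)
print(m2.params)   # (delta_0_hat, delta_1_hat)
```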
Now, these two models can be used in a whole bunch of ways and the question is how to “validate” them in each case. (The underlying datasets are relatively small, so bootstrap or leave-one-out cross-validation would be the applicable validation tools.)
Case 1 [Mean Estimation]: One predicts the value of X from a different model, M3, and plugs it into the fitted equation for M1 to get the estimated mean value of log(Y) at the predicted X. One then plugs the latter into the fitted equation for M2 to get the estimated mean value of log(Z).
Case 2 [Prediction]: One predicts the value of X from a different model, M3, and plugs it into the fitted equation for M1 to predict a new value of log(Y) at the predicted X. One then plugs the latter into the fitted equation for M2 to predict a new value of log(Z).
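Here is how I picture the two cases in code, continuing from the fitted m1 and m2 above (x_star stands for the value of X coming out of the external model M3 and is just a placeholder number here):

```python
# Continuing from m1, m2, rng above. x_star is a placeholder for the value of X
# predicted by the external model M3.
x_star = 8.0
log_x_star = np.log(x_star)

# Case 1 (Mean Estimation): chain the two fitted conditional means.
mean_logY = m1.params[0] + m1.params[1] * log_x_star
mean_logZ = m2.params[0] + m2.params[1] * mean_logY

# Case 2 (Prediction): additionally propagate the residual noise of each step
# (here by simulation), so the result is a draw of a new individual log(Z).
sigma1 = np.sqrt(m1.scale)    # residual SD of M1
sigma2 = np.sqrt(m2.scale)    # residual SD of M2
new_logY = mean_logY + rng.normal(0.0, sigma1)
new_logZ = m2.params[0] + m2.params[1] * new_logY + rng.normal(0.0, sigma2)
```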
The “question” has several facets to it, such as:
- Should each of M1 and M2 be “validated” on its own?
- Is there a way to set up a “combined” validation that reflects the flow of information from model M1 to model M2, while accounting for the differing sample sizes n1 and n2 of the underlying datasets used to fit the models?
- In some situations, the “final product” (be it a conditional mean value of log(Z) or an individual value of log(Z)) is further compared against an external threshold T, so it would be helpful to derive the probability that the “final product” exceeds T. Either bootstrap or simulation-based inference comes to mind, but how do we combine information across the two models? (See the rough sketch right after this list.)
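For that third bullet, the only thing I have come up with so far is a crude bootstrap along the following lines, continuing from the sketches above and resampling the two datasets independently because of the different sample sizes (T and x_star are placeholders, and this is for the “Prediction” case):

```python
# Crude bootstrap sketch for P(log(Z) > T) in the "Prediction" case.
# The two datasets are resampled independently because they have sizes n1 and n2.
T = 2.5          # placeholder threshold (on the log scale here)
B = 2000
exceed = np.zeros(B, dtype=bool)
for b in range(B):
    i1 = rng.integers(0, n1, n1)     # bootstrap indices for dataset 1
    i2 = rng.integers(0, n2, n2)     # bootstrap indices for dataset 2
    f1 = sm.OLS(logY1[i1], sm.add_constant(logX[i1])).fit()
    f2 = sm.OLS(logZ[i2], sm.add_constant(logY2[i2])).fit()
    ly = f1.params[0] + f1.params[1] * log_x_star + rng.normal(0.0, np.sqrt(f1.scale))
    lz = f2.params[0] + f2.params[1] * ly + rng.normal(0.0, np.sqrt(f2.scale))
    exceed[b] = lz > T
print(exceed.mean())   # rough estimate of P(log(Z) > T)
```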
For the “Prediction” setting, I could look at R², calibration, and discrimination separately for each of the models M1 and M2. (Would this have to be done on both the log scale AND the original scale?) Currently, when people use these types of models, they either “validate” just model M2 or they don’t “validate” either model.
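To make the per-model part concrete, here is the kind of thing I had in mind for M1, with leave-one-out predictions and R² computed on both scales (the naive exp() back-transform below ignores retransformation bias, which is part of why I am unsure about the original-scale version):

```python
# Leave-one-out validation sketch for M1 alone, on both scales.
def loo_predictions(y, x):
    """Leave-one-out predictions from a simple linear regression of y on x."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        fit = sm.OLS(y[keep], sm.add_constant(x[keep])).fit()
        preds[i] = fit.params[0] + fit.params[1] * x[i]
    return preds

def r_squared(obs, pred):
    return 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)

loo_logY = loo_predictions(logY1, logX)
print("M1 R^2, log scale:     ", r_squared(logY1, loo_logY))
# Naive back-transform; ignores retransformation bias (no smearing correction).
print("M1 R^2, original scale:", r_squared(np.exp(logY1), np.exp(loo_logY)))
```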
Also for the “Prediction” setting, when n2 > n1, maybe I could back-predict the values of Y corresponding to the observed Z values using the fitted equation for M2, and then back-predict the values of X corresponding to those back-predicted Y values using the fitted equation for M1. That would give me back-predicted X values at the same sample size n2 as the observed Z values, and I could then use the pairing (observed Z, back-predicted Y, back-predicted X) to compute my “predicted” value:
delta_0_hat + delta_1_hat*[beta_0_hat + beta_1_hat*log(back-predicted X)]
to compare to the “observed” value of Z. Seems a bit “fudgy”?
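In code, the back-prediction idea would look something like the sketch below, under the assumption that “back-predict with the fitted equation” means algebraically inverting it (which may not be the right way to do it; and if I read my own description literally, the chained value then reproduces the observed log(Z) exactly, which is probably part of why it feels fudgy to me):

```python
# Back-prediction sketch for the n2 > n1 case, assuming "back-predict" means
# algebraically inverting the fitted equations (an assumption on my part).
logY_back = (logZ - m2.params[0]) / m2.params[1]        # invert fitted M2
logX_back = (logY_back - m1.params[0]) / m1.params[1]   # invert fitted M1

# Chained "predicted" log(Z) at the back-predicted X, to compare with observed log(Z).
# Note: with a literal algebraic inversion as above, this chains straight back to the
# observed log(Z).
logZ_chained = m2.params[0] + m2.params[1] * (m1.params[0] + m1.params[1] * logX_back)
print(np.allclose(logZ_chained, logZ))   # True
```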
When n2 < n1, I could use forward prediction to get the larger sample size n1 for the fitted values corresponding to the observed X values.
Not sure if the above makes any sense? Also, I’m stumped at the moment as to what would constitute “observed values” in the Mean Estimation setting?
Thanks!
Isabella