External validity of non-parametric baseline hazard

kluijken · February 4, 2021, 5:01pm

I have a question regarding the external validity of semi-parametric models in prediction of time-to-event outcomes based on baseline predictor measurements only (no time-dependent covariates).

In clinical prediction modelling, it is key that a model performs well outside the derivation dataset. To this end, prediction models are validated in data that were not part of the derivation modelling process. For validation of a Cox proportional hazards model, absolute risk predictions would be obtained using the t-year baseline hazard from the derivation dataset, the coefficients estimated in the derivation dataset, and covariate input from the validation dataset.

For example, if I’d want to obtain the 10-year cardiovascular disease prediction for a male in the general population, I could consult a Framingham prediction model, and go to
https://framinghamheartstudy.org/fhs-risk-functions/cardiovascular-disease-10-year-risk/
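The reported 10-year baseline survival (0.88936, a transformation of the cumulative baseline hazard) could be raised to the power of the exponentiated centered linear predictor, obtained from the reported coefficients and some covariate input, and the result subtracted from 1 to obtain a risk prediction.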
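A minimal sketch of that calculation, assuming the reported 0.88936 is used as the 10-year baseline survival probability S0(10), so that risk = 1 - S0(10)^exp(LP - mean LP). The function name, coefficient names, coefficient values, and covariate means below are hypothetical placeholders for illustration, not the published Framingham values.

```python
import math


def cox_risk(baseline_survival, coefficients, covariates, covariate_means):
    """Absolute risk at the horizon for which baseline_survival was reported."""
    # Centered linear predictor: sum of beta * (x - derivation-sample mean of x).
    centered_lp = sum(
        beta * (covariates[name] - covariate_means[name])
        for name, beta in coefficients.items()
    )
    # Risk = 1 - S0(t) ** exp(centered linear predictor).
    return 1.0 - baseline_survival ** math.exp(centered_lp)


S0_10 = 0.88936  # value quoted in the post, taken here as baseline survival at 10 years

# Hypothetical coefficients and derivation-sample means (NOT the Framingham values).
coefficients = {"age_per_10y": 0.5, "smoker": 0.7}
covariate_means = {"age_per_10y": 5.0, "smoker": 0.3}

# A person at the derivation means has a centered linear predictor of 0,
# so their risk is simply 1 - S0(10).
average_risk = cox_risk(S0_10, coefficients, covariate_means, covariate_means)

# A person who is 10 years older than average and smokes.
older_smoker_risk = cox_risk(
    S0_10, coefficients, {"age_per_10y": 6.0, "smoker": 1.0}, covariate_means
)

print(f"risk at covariate means: {average_risk:.4f}")
print(f"risk for older smoker:   {older_smoker_risk:.4f}")
```

Every prediction depends on the derivation-sample quantity S0(10), which is why its transportability to new individuals matters.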

I’m wondering: how valid is the reported baseline hazard at 10 years for new individuals?

Alternatively, the baseline hazard could have been modelled parametrically, which of course entails the risk of misspecification. For prediction modelling, however, I am curious whether parametric baseline hazards could reduce the likelihood of overfitting with respect to derivation data.

Do you have any thoughts on this or suggestions for further reading? Many thanks in advance!

f2harrell · February 5, 2021, 12:26pm

On a side note, external validation is often not as good as rigorous internal validation.

To your question, I don’t think of nonparametric estimators as overfitting more; they just have higher variance making for slightly wider confidence intervals of predictions than you would get from a parametric estimator. And of course the parametric estimator would have to be based on a model that had a decently good fit.
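The variance point can be illustrated with a small simulation. This is only a sketch under stated assumptions: event times are drawn from an exponential distribution, so the parametric model is correctly specified by construction, and the Kaplan-Meier estimator stands in for the non-parametric baseline estimate in a model without covariates. The sample size, rate, censoring time, and function names are all arbitrary choices for illustration.

```python
import math
import random
import statistics


def kaplan_meier_at(times, events, t):
    """Kaplan-Meier estimate of S(t) from right-censored data."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    i = 0
    while i < len(data) and data[i][0] <= t:
        time_i = data[i][0]
        deaths = 0
        removed = 0
        # Handle all subjects sharing this observed time together.
        while i < len(data) and data[i][0] == time_i:
            deaths += data[i][1]
            removed += 1
            i += 1
        surv *= 1.0 - deaths / n_at_risk
        n_at_risk -= removed
    return surv


def exponential_at(times, events, t):
    """S(t) from the exponential MLE: rate = events / total follow-up time."""
    rate = sum(events) / sum(times)
    return math.exp(-rate * t)


def simulate(n, rate, censor_time, rng):
    """Exponential event times with administrative censoring at censor_time."""
    times, events = [], []
    for _ in range(n):
        event_time = rng.expovariate(rate)
        times.append(min(event_time, censor_time))
        events.append(1 if event_time <= censor_time else 0)
    return times, events


rng = random.Random(2021)
true_rate, horizon, censor_time = 0.05, 10.0, 20.0
true_survival = math.exp(-true_rate * horizon)

km_estimates, param_estimates = [], []
for _ in range(1000):
    times, events = simulate(200, true_rate, censor_time, rng)
    km_estimates.append(kaplan_meier_at(times, events, horizon))
    param_estimates.append(exponential_at(times, events, horizon))

sd_km = statistics.stdev(km_estimates)
sd_param = statistics.stdev(param_estimates)
print(f"true S(10): {true_survival:.4f}")
print(f"Kaplan-Meier: mean {statistics.mean(km_estimates):.4f}, SD {sd_km:.4f}")
print(f"Exponential:  mean {statistics.mean(param_estimates):.4f}, SD {sd_param:.4f}")
```

Both estimators should center near the true value here, with the non-parametric one showing a somewhat larger spread. If the data were generated from a distribution the parametric model fits poorly, the parametric estimate could instead be biased.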

The baseline hazard is just as relevant for new individuals as the shape parameters you would estimate from a parametric model.

kluijken · February 5, 2021, 1:11pm

Agreed that a model should first and foremost be internally validated; that is a helpful starting point.

Framing the non-parametric versus parametric comparison as a bias-variance trade-off makes sense to me (if I am paraphrasing you correctly): the parametric estimator could introduce bias relative to the non-parametric estimator, but might have lower variance.
Your note that "the parametric estimator would have to be based on a model that had a decently good fit" is of key importance here.

Thank you for sharing these insights, I greatly appreciate it!