Bias correction for c-statistics

langsetmo · October 17, 2020, 9:38pm

I have a question about MULTIVARIABLE PROGNOSTIC MODELS: 1996, in particular the part about estimating the optimism of the C-statistic. Per the algorithm, one should fit the model on a bootstrap sample, then calculate the difference between the C-statistic for its predictions on the bootstrap sample and on the original sample. In this formulation the two datasets share observations, so this is NOT like n-fold cross-validation. My question is: how is that estimate not optimistic? The two samples share observations, so information is shared. Am I missing something?
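For readers following along, the procedure described in the question can be sketched roughly as below. This is a minimal illustration, not the paper's code: the data are simulated, the "model" is a deliberately simple stand-in (one weight per predictor, equal to the mean difference between events and non-events), and the names `c_statistic`, `fit`, `predict`, and `optimism_corrected_c` are hypothetical. The final step, subtracting the average optimism from the apparent C-statistic, is the usual way the optimism estimate is put to use.

```python
import random

def c_statistic(scores, y):
    """Concordance: P(score of an event > score of a non-event); ties count 1/2."""
    pos = [s for s, yi in zip(scores, y) if yi == 1]
    neg = [s for s, yi in zip(scores, y) if yi == 0]
    conc = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return conc / (len(pos) * len(neg))

def fit(X, y):
    """Stand-in model (an assumption): weight = mean in events minus mean in non-events."""
    pos = [x for x, yi in zip(X, y) if yi == 1]
    neg = [x for x, yi in zip(X, y) if yi == 0]
    return [sum(x[j] for x in pos) / len(pos) - sum(x[j] for x in neg) / len(neg)
            for j in range(len(X[0]))]

def predict(w, X):
    """Linear score for each row of X."""
    return [sum(wj * xj for wj, xj in zip(w, x)) for x in X]

def optimism_corrected_c(X, y, n_boot=200, seed=0):
    rng = random.Random(seed)
    n = len(y)
    # Apparent performance: model fit on and evaluated on the original sample.
    apparent = c_statistic(predict(fit(X, y), X), y)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]   # sample rows with replacement
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue                                 # need both outcomes to fit and score
        w = fit(Xb, yb)                              # model built on the bootstrap sample
        c_boot = c_statistic(predict(w, Xb), yb)     # evaluated on the bootstrap sample
        c_orig = c_statistic(predict(w, X), y)       # same model on the original sample
        diffs.append(c_boot - c_orig)
    optimism = sum(diffs) / len(diffs)
    return apparent, optimism, apparent - optimism

# Simulated example: 60 subjects, 10 predictors, only the first one carries signal.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(10)] for _ in range(60)]
y = [1 if x[0] + data_rng.gauss(0, 1.5) > 0 else 0 for x in X]

apparent, optimism, corrected = optimism_corrected_c(X, y)
print(f"apparent C = {apparent:.3f}, optimism = {optimism:.3f}, corrected C = {corrected:.3f}")
```

Note that the model fitted on the original sample is used only for the apparent C-statistic; every term in the optimism estimate comes from a model refitted on a bootstrap sample.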

f2harrell · October 18, 2020, 2:33pm

It is a fair estimate of likely future performance, i.e., it is not biased, because the model developed from a sample drawn with replacement is “super overfitted” to that sample, and you are comparing that to its performance on the original sample, where it is only “regular overfitted”. The principle behind the bootstrap in this context is expressed by this equation:

super overfitted - regular overfitted = regular overfitted - no overfitting

The latter difference, the right-hand side, is what we are trying to estimate when we use an independent test sample.
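One way to write the same principle more formally is sketched below. The notation is introduced here for illustration and does not come from the thread: $C(M, D)$ is the C-statistic of model $M$ evaluated on data $D$, $M_b$ is the model fitted on bootstrap sample $D_b$, $M$ is the model fitted on the original sample $D$, $D^{*}$ stands for new data from the same population, and $B$ is the number of bootstrap repetitions.

$$
\underbrace{C(M_b, D_b)}_{\text{super overfitted}} - \underbrace{C(M_b, D)}_{\text{regular overfitted}}
\;\approx\;
\underbrace{C(M, D)}_{\text{regular overfitted}} - \underbrace{C(M, D^{*})}_{\text{no overfitting}}
\quad \text{(on average)}
$$

$$
\widehat{O} = \frac{1}{B} \sum_{b=1}^{B} \bigl[ C(M_b, D_b) - C(M_b, D) \bigr],
\qquad
\widehat{C}_{\text{corrected}} = C(M, D) - \widehat{O}
$$

Read this way, the shared observations are the point of the method: the bootstrap sample stands in for the original sample, and the original sample stands in for the population.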