I have a question about MULTIVARIABLE PROGNOSTIC MODELS: 1996, in particular the part about estimating the optimism of the C-statistic. Per the algorithm, one should use a bootstrap sample to fit the model, then calculate the difference between the C-statistic for predictions on the bootstrap sample and on the original sample. In this formulation the two datasets share observations, so this is NOT like n-fold cross-validation. My question is: how is that estimate not optimistic? The two samples share observations, so information is shared. Am I missing something?
It is a fair estimate of likely future performance, i.e., it is not biased, because the model developed from a sample drawn with replacement is “super overfitted” to that bootstrap sample, and you are comparing its performance there with its performance on the original sample, where it is only “regular overfitted”. The principle behind the bootstrap in this context is expressed by this equation:
super overfitted - regular overfitted = regular overfitted - no overfitting
The latter difference, on the right-hand side, is what we would otherwise be trying to estimate using an independent test sample.
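To make the bookkeeping concrete, here is a minimal sketch of the optimism bootstrap for the C-statistic, in plain Python. It is an illustration under stated assumptions, not the code from the paper: the simulated data, the function names, and the toy `fit` (a linear score whose weights are the difference in class means, standing in for a real regression model) are all hypothetical. The point is only the structure: fit on the bootstrap sample, score on both the bootstrap sample and the original sample, average the differences, and subtract that average from the apparent C.

```python
import math
import random


def c_statistic(scores, y):
    """Concordance for binary y: share of (event, non-event) pairs in which
    the event has the higher score; ties count one half."""
    pos = [s for s, t in zip(scores, y) if t == 1]
    neg = [s for s, t in zip(scores, y) if t == 0]
    total = 0.0
    for p in pos:
        for q in neg:
            total += 1.0 if p > q else 0.5 if p == q else 0.0
    return total / (len(pos) * len(neg))


def fit(X, y):
    """Hypothetical stand-in for model fitting (NOT a logistic regression):
    weight of each predictor = mean among events - mean among non-events."""
    n1 = sum(y)
    n0 = len(y) - n1
    weights = []
    for j in range(len(X[0])):
        m1 = sum(row[j] for row, t in zip(X, y) if t == 1) / n1
        m0 = sum(row[j] for row, t in zip(X, y) if t == 0) / n0
        weights.append(m1 - m0)
    return weights


def predict(weights, X):
    """Linear score for each row."""
    return [sum(w * x for w, x in zip(weights, row)) for row in X]


def optimism_corrected_c(X, y, n_boot=200, seed=0):
    """Return (apparent C, estimated optimism, optimism-corrected C)."""
    rng = random.Random(seed)
    n = len(y)
    apparent = c_statistic(predict(fit(X, y), X), y)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        yb = [y[i] for i in idx]
        if sum(yb) in (0, n):
            continue  # need both outcome classes to fit and to score
        Xb = [X[i] for i in idx]
        wb = fit(Xb, yb)
        c_boot = c_statistic(predict(wb, Xb), yb)  # "super overfitted"
        c_orig = c_statistic(predict(wb, X), y)    # "regular overfitted"
        diffs.append(c_boot - c_orig)
    optimism = sum(diffs) / len(diffs)
    return apparent, optimism, apparent - optimism


# Simulated data (an assumption for illustration): 60 subjects, 10 predictors,
# only the first of which is related to the outcome.
sim = random.Random(1)
X = [[sim.gauss(0, 1) for _ in range(10)] for _ in range(60)]
y = [1 if sim.random() < 1 / (1 + math.exp(-row[0])) else 0 for row in X]

apparent, optimism, corrected = optimism_corrected_c(X, y)
print(f"apparent C = {apparent:.3f}")
print(f"optimism   = {optimism:.3f}")
print(f"corrected  = {corrected:.3f}")
```

Note that `c_orig` is computed on the original sample even though it overlaps the bootstrap sample. That overlap is intended: the averaged `c_boot - c_orig` is the left-hand side of the equation above, and it is used as an estimate of the right-hand side, which is then subtracted from the apparent C.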