Bootstrap vs. cross-validation for model performance

Lynild · December 6, 2019, 11:04am

So I have developed some logistic models, and obviously the next step is to validate them. I have no option for external validation, so internal is the way to go for me. On that note, I have been looking into this article (https://www.ncbi.nlm.nih.gov/pubmed/11470385) by Steyerberg and Harrell, which pretty much talks about internal validation.

So if I am understanding the paper correctly, the most accurate validation is achieved with regular bootstrapping, rather than with the .632 or .632+ methods. In general I like bootstrapping more than CV, although it is often more computationally intensive, but I just wanted to make sure that this was the way to go, or does that differ from case to case?

For example, what I usually hear is that bootstrapping estimates variability, whereas CV is meant to quantify predictive accuracy. But in this paper it seems all things can be achieved through bootstrapping?

Am I missing something here, or…?

f2harrell · December 6, 2019, 1:05pm

10-fold cross-validation repeated 100 times is an excellent competitor of the Efron-Gong optimism bootstrap, and unlike the bootstrap it works even in extreme cases where N < p. Bootstrap and CV have exactly the same goals and recommended uses. For non-extreme cases it is faster to do the bootstrap, and the bootstrap has the advantage of officially validating model building with sample size N instead of $\frac{9}{10}N$. The 0.632 bootstrap was shown to be better only for discontinuous accuracy scoring rules.

Whether using bootstrap or CV, it is imperative that all supervised learning steps be repeated afresh when validating the model. Any analysis that utilized Y including any association-with-Y-based feature selection must be repeated afresh. Internal validation must be rigorous.
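To illustrate why every step that used Y has to be repeated inside each resample, here is a minimal pure-Python sketch on simulated pure-noise data. It is an assumption-laden toy, not anyone's recommended workflow: the function names (`select_and_fit`, `repeated_cv`), the correlation-based selection rule, the sign-weighted score, and the pooling of out-of-fold scores into one c-index are all simplifications chosen for brevity. Because no feature is truly related to the outcome, an honest estimate should land near 0.5, while selecting features once on the full data before cross-validating lets the outcome leak into the "validation".

```python
import random


def corr(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (v - mb) for x, v in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5


def c_index(scores, y):
    """Concordance probability (AUC); tied scores count as 1/2."""
    pos = [s for s, yi in zip(scores, y) if yi == 1]
    neg = [s for s, yi in zip(scores, y) if yi == 0]
    concordant = sum((sp > sn) + 0.5 * (sp == sn) for sp in pos for sn in neg)
    return concordant / (len(pos) * len(neg))


def select_and_fit(X, y, rows, k):
    """Supervised step: keep the k features most correlated with y, using `rows` only."""
    ysub = [y[i] for i in rows]
    cors = [(corr([X[i][j] for i in rows], ysub), j) for j in range(len(X[0]))]
    top = sorted(cors, key=lambda t: -abs(t[0]))[:k]
    return [(j, 1.0 if r > 0 else -1.0) for r, j in top]  # (feature index, sign)


def score(model, x):
    """Toy risk score: signed sum of the selected features."""
    return sum(sign * x[j] for j, sign in model)


def repeated_cv(X, y, k, select_inside, repeats=10, folds=10, seed=0):
    """Repeated k-fold CV; feature selection is redone per fold only if select_inside."""
    rng = random.Random(seed)
    n = len(y)
    rows = list(range(n))
    # Leaky variant: selection has already seen every outcome, including test rows.
    leaked = None if select_inside else select_and_fit(X, y, rows, k)
    estimates = []
    for _ in range(repeats):
        order = rows[:]
        rng.shuffle(order)
        out_of_fold = [0.0] * n
        for f in range(folds):
            test = order[f::folds]
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            model = select_and_fit(X, y, train, k) if select_inside else leaked
            for i in test:
                out_of_fold[i] = score(model, X[i])
        estimates.append(c_index(out_of_fold, y))  # pooled over folds (a simplification)
    return sum(estimates) / repeats


# Simulated data: 60 subjects, 100 noise features, balanced binary outcome.
rng = random.Random(42)
n, p = 60, 100
y = [i % 2 for i in range(n)]
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]

honest = repeated_cv(X, y, k=5, select_inside=True)
leaky = repeated_cv(X, y, k=5, select_inside=False)
print(f"selection repeated in each fold: c = {honest:.2f}")
print(f"selection done once on all data: c = {leaky:.2f}")
```

The same principle applies to the bootstrap: whatever was done with knowledge of Y when building the model would go inside the resampling loop.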

Lynild · December 6, 2019, 8:37pm

Thank you for the answer.

So what you are saying is that it is pretty much up to me/others which technique is used, since they are all fairly good. But the only thing that needs to be emphasized is that all analysis steps must be re-done for all bootstrap re-samples?

But something I don’t quite understand is that when bootstrapping, you create a new bootstrap sample with replacement, develop a new model so to speak, and then use that on the original bootstrap sample (at least that is how I understood it). Doesn’t that introduce some kind of bias, since you are actually developing your model on data that is to some extent also present in the original bootstrap sample? Isn’t that exactly why CV is used, to avoid that? Or does bootstrapping include some kind of “magic” where this actually turns out to have very little effect?

f2harrell · December 7, 2019, 1:06pm

Yes, the bootstrap and the slower 100 repeats of 10-fold cross-validation are equally good, and the latter is better in the extreme (e.g., N < p) case. All analysis steps must be re-done for both the bootstrap and cross-validation (the latter needs up to 1000 analyses, the bootstrap usually 300-400).

You’ve described the bootstrap process correctly. It sounds strange, but the bootstrap provides an excellent estimate of how much overfitting you have, then you subtract that amount. It is based on this philosophy:

- we want to estimate the performance in an infinitely large independent sample
- estimate how much overfitting you have in your sample and subtract it
- bootstrap samples have duplicate observations, which result in super-overfitting
- the bootstrap computes the difference between super-overfitting and regular overfitting, the latter by evaluating the bootstrap model on the original sample
- the difference between super-overfitting and regular overfitting is the same as the real difference we want to estimate: between regular overfitting and no overfitting
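The steps above can be sketched in a few lines of pure Python. This is only a rough illustration under stated assumptions: the data are simulated, the gradient-descent fitter `fit_logistic`, the helper names, and the choice of the c-index as the accuracy measure are mine rather than anything specified in the thread, and B=40 resamples is kept small for speed (the thread suggests 300-400 in practice). The model here has no feature selection; if it did, that step would go inside the loop too.

```python
import math
import random


def fit_logistic(X, y, iters=100, lr=1.0):
    """Logistic regression by plain gradient descent (illustrative, not production)."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)  # w[0] is the intercept
    for _ in range(iters):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            z = max(-30.0, min(30.0, z))  # guard against overflow in exp
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def linear_predictor(w, X):
    return [w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)) for xi in X]


def c_index(scores, y):
    """Concordance probability (AUC); tied scores count as 1/2."""
    pos = [s for s, yi in zip(scores, y) if yi == 1]
    neg = [s for s, yi in zip(scores, y) if yi == 0]
    concordant = sum((sp > sn) + 0.5 * (sp == sn) for sp in pos for sn in neg)
    return concordant / (len(pos) * len(neg))


def optimism_bootstrap(X, y, B=40, seed=1):
    """Return (apparent, estimated optimism, optimism-corrected) c-index."""
    rng = random.Random(seed)
    n = len(y)
    # Apparent performance: model fit and evaluated on the same original sample.
    apparent = c_index(linear_predictor(fit_logistic(X, y), X), y)
    total = 0.0
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        wb = fit_logistic(Xb, yb)  # all modeling steps are repeated here
        on_boot = c_index(linear_predictor(wb, Xb), yb)  # "super-overfitting"
        on_orig = c_index(linear_predictor(wb, X), y)    # bootstrap model, original sample
        total += on_boot - on_orig
    optimism = total / B
    return apparent, optimism, apparent - optimism


# Simulated example: 100 subjects, 4 candidate features, only the first matters.
rng = random.Random(2019)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(100)]
y = [1 if rng.random() < 1 / (1 + math.exp(-1.5 * xi[0])) else 0 for xi in X]

apparent, optimism, corrected = optimism_bootstrap(X, y)
print(f"apparent c = {apparent:.3f}, optimism = {optimism:.3f}, corrected c = {corrected:.3f}")
```

The corrected value is simply the apparent performance minus the average gap between each bootstrap model's performance on its own resample and on the original data.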

sebastian.marciano · January 17, 2020, 2:43am

Hello everybody. Great discussion! However, I’m having trouble finding the commands for bootstrap internal validation in Stata. Any suggestions?

I am quite new in this forum, but I am really enjoying the discussions.

Thanks!

scboone · January 18, 2020, 10:58am

I’m somewhat familiar with Stata but haven’t performed internal validation in it so far. I came across the following post which might be of use: https://www.statalist.org/forums/forum/general-stata-discussion/general/1476201-bootstrap-for-internal-validation-how-can-we-calculate-the-testing-performance-the-performance-of-bootstrapmodel-in-the-original-sample

Stata has quite an active community on its own website/forum and on www.statalist.org! So I’m sure you could find more examples there if you want. When googling for ‘STATA bootstrap internal validation’, quite a few topics similar to the one linked pop up, so I hope together those can be of help to you. The process is maybe a bit more laborious than in R (which I think is most used by the visitors here), but it certainly can be done.

sebastian.marciano · January 19, 2020, 3:04pm

Thanks so much! Very useful suggestions!

Akshay_Kumar1 · April 4, 2022, 2:29am

@f2harrell - What if the difference between the optimism bootstrap and regular cv is around 20-23 points? That is, my regular cv gave 0.45 and the optimism bootstrap gave 0.67, a difference of around 22 points. What does this mean, and how should I interpret it?

f2harrell · April 4, 2022, 11:39am

Which cv algorithm? Unless the sample size is very large you need to do 100 or at least 50 repetitions of 10-fold cv to be precise enough. And what is your sample size, number of events if the outcome is binary or time to event, and number of candidate features that both the bootstrap and cv are dealing with?

Akshay_Kumar1 · April 4, 2022, 1:59pm

@f2harrell, yes, I am doing 100 fold cv. My outcome variable is binary and the class ratio is 77:23. My number of features is 6. Dataset size is 1000. I am doing 500 bootstrap iterations. If the regular cv resulted in 0.45 and the bootstrap resulted in 0.67, what does it mean?

f2harrell · April 4, 2022, 2:30pm

Are you using the Efron-Gong optimism bootstrap (the appropriate one)? Which accuracy metric are you using? And can you confirm the 6 is the number of candidate features, i.e., the number you examined in a way that was unmasked to Y?