I’m learning the tidymodels suite of R packages, and I have a question about splitting and resampling. From the rsample documentation:
> A typical scheme for splitting the data when developing a predictive model is to create an initial split of the data into a training and test set. If resampling is used, it is executed on the training set. A series of binary splits is created. In rsample, we use the term *analysis set* for the data that are used to fit the model, and the *assessment set* is used to compute performance.
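As I understand it, that scheme looks roughly like this in code (a minimal sketch using `mtcars` as a stand-in dataset; I'm assuming rsample's `initial_split()` and `vfold_cv()` defaults):

```r
library(rsample)

set.seed(123)

# Initial binary split: training vs. test
split <- initial_split(mtcars, prop = 0.75)
train <- training(split)
test  <- testing(split)

# Resampling is executed on the training set only; each fold is
# itself a binary split into analysis and assessment sets
folds <- vfold_cv(train, v = 5)

# For a single resample: the analysis set is used to fit the model,
# the assessment set to compute performance
first     <- folds$splits[[1]]
fit_data  <- analysis(first)
eval_data <- assessment(first)
```

So the test set is held out entirely, and every resample partitions only the training rows.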
In this approach, nested resampling makes a lot of sense to me as a way to avoid optimization bias: you average performance across the assessment sets to tune and select the final model.
But with this approach, the final model is evaluated against the original test set, which is just one of all possible training/testing splits. Is the approach shown in the diagram preferable to skipping the initial split and using resampling alone to estimate performance? I think @f2harrell is arguing here for not splitting, but I’m not sure.