I found this interesting paper by Julian Faraway that describes a more nuanced perspective on the issue of split sample vs resampling, that I offer up for discussion.
From reading the paper, it seems there are more areas of agreement than disagreement with the opinions expressed in this thread, but I'm curious whether the preference for resampling might be modified slightly after reading it.
Faraway, J. J. (2016). Does data splitting improve prediction? Statistics and Computing, 26, 49–60. (link)
From the abstract:
We focus on the problem of constructing reliable predictive distributions for future observed values. We judge the predictive performance using log scoring. We compare the full data strategy with the data splitting strategy for prediction. We show how the full data score can be decomposed into model selection, parameter estimation and data reuse costs. Data splitting is preferred when data reuse costs are high.
Faraway states that the use of resampling methods is preferable when:
1. The set of models to be considered can be fully specified in advance.
2. The set of models is small relative to the amount of data.
He states that split sample methods are preferable when:
- The set of models to be considered cannot be fully specified ahead of seeing the data.
- The set of models is large relative to the amount of data.
- Human judgement is needed for the modelling process.
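To make the contrast between the two strategies concrete, here is a minimal sketch of my own (not from the paper): it compares a split-sample estimate of out-of-sample error with a resampling (leave-one-out cross-validation) estimate for a simple least-squares line fit. The simulated data, the model, and the use of squared error rather than Faraway's log score are all assumptions made purely for illustration.

```python
# Hedged sketch (not Faraway's code): split-sample vs resampling estimates
# of prediction error for a simple linear regression on simulated data.
import random

random.seed(1)
n = 40
xs = [random.uniform(0, 10) for _ in range(n)]
ys = [2.0 * x + random.gauss(0, 1) for x in xs]

def fit_ols(x, y):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def mse(model, x, y):
    """Mean squared prediction error of a fitted (intercept, slope) pair."""
    a, b = model
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / len(x)

# (a) Split-sample: fit on one half, score once on the held-out half.
half = n // 2
split_mse = mse(fit_ols(xs[:half], ys[:half]), xs[half:], ys[half:])

# (b) Resampling: leave-one-out CV reuses all the data for fitting and scoring.
loo_errors = []
for i in range(n):
    xtr = xs[:i] + xs[i + 1:]
    ytr = ys[:i] + ys[i + 1:]
    a, b = fit_ols(xtr, ytr)
    loo_errors.append((ys[i] - (a + b * xs[i])) ** 2)
loo_mse = sum(loo_errors) / n

print(f"split-sample MSE: {split_mse:.3f}")
print(f"LOO-CV MSE:       {loo_mse:.3f}")
```

The point of the sketch is only structural: the split-sample estimate scores each observation once and never reuses it for fitting, while the resampling estimate reuses every observation for both tasks, which is where Faraway's "data reuse cost" enters.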
He elaborates on various modelling alternatives:
A Bayesian approach that assigns priors to models as well as parameters is possible as in [10] but the approach becomes unworkable unless the space of models is small. In many cases, the space of models cannot be reasonably specified before analysing the data. Another idea is to use resampling methods to account for model uncertainty as in [11]. However, this method requires the model selection process to be pre-specified and automated. It also requires that these processes be implementable completely in software which excludes the possibility of human judgement in model selection.
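The requirement that model selection be "implementable completely in software" can be illustrated with a small sketch (mine, not from reference [11]): for resampling to account for selection uncertainty, the entire selection rule must be re-run automatically inside every bootstrap replicate. The data, the candidate models (mean-only vs simple linear), and the AIC-based rule here are all illustrative assumptions.

```python
# Hedged sketch: an automated model-selection rule re-run inside each
# bootstrap resample, as resampling-based accounts of model uncertainty
# require. Candidate models and selection rule are illustrative only.
import math
import random

random.seed(2)
n = 30
xs = [random.uniform(0, 10) for _ in range(n)]
ys = [0.3 * x + random.gauss(0, 1) for x in xs]

def select_model(x, y):
    """Automated rule: choose mean-only vs simple linear fit by AIC."""
    my = sum(y) / len(y)
    sse0 = sum((b - my) ** 2 for b in y)  # mean-only model
    mx = sum(x) / len(x)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    sse1 = sum((b - (my + slope * (a - mx))) ** 2 for a, b in zip(x, y))
    aic0 = len(y) * math.log(sse0 / len(y)) + 2 * 1  # one parameter
    aic1 = len(y) * math.log(sse1 / len(y)) + 2 * 2  # two parameters
    return "mean" if aic0 < aic1 else "linear"

# The whole selection rule runs afresh on every bootstrap resample;
# no human judgement can intervene mid-loop.
picks = {"mean": 0, "linear": 0}
for _ in range(200):
    idx = [random.randrange(n) for _ in range(n)]
    picks[select_model([xs[i] for i in idx], [ys[i] for i in idx])] += 1

print(picks)  # how often each candidate model wins across resamples
```

The loop makes Faraway's objection tangible: any step a human analyst would normally perform by eye (inspecting residuals, choosing transformations) would have to be encoded in `select_model` for this scheme to work at all.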