When developing clinical prediction models for an outcome (e.g. risk of death in a population with a particular disease, as a function of certain exposure variables), should I split the total population in half, using one half for derivation and the other half for validation? Fortunately I have 2 distinct cohorts, so I can satisfy the need for ‘external validation’ (as is often required to publish retrospective observational studies). The model could be derived by combining half of one cohort with half of the other, then validated in the rest. The study population is n=1020, with one main predictor variable and 3-4 covariates. It seems to me that splitting into derivation and validation cohorts wastes data.

Data splitting lowers the sample size for both model development and validation, so it is not recommended unless the sample size is huge (typically > 20,000 subjects). Data splitting is unstable unless N is large, meaning that you’ll get different models and different validations at the whim of the random split. The bootstrap is a much higher-precision form of internal validation (and note that data splitting is itself only internal validation). The bootstrap needs to repeat all analysis steps afresh for each resample, or at least all steps that used the outcome Y.
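To make the contrast concrete, here is a minimal sketch in plain Python, not anyone's official implementation. It assumes simulated data (n=1020, a single predictor, a binary outcome), a hand-rolled one-predictor logistic fit, and the Brier score as the accuracy measure; all of the function names (`simulate`, `fit_logistic`, `brier`, `optimism_corrected_brier`, `split_sample_brier`) are hypothetical and chosen only for illustration. It shows the optimism bootstrap, where the model is refit from scratch in every resample, next to repeated random 50/50 splits, whose validation estimates move around with the split.

```python
import math
import random


def expit(z):
    # Clamp the linear predictor to avoid overflow in exp().
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def simulate(n=1020, seed=0):
    """Simulated cohort (an assumption): one predictor x, binary outcome y."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [1 if rng.random() < expit(-1.0 + 0.8 * xi) else 0 for xi in x]
    return x, y


def fit_logistic(x, y, iters=25):
    """Fit P(y=1) = expit(a + b*x) by Newton-Raphson; returns (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = expit(a + b * xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            break
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-10:
            break
    return a, b


def brier(coef, x, y):
    """Mean squared difference between predicted risk and outcome (lower is better)."""
    a, b = coef
    return sum((yi - expit(a + b * xi)) ** 2 for xi, yi in zip(x, y)) / len(y)


def optimism_corrected_brier(x, y, n_boot=200, seed=1):
    """Return (apparent, optimism-corrected) Brier score via the bootstrap."""
    rng = random.Random(seed)
    n = len(y)
    apparent = brier(fit_logistic(x, y), x, y)
    optimism = 0.0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xb = [x[i] for i in idx]
        yb = [y[i] for i in idx]
        # Repeat every modelling step that used Y inside the resample.
        coef_b = fit_logistic(xb, yb)
        optimism += brier(coef_b, x, y) - brier(coef_b, xb, yb)
    return apparent, apparent + optimism / n_boot


def split_sample_brier(x, y, seed):
    """Fit on a random half, report the Brier score on the other half."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    rng.shuffle(idx)
    half = len(idx) // 2
    train, test = idx[:half], idx[half:]
    coef = fit_logistic([x[i] for i in train], [y[i] for i in train])
    return brier(coef, [x[i] for i in test], [y[i] for i in test])


if __name__ == "__main__":
    x, y = simulate()
    apparent, corrected = optimism_corrected_brier(x, y)
    print("apparent Brier:", round(apparent, 4))
    print("bootstrap-corrected Brier:", round(corrected, 4))
    splits = [split_sample_brier(x, y, seed=s) for s in range(20)]
    print("split-sample Brier range:", round(min(splits), 4), "to", round(max(splits), 4))
```

Running it prints one bootstrap-corrected estimate that uses all 1020 subjects for both fitting and validation, plus the range of estimates across 20 different random splits. In a real analysis with variable selection or other data-driven steps, those steps would also have to go inside the bootstrap loop.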

A recent paper covered inefficiencies of split-sample validation very well. Perhaps someone else can remind us of the reference.

I cover these issues in some detail in my Regression Modeling Strategies course notes and book.

The inefficiencies of data splitting were covered in our joint paper, Frank: JCE Commentary on validation

And I added some more discussion later: JCE paper on inefficiency of random split-sample validation