Discussion of A General Sample Size Framework

f2harrell · May 23, 2026, 2:28pm

This is a discussion of the 2026-04-29 paper "A general sample size framework for developing or updating a predictive algorithm: with application to clinical prediction models" by RD Riley, R Whittle, M Sadatsafavi, GP Martin, A Pate, GS Collins, & J Ensor in BMC Medical Research Methodology. I look forward to comments being added here by @Gary_Collins and the other authors, as well as by readers of the article.

This is an excellent framework, and the video presentation about it here is extremely clear and helpful. A strength of the framework is the provision of uncertainty distributions for metrics such as calibration slope.

Lasso performed better than I anticipated in one of the examples. I would be interested to see a stability analysis added to all lasso examples.

Here are a few questions to start the discussion.

Why would we trust a reference model? Most models are developed using at least one categorization of continuous or ordinal variables, assuming too much linearity, and not sufficiently considering interactions. On the other hand, if we use machine learning to create a reference model, it will allow interactions to be as strong as main effects and will likely result in a large amount of overfitting. Example: a reference model that dichotomizes age will pretend that a discontinuous effect of age is correct, and will reward a predictive model that also dichotomizes age (at a similar threshold).
Adjusted R^2 gives a nearly unbiased estimate of model performance (and Bayesian model fitting gives an uncertainty interval for it) that corrects for within-sample overfitting, but AIC is needed (with twice the penalty of adjusted R^2) to estimate likely out-of-sample performance. How do we know that the proposed model gives us enough of the “out-of-sample” variability penalty?
Will the new methods give meaningfully different results than simulating new data “like” the data you are going to have in the new study, starting with a large sample size, playing each developed model against the true risks that the data are simulated from, and lowering the sample size to find the breaking point for the proposed model development algorithm?
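
To make the simulation alternative in the last question concrete, here is a minimal sketch in pure Python. Everything specific in it is an assumption for illustration, not something taken from the paper: the "true" logistic model `TRUE_BETA`, independent standard normal predictors, unpenalized logistic regression as the development algorithm (with a tiny ridge term added only as a numerical safeguard), and mean absolute error between predicted and true risk as the performance metric. The helper names (`simulate`, `fit_logistic`, `error_by_sample_size`, `breaking_point`) are hypothetical.

```python
import math
import random

# Hypothetical "true" model: intercept followed by 4 slopes (an assumption)
TRUE_BETA = [-1.0, 0.8, -0.5, 0.3, 0.0]

def expit(z):
    z = max(-30.0, min(30.0, z))  # clip to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def simulate(n, rng):
    """Draw predictors, true risks, and binary outcomes from the assumed model."""
    k = len(TRUE_BETA) - 1
    X = [[1.0] + [rng.gauss(0, 1) for _ in range(k)] for _ in range(n)]
    risk = [expit(sum(b * x for b, x in zip(TRUE_BETA, row))) for row in X]
    y = [1 if rng.random() < r else 0 for r in risk]
    return X, risk, y

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * k
    for c in reversed(range(k)):
        x[c] = (M[c][k] - sum(M[c][j] * x[j] for j in range(c + 1, k))) / M[c][c]
    return x

def fit_logistic(X, y, iters=15, ridge=1e-4):
    """Newton-Raphson logistic fit; the tiny ridge only keeps the Hessian invertible."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [-ridge * b for b in beta]
        hess = [[ridge if i == j else 0.0 for j in range(k)] for i in range(k)]
        for row, yi in zip(X, y):
            p = expit(sum(b * x for b, x in zip(beta, row)))
            w = p * (1 - p)
            for i in range(k):
                grad[i] += (yi - p) * row[i]
                for j in range(k):
                    hess[i][j] += w * row[i] * row[j]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-6:
            break
    return beta

def mean_abs_error(beta, X_pop, risk_pop):
    """Average |predicted risk - true risk| over a large simulated population."""
    preds = (expit(sum(b * x for b, x in zip(beta, row))) for row in X_pop)
    return sum(abs(p - r) for p, r in zip(preds, risk_pop)) / len(risk_pop)

def error_by_sample_size(sizes, reps=10, pop_n=2000, seed=1):
    """Mean error against the true risks for each development sample size."""
    rng = random.Random(seed)
    X_pop, risk_pop, _ = simulate(pop_n, rng)
    out = {}
    for n in sizes:
        errs = []
        for _ in range(reps):
            X, _, y = simulate(n, rng)
            errs.append(mean_abs_error(fit_logistic(X, y), X_pop, risk_pop))
        out[n] = sum(errs) / reps
    return out

def breaking_point(results, tolerance):
    """Smallest n whose mean error is within tolerance, or None if none qualifies."""
    ok = [n for n in sorted(results) if results[n] <= tolerance]
    return ok[0] if ok else None

if __name__ == "__main__":
    results = error_by_sample_size([50, 100, 200, 400])
    for n, err in results.items():
        print(n, round(err, 3))
    print("breaking point:", breaking_point(results, tolerance=0.05))
```

In a real application the data generator would be replaced by something that mimics the planned study (realistic predictor distributions, correlations, and nonlinearities), the fitting function by the actual development algorithm, and the error metric by whichever criteria matter, such as calibration slope. The tolerance of 0.05 is an arbitrary placeholder.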

Trivial note: intervals for calibration slope should be [0.9, \frac{1}{0.9}] not [0.9, 1.1].
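
A short worked check of the reciprocal limit, assuming the intent is that the interval should treat equal multiplicative departures from the ideal slope of 1 symmetrically (that is, symmetry on the log scale):

```latex
\log(0.9) \approx -0.105, \qquad
\log\!\left(\tfrac{1}{0.9}\right) = -\log(0.9) \approx +0.105, \qquad
\tfrac{1}{0.9} \approx 1.111
% whereas the upper limit 1.1 is slightly closer to 1 on the log scale:
\log(1.1) \approx 0.095
```

Under that assumption, [0.9, 1.1] is slightly asymmetric, penalizing slopes above 1 a little more strictly than slopes below 1.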