Recommendations for the development of prediction models for small sample studies (rare disease/low outcome)


What are your thoughts on this, and on how to proceed going forward?

What do you recommend, or what is current practice, when developing prediction models for studies with few outcome events or rare diseases?

Can TRIPOD be used as a checklist by reviewers as well, not only by authors? Or is there an updated tool that includes ML aspects?

From the article:

This study aimed to evaluate the disease prognosis using machine learning models with iterated cross validation (CV) method. A total of 122 patients with pathologically confirmed DLBCL and receiving rituximab-containing chemotherapy were enrolled.

To do anything resembling a rigorous scientific process, more data are needed for prediction. Just over 100 observations is not enough even for regression models, which use information efficiently; modern ML methods require even more data.

Quoting Shmueli in a 2010 paper published in Statistical Science:

Finally, predicting new individual observations accurately, in a prospective manner, requires more data than retrospective inference regarding population-level parameters, due to the extra uncertainty.

See also:
Frank Harrell’s posts on ML vs statistical models

This post is especially important:

van der Ploeg, Austin, and Steyerberg (2014), in their article “Modern modelling techniques are data hungry,” estimated that many machine learning algorithms require about 200 events per candidate feature to have a very high chance of rigorous validation, whereas logistic regression requires about 20 events per candidate feature. So it seems that “big data” methods sometimes create the need for big data, while traditional statistical methods may not require such large sample sizes (at least when the dimensionality is not extremely high).
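To make the events-per-variable (EPV) heuristics concrete for a cohort of the size quoted above, here is a minimal sketch of the arithmetic. The event fraction is an assumed example value, not a figure from the DLBCL study, and the 20/200 EPV thresholds are the rough heuristics from van der Ploeg et al., not exact requirements:

```python
# Rough events-per-variable (EPV) arithmetic, following the heuristics
# quoted from van der Ploeg et al. (2014): ~20 EPV for logistic
# regression, ~200 EPV for many ML algorithms.

def max_candidate_features(n_patients, event_fraction, epv):
    """Return (number of events, largest number of candidate
    predictors supported by the given EPV rule)."""
    n_events = int(n_patients * event_fraction)
    return n_events, n_events // epv

n = 122      # cohort size from the quoted abstract
frac = 0.3   # assumed event fraction (hypothetical, for illustration)

events, k_logistic = max_candidate_features(n, frac, epv=20)
_, k_ml = max_candidate_features(n, frac, epv=200)

print(f"{events} events -> ~{k_logistic} candidate predictors "
      f"(logistic), ~{k_ml} (ML)")
# -> 36 events -> ~1 candidate predictors (logistic), ~0 (ML)
```

Under these assumptions, the cohort supports only about one candidate predictor for logistic regression and none for data-hungry ML methods, which is the core of the concern raised above.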