This study aimed to evaluate disease prognosis using machine learning models with an iterated cross-validation (CV) method. A total of 122 patients with pathologically confirmed DLBCL who received rituximab-containing chemotherapy were enrolled.
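To make the "iterated CV" setup concrete, here is a minimal sketch of repeated (iterated) K-fold cross-validation in plain Python. The split sizes use the study's n = 122; the `repeated_kfold_indices` helper and the 5-fold, 10-repeat configuration are illustrative assumptions, not the authors' actual protocol.

```python
# Sketch of iterated (repeated) K-fold cross-validation. The helper name,
# fold count, and repeat count are assumptions for illustration only.
import random

def repeated_kfold_indices(n, k=5, repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs for `repeats` shuffled K-fold splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)  # reshuffle before each repeat
        fold_size = n // k
        for f in range(k):
            test = idx[f * fold_size:(f + 1) * fold_size]
            train = [i for i in idx if i not in set(test)]
            yield train, test

# With 122 patients, 5 folds, and 10 repeats, each test fold holds only
# 122 // 5 = 24 patients, so per-fold performance estimates are noisy.
splits = list(repeated_kfold_indices(122, k=5, repeats=10))
```

Repeating the splits averages out the luck of any single partition, but it cannot manufacture information: every repeat reuses the same 122 patients.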
To do anything resembling a rigorous scientific process, more data are needed for prediction. Just over 100 observations isn't enough even for regression models, which use information efficiently; modern ML methods require even more.
Quoting Shmueli in a 2010 paper published in Statistical Science:
> Finally, predicting new individual observations accurately, in a prospective manner, requires more data than retrospective inference regarding population-level parameters, due to the extra uncertainty
Frank Harrell's posts on ML vs. statistical models are also relevant here. This post is especially important:
van der Ploeg, Austin, and Steyerberg (2014), in their article "Modern modelling techniques are data hungry", estimated that to have a very high chance of rigorous validation, many machine learning algorithms require about 200 events per candidate feature (they found that logistic regression requires about 20 events per candidate feature). So it seems that "big data" methods can themselves create the need for big data, whereas traditional statistical methods may not require such huge sample sizes (at least when the dimensionality is not extremely high).
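The events-per-variable (EPV) rules of thumb above translate into a simple back-of-the-envelope calculation. The function below is illustrative; the EPV values of 20 and 200 are the figures quoted from van der Ploeg et al. (2014), and the choice of 10 candidate features is an arbitrary example.

```python
# Back-of-the-envelope sample-size check using events-per-variable (EPV)
# rules of thumb: ~20 EPV for logistic regression, ~200 EPV for many
# flexible ML algorithms (per van der Ploeg et al. 2014).
def required_events(n_candidate_features, epv):
    """Minimum number of outcome events implied by an EPV rule of thumb."""
    return n_candidate_features * epv

# Example: a model screening 10 candidate features.
logistic_needed = required_events(10, epv=20)   # 200 events
ml_needed = required_events(10, epv=200)        # 2000 events
```

By this yardstick, a cohort of 122 patients (and necessarily fewer than 122 events) falls far short of what flexible ML methods would need for even a modest number of candidate features.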