I’ve recently come across this paper (Multi-omic machine learning predictor of breast cancer therapy response | Nature), which discusses the development of a prediction model for multi-omic data. What are your thoughts on the model development and validation aspects? Is this a valid approach to follow, particularly taking the average of three models to form the predictor?
Unless one has tens of thousands of patients, a binary response/no response target variable contains far too little information to yield reliable associations with high-dimensional features.
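A minimal simulation sketch of this point (my own illustration, not from the paper, with arbitrary `n` and `p`): with a few hundred binary outcomes and a thousand pure-noise features, apparent discrimination typically looks excellent while cross-validated discrimination sits near chance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 150, 1000
X = rng.normal(size=(n, p))        # noise features: no real signal
y = rng.binomial(1, 0.5, size=n)   # binary outcome, independent of X

model = LogisticRegression(max_iter=5000)  # default L2 penalty
model.fit(X, y)
apparent = roc_auc_score(y, model.predict_proba(X)[:, 1])
cv = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

# apparent AUC is usually near 1.0 here; CV AUC hovers around 0.5
print(f"apparent AUC: {apparent:.2f}   5-fold CV AUC: {cv:.2f}")
```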
General comment: ensemble methods and model averaging don’t usually offer any performance gain and are much harder to interpret than a flexible single model. Exception: when there is uncertainty about model form, e.g., you want to average Cox models with accelerated failure time models.
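For concreteness, the averaging the original question asks about amounts to something like the sketch below (hypothetical feature blocks and model choices, not the paper’s actual pipeline): one model per data block, then an unweighted mean of the predicted probabilities.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 200
# hypothetical 'omic' blocks (e.g., clinical, expression, mutation features)
X_clin, X_expr, X_mut = (rng.normal(size=(n, k)) for k in (5, 50, 20))
y = rng.binomial(1, 0.5, size=n)   # placeholder outcome for illustration

models = [
    LogisticRegression(max_iter=1000).fit(X_clin, y),
    RandomForestClassifier(n_estimators=200, random_state=0).fit(X_expr, y),
    GradientBoostingClassifier(random_state=0).fit(X_mut, y),
]
blocks = [X_clin, X_expr, X_mut]

# the 'averaged' predictor: unweighted mean of the three probabilities
p_avg = np.mean(
    [m.predict_proba(X)[:, 1] for m, X in zip(models, blocks)], axis=0
)
```

An unweighted average is just stacking with equal fixed weights, so it can only help to the extent that the component models make somewhat uncorrelated errors; and the averaged probabilities are not automatically well calibrated, which connects to the next point.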
I think the main challenge with high-D model development is calibration. You can still get models with good discrimination, but these need to be recalibrated, something the omics community seems to overlook. As I see it, there is a need to develop sample size criteria specifically for predictive modelling with high-D data, at least under a workflow that assumes, e.g., recalibration in a new sample or some form of conformalized prediction. BTW, this was part of my MSc, but then life happened and it got postponed to the PhD.
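A minimal sketch of the logistic recalibration step mentioned above (my own notation, not from the paper): regress the new sample’s outcomes on the logit of the original model’s predictions; an intercept far from 0 or a slope far from 1 signals miscalibration, and the fitted pair gives corrected probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def recalibrate(p_old, y_new):
    """Fit calibration intercept/slope on a new sample; return them
    plus a function producing recalibrated probabilities."""
    p_old = np.clip(p_old, 1e-6, 1 - 1e-6)           # guard the logit
    lp = np.log(p_old / (1 - p_old)).reshape(-1, 1)  # linear predictor
    # large C makes the fit effectively unpenalized
    lr = LogisticRegression(C=1e6, max_iter=1000).fit(lp, y_new)
    a, b = lr.intercept_[0], lr.coef_[0, 0]

    def p_recal(p):
        p = np.clip(p, 1e-6, 1 - 1e-6)
        z = a + b * np.log(p / (1 - p))
        return 1 / (1 + np.exp(-z))

    return a, b, p_recal
```

The point of building the workflow around a step like this is that discrimination can survive the move to a new population while the absolute risk estimates drift, and only the intercept/slope check exposes that.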