Development of prediction models for multi-omic data: Is this a valid approach to follow?

Hi all,

I’ve recently come across this paper (Multi-omic machine learning predictor of breast cancer therapy response | Nature), which discusses the development of a prediction model for multi-omic data. What are your thoughts on the model development and validation aspects? Is this a valid approach to follow? In particular, taking the average of three models to form the predictor.

This link has the supplementary information:


Unless one has tens of thousands of patients, a binary response/no response target variable contains far too little information to yield reliable associations with high-dimensional features.
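The quoted point is easy to illustrate with a quick simulation (illustrative numbers only, not from the paper): with roughly 150 patients and thousands of candidate features, some pure-noise feature will almost always show a sizeable apparent association with a binary outcome by chance alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 147, 5000            # sample size and feature count (pure noise)
y = rng.integers(0, 2, n)   # binary "response" with no real signal
X = rng.standard_normal((n, p))

# Point-biserial correlation of each noise feature with the outcome
yc = (y - y.mean()) / y.std()
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
corrs = (Xc * yc[:, None]).mean(axis=0)

max_corr = np.abs(corrs).max()
print(f"largest |correlation| among {p} noise features: {max_corr:.2f}")
```

Even though every feature here is noise, the winning feature typically shows a correlation above 0.3 — exactly the kind of spurious "association" that high-dimensional screening on a small binary endpoint produces.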


Yes, indeed. That is the major problem.

  1. Only 147 patient samples in the training set (and the distribution of the outcome across the variables is not reported)
  2. Using CV further reduces the sample size at each model-training step (a real problem in a low-sample-size study like this)
  3. Using random forest/SVM for these data.
    Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints | BMC Medical Research Methodology | Full Text
  4. The proposed ensemble method. Are there any references for this type of model-averaging method?
  5. Only the C-index is reported as a performance measure; no other metrics are given, and no model calibration information is provided.
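To put point 1 in perspective, here is a rough events-per-variable (EPV) calculation. The actual outcome split isn't reported in the thread, so the event rates below are hypothetical:

```python
# Rough events-per-variable (EPV) check for n = 147, under a range of
# hypothetical response rates (the paper's actual outcome split isn't
# given above, so these rates are assumptions for illustration).
n = 147
for event_rate in (0.2, 0.3, 0.5):
    events = min(event_rate, 1 - event_rate) * n   # size of the smaller class
    # A common rule of thumb asks for >= 10-20 events per candidate
    # predictor, so the budget of predictors one can screen is tiny:
    max_predictors = events / 20
    print(f"rate={event_rate:.0%}: ~{events:.0f} events -> "
          f"at most ~{max_predictors:.1f} candidate predictors at 20 EPV")
```

Even in the best case (a 50/50 split), the 20-EPV rule of thumb supports only a handful of candidate predictors — nowhere near the dimensionality of multi-omic feature sets.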

I posted some references to older but still relevant papers, in these threads:

  1. Bayesian Model Averaging:
    RMS Describing, Resampling, Validating, and Simplifying the Model - #63 by R_cubed

  2. Frequentist Model Averaging:
    Should one derive risk difference from the odds ratio? - #447 by R_cubed

General comment: ensemble methods and model averaging don’t usually offer any performance gain over a flexible single model, and they are much harder to interpret. Exception: when there is uncertainty about model form, e.g., you want to average Cox models with accelerated failure time models.
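For concreteness, here is a minimal sketch of the difference between the unweighted probability averaging discussed in the thread and a weighted model-averaging scheme. All numbers (the three models, the patients, the weights) are hypothetical:

```python
import numpy as np

# Hypothetical predicted response probabilities from three separate models
# (e.g., one per data layer) for five patients -- illustrative values only.
p_model1 = np.array([0.10, 0.40, 0.55, 0.80, 0.30])
p_model2 = np.array([0.20, 0.35, 0.70, 0.90, 0.25])
p_model3 = np.array([0.15, 0.50, 0.60, 0.85, 0.40])

# Simple unweighted averaging of probabilities, as asked about above.
p_avg = (p_model1 + p_model2 + p_model3) / 3

# A Bayesian or frequentist model-averaging scheme would instead weight
# each model, e.g., by posterior model probability or out-of-sample
# performance (weights here are made up and sum to 1):
w = np.array([0.5, 0.3, 0.2])
p_wavg = w[0] * p_model1 + w[1] * p_model2 + w[2] * p_model3

print(p_avg.round(3))
print(p_wavg.round(3))
```

The references linked above cover how such weights are derived in a principled way; an unweighted average implicitly assumes the three models deserve equal credence.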

I think the main challenge with high-D model development is calibration. You can still obtain models with good discrimination, but these need to be re-calibrated - something the omics community seems to overlook. As I see it, there is a need to develop sample size criteria specifically for predictive modelling with high-D data, at least under a workflow that assumes, e.g., recalibration in a new sample or some form of conformalized prediction. BTW, this was part of my MSc, but then life happened and it got postponed to the PhD.
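To make the recalibration step concrete, here is a minimal sketch of logistic (intercept/slope) recalibration on a new sample, implemented with Newton-Raphson in plain NumPy. The data are simulated, not from the paper; an overconfident model (logits inflated by a factor of 2.5) gets a recalibration slope well below 1, which pulls its predictions back toward honest probabilities:

```python
import numpy as np

def recalibrate(p_old, y, iters=50):
    """Logistic recalibration: fit intercept a and slope b in
    logit(p_new) = a + b * logit(p_old) on a new sample (Newton-Raphson)."""
    x = np.log(p_old / (1 - p_old))           # logit of original predictions
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-(X @ beta)))    # current fitted probabilities
        W = mu * (1 - mu)                     # IRLS weights
        grad = X.T @ (y - mu)
        H = X.T @ (X * W[:, None])
        beta += np.linalg.solve(H, grad)      # Newton step
    a, b = beta
    return 1 / (1 + np.exp(-(a + b * x))), (a, b)

# Toy check: simulate outcomes from true probabilities, then recalibrate a
# deliberately overconfident model whose logits are inflated 2.5x.
rng = np.random.default_rng(1)
true_p = rng.uniform(0.1, 0.9, 500)
y = (rng.uniform(size=500) < true_p).astype(float)
p_over = 1 / (1 + np.exp(-2.5 * np.log(true_p / (1 - true_p))))
p_new, (a, b) = recalibrate(p_over, y)
print(f"recalibration slope: {b:.2f}")  # expected well below 1
```

The fitted slope should land near 1/2.5 = 0.4, diagnosing and correcting the overconfidence. The discrimination (rank ordering) of the model is unchanged by this transformation — which is exactly why a high-D model with decent discrimination can still be salvaged by recalibration in a new sample.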