Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints

I would like to discuss this article by van der Ploeg, which I found while reading through some of Professor Harrell’s great materials yesterday. I have some issues with, and some doubts about, it. I do not intend to attack the paper or the authors here, but to criticize in order to learn.

Briefly, the authors’ aim is “to study the predictive performance of different modelling techniques in relation to the effective sample size (“data hungriness”)”. To this end, they compare logistic regression, CART, random forest, SVM, and neural networks on 3 artificially generated datasets, comparing results such as AUC, AUC versus number of events per variable, and optimism versus events per variable.

My main concern is that all the models are used with their default parameters, without any tuning. Unless I’m missing something, I perceive this to be quite a critical blow to the aim of the authors, since the research question here changes to: “With their default parameters, how does the predictive performance of different modelling techniques relate to the effective sample size (“data hungriness”)?”

In this scenario, the answer depends on two things: 1) the default parameters of the models and the effect these have on overfitting potential, and 2) the complexity of the models, wherein simpler models might be expected to perform better with their default parameters (although this also depends on point 1)). For example, the defaults for SVM from e1071 are a fixed C = 1 and a gamma of 1/(number of variables) (though the authors change the kernel to polynomial), and for RF from randomForest (both in R) the number of trees = 500 and the maximum number of nodes = NULL, meaning trees are grown to their maximum possible depth. These are rather important tuning factors, and failing to tune them could easily lead to overfitting. There is also no consideration of transforming variables to accommodate logistic regression. The authors admit these limitations in their discussion but fail to elaborate on how this decision could be fatal to the conclusions of their paper (presuming my points here are valid) …
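To make the tuning point concrete, here is a minimal, hypothetical sketch in Python/NumPy (the paper used R packages, so this is only an analogy, not their setup): a k-nearest-neighbours classifier stands in for a hyperparameter-sensitive learner, with k = 1 playing the role of an overfitting-prone “default” and a validation split used to tune k. All data, parameters, and the candidate grid are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n, beta):
    """Noisy additive binary outcome - the low signal:noise setting typical of medicine."""
    X = rng.normal(size=(n, len(beta)))
    prob = 1 / (1 + np.exp(-(X @ beta)))
    return X, (rng.random(n) < prob).astype(int)

def knn_scores(Xtr, ytr, Xte, k):
    """Score = fraction of positive outcomes among the k nearest training points."""
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d, axis=1)[:, :k]
    return ytr[nn].mean(axis=1)

def auc(y, s):
    """Mann-Whitney AUC, using midranks so ties are handled correctly."""
    order = np.argsort(s, kind="stable")
    ranks = np.empty(len(s))
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[order[j]] == s[order[i]]:
            j += 1
        ranks[order[i:j]] = (i + j + 1) / 2  # average rank of the tied block
        i = j
    n1 = y.sum()
    n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n0 * n1)

beta = rng.normal(size=5)
Xtr, ytr = make_data(300, beta)
Xval, yval = make_data(300, beta)
Xte, yte = make_data(2000, beta)

auc_default = auc(yte, knn_scores(Xtr, ytr, Xte, k=1))  # overfit-prone "default"
best_k = max([1, 5, 15, 31, 61],
             key=lambda k: auc(yval, knn_scores(Xtr, ytr, Xval, k)))
auc_tuned = auc(yte, knn_scores(Xtr, ytr, Xte, best_k))
print(auc_default, best_k, auc_tuned)
```

The direction of the gap, not the exact numbers, is the point: a defaults-only comparison largely measures how forgiving each method’s defaults happen to be, not the methods themselves.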

Next, the authors “generated artificial cohorts by replicating the HNSCC cohort 20 times, the TBI cohort 10 times and the CHIP cohort 6 times.” (HNSCC, TBI and CHIP are medical datasets). I’ve been interested in learning how to do good simulations recently, and this approach surprised me. I guess it’s like bootstrapping with replacement, but I have questions about that too. Namely, does using the same data multiple times not inflate the effects of “artifacts” that might exist in the data? For example, if by chance a subset of data points contains a peculiar set of characteristics that impacts model outcomes in a certain way, then by replicating the data and sampling these subjects multiple times over, can it not provide unjustified confidence in the results? I would not have thought that simply replicating a dataset X times over would represent a viable way to do simulation experiments.
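To illustrate the worry, here is a small hypothetical sketch in Python/NumPy (the artifact point, cohort size, and replication factor are all invented): plant one high-leverage “artifact” observation in a cohort, replicate the cohort 20 times as the paper does, and refit a slope on subsamples of the original size. The artifact’s bias carries into essentially every subsample, whereas fresh draws from the true data-generating process would not.

```python
import numpy as np

rng = np.random.default_rng(1)

# Original cohort: true slope 1.0, plus one artifactual high-leverage point.
n = 100
x = rng.normal(size=n)
y = x + 0.5 * rng.normal(size=n)
x[0], y[0] = 5.0, -5.0  # the "artifact"

def slope(x, y):
    return np.polyfit(x, y, 1)[0]

true_slope = 1.0
orig_slope = slope(x, y)  # pulled well away from 1.0 by the artifact

# "Simulation" population: the original cohort replicated 20 times.
xr, yr = np.tile(x, 20), np.tile(y, 20)

# Draw many subsamples of the original size and refit.
slopes = []
for _ in range(500):
    idx = rng.choice(len(xr), size=n, replace=False)
    slopes.append(slope(xr[idx], yr[idx]))
mean_sub = np.mean(slopes)
print(orig_slope, mean_sub)
```

The subsample slopes cluster around the artifact-biased estimate rather than the truth, which is the sense in which replication can lend unjustified confidence to features peculiar to the one observed cohort.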

Admittedly, I do not understand their approach with regard to “Reference Models”. I’ve read it a few times but it still seems unclear – does anyone else share this concern, or is it just me?


Great topic for journal club. I hope the authors such as @Ewout_Steyerberg @ESteyerberg join in.

The point about needing to tune machine learning (ML) methods is a good one. However, the fact that a lot of tuning is needed can itself be a drawback. A colleague of mine spent 6 months tuning parameters for a random survival forest after it initially overfitted by a historic amount. The end result was that an intensively-tuned random forest performed as well as a Cox regression model.

Though tuning (not using default parameters) will make a difference, I predict the key result will stand, for a very simple reason. The dominating predictive force in medical applications is that effects of predictors are largely additive. ML is unable to capitalize on that. By allowing equal opportunity for interaction effects as for main effects, ML puts itself at an instant disadvantage over regression models. This is particularly true in low signal:noise situations (which are almost always the situations in medicine).
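This additivity argument can be illustrated with a hypothetical Python/NumPy sketch (not from the paper; the data-generating process and sample sizes are invented): on data generated from a purely additive model with weak signal, a fit that spends parameters on all pairwise interactions loses test-set discrimination relative to a main-effects-only fit of the same family.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_train, n_test = 10, 100, 5000
beta = rng.normal(size=p)

def make_data(n):
    """Purely additive truth, deliberately weak signal (low signal:noise)."""
    X = rng.normal(size=(n, p))
    prob = 1 / (1 + np.exp(-0.5 * (X @ beta)))
    return X, (rng.random(n) < prob).astype(float)

def add_interactions(X):
    """Append all pairwise products: 45 extra columns for p = 10."""
    prods = [(X[:, i] * X[:, j])[:, None] for i in range(p) for j in range(i + 1, p)]
    return np.hstack([X] + prods)

def fit_predict(Xtr, ytr, Xte):
    """Least-squares linear score (a crude stand-in; sufficient for ranking/AUC)."""
    A = np.hstack([np.ones((len(Xtr), 1)), Xtr])
    w, *_ = np.linalg.lstsq(A, ytr, rcond=None)
    return np.hstack([np.ones((len(Xte), 1)), Xte]) @ w

def auc(y, s):
    """Rank-based AUC (scores here are continuous, so ties are negligible)."""
    ranks = np.argsort(np.argsort(s)) + 1
    n1 = int(y.sum())
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / ((len(y) - n1) * n1)

Xtr, ytr = make_data(n_train)
Xte, yte = make_data(n_test)
auc_main = auc(yte, fit_predict(Xtr, ytr, Xte))
auc_int = auc(yte, fit_predict(add_interactions(Xtr), ytr, add_interactions(Xte)))
print(auc_main, auc_int)
```

With 100 training observations, the interaction model’s 56 parameters mostly fit noise; the main-effects model’s 11 parameters match the additive truth, which is the “instant disadvantage” described above.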


Nice, thanks for tagging them Professor Harrell. I’ve benefitted from various pieces of Steyerberg’s work for various projects.

That’s a nice anecdote! Although my reading tells me Cox tends to perform better than RF-survival anyway, so I am not too surprised.

Lots of tuning can indeed be a con, especially when it takes a lot of time, but my point pertains more to the idea that comparing models at their defaults (a mode in which they are not designed to be used) doesn’t really settle the general case, since most of the time models are not used at their defaults, especially when overfitting is a concern.

WRT your final paragraph, is this really the case? Does ML in this scenario not simply allow for nonlinear effects, rather than expecting them?

Also, on what do you base that “dominating predictive force in medical applications is that effects of predictors are largely additive”? I recall reading various medical literature suggesting the opposite, that linear assumptions are often not justified (e.g., plasma metabolites reach maximum levels, disease risk increasing exponentially with linear increases of biomarkers, etc). Maybe (likely) I am misremembering or incorrect, hence my question.

Thanks as always, Professor Harrell.

You misquoted me; I said non-additive not nonlinear. ML doesn’t necessarily expect non-additive effects but it allows for them and allows them to have equal opportunity with additive effects.

Again you are mixing linearity with additivity. The medical literature is indeed rich in providing evidence that linearity almost never holds. It is semi-rich in providing evidence that additivity (avoiding interactions) is a good approximation to reality. So this is where ML has a disadvantage; ML devotes too many parameters to interactions where it should be placing more emphasis on additivity. That’s why ML is seldom needed in medical diagnosis and outcome studies that do not involve dense data such as images and audio. You can easily allow for all predictors to be nonlinear (usually needed) while fitting a mainly additive model with a few pre-specified interactions.
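That last sentence – nonlinear yet additive – is straightforward to do in practice. Here is a hypothetical Python/NumPy sketch using a truncated-power cubic spline basis (a simplified cousin of the restricted cubic splines Harrell advocates; knot locations and the data-generating function are invented): each predictor gets its own flexible transform, but no interactions are fitted.

```python
import numpy as np

rng = np.random.default_rng(7)

def make_data(n):
    """Additive but nonlinear truth: f(x1, x2) = sin(2*x1) + x2**2, no interactions."""
    X = rng.uniform(-2, 2, size=(n, 2))
    y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.3 * rng.normal(size=n)
    return X, y

def cubic_spline_basis(x, knots):
    """Truncated power basis: x, x^2, x^3, plus (x - k)+^3 for each interior knot."""
    cols = [x, x ** 2, x ** 3] + [np.maximum(x - k, 0.0) ** 3 for k in knots]
    return np.column_stack(cols)

def expand(X, knots=(-1.0, 0.0, 1.0)):
    """Expand every predictor separately - flexible per-variable, still additive."""
    return np.hstack([cubic_spline_basis(X[:, j], knots) for j in range(X.shape[1])])

def fit_predict(Xtr, ytr, Xte):
    A = np.hstack([np.ones((len(Xtr), 1)), Xtr])
    w, *_ = np.linalg.lstsq(A, ytr, rcond=None)
    return np.hstack([np.ones((len(Xte), 1)), Xte]) @ w

Xtr, ytr = make_data(400)
Xte, yte = make_data(2000)

mse_linear = np.mean((fit_predict(Xtr, ytr, Xte) - yte) ** 2)
mse_spline = np.mean((fit_predict(expand(Xtr), ytr, expand(Xte)) - yte) ** 2)
print(mse_linear, mse_spline)
```

The spline-expanded additive model recovers the nonlinearities with only a handful of extra parameters per predictor, without ever spending parameters on interactions.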


This reminded me of Norman Matloff’s interesting, yet controversial proposition about neural nets.

> These are rather important tuning factors and failing to tune them could easily lead to overfitting. There is also no consideration for transformation of variables to accommodate logistic regression. The authors admit these limitations in their discussion but fail to elaborate on how this decision could be fatal to the conclusions of their paper (presuming my points here are valid)

While I would not discourage or dismiss empirical comparisons of ML vs. traditional stats, those who are enthusiastic about modern ML methods need to address the criticisms by David Hand in a paper that should be more widely cited. In essence, traditional statistical models capture most of the “low hanging fruit” in terms of information gain. This makes it hard for ML models to compete on a fair playing field if real learning is the outcome metric, especially if they wish to relax assumptions on possible interactions.

If we accept the basic mathematical theorems of statistics, it is hard to see how throwing computer algorithms at a problem can improve upon the outcome (relative to the best statistical method) without massive amounts of data. Even here, statistics has Empirical Bayes methods that Bradley Efron has written about over the past 20 years for precisely the “big data” problem.

> Though the Stat leaders seem to regard all this as something of an existential threat to the well-being of their profession, I view it as much worse than that. The problem is not that CS people are doing Statistics, but rather that they are doing it poorly: Generally the quality of CS work in Stat is weak. It is not a problem of quality of the researchers themselves; indeed, many of them are very highly talented. Instead, there are a number of systemic reasons for this, structural problems with the CS research “business model”

Sometimes I wonder if the more extravagant ML claims contradict a basic theorem of information theory, i.e., the Data Processing Inequality.


Thanks for the explanation Professor

Thanks for the links - will take a look at the papers!