I would like to discuss this article by van der Ploeg, which I came across yesterday while reading through some of Professor Harrell's excellent materials. I have some issues with it and some doubts about it. My intention is not to attack the paper or its authors, but to criticize in order to learn.
Briefly, the authors' aim is "to study the predictive performance of different modelling techniques in relation to the effective sample size ("data hungriness")", and to this end they compare logistic regression, CART, random forest, SVM, and neural networks on 3 artificially generated datasets, comparing results such as AUC, AUC versus the number of events per variable (EPV), and optimism versus EPV.
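For my own clarity, here is a toy illustration (my own, not the authors' code) of what I understand those quantities to be: the apparent AUC on the fitting data, the AUC on new data from the same process, and optimism as the gap between the two; EPV is simply the number of outcome events divided by the number of candidate predictors.

```r
library(pROC)  # for auc()

set.seed(1)
n  <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- rbinom(n, 1, plogis(df$x1 - 0.5 * df$x2))

fit <- glm(y ~ x1 + x2, data = df, family = binomial)

# New data drawn from the same generating process
new <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
new$y <- rbinom(n, 1, plogis(new$x1 - 0.5 * new$x2))

auc_apparent <- as.numeric(auc(df$y,  predict(fit, type = "response")))
auc_new      <- as.numeric(auc(new$y, predict(fit, newdata = new, type = "response")))
optimism     <- auc_apparent - auc_new  # how much the apparent AUC flatters the model

epv <- sum(df$y) / 2  # events per variable: events divided by the 2 candidate predictors
```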
A main concern is that all the models are used with their default parameters, without any tuning. Unless I'm missing something, this seems quite a critical blow to the authors' aim, since the research question effectively becomes: "With their default parameters, how does the predictive performance of different modelling techniques relate to the effective sample size ("data hungriness")?"
In this scenario, the answer depends on two things: 1) the default parameters of the models and their effect on overfitting potential, and 2) the complexity of the models, where simpler models might be expected to perform better at their defaults (although this also depends on point 1). For example, the defaults for SVM in e1071 are a fixed cost C = 1 and gamma = 1/(number of variables) (though the authors change the kernel to polynomial), while randomForest (both in R) defaults to 500 trees with no limit on the maximum number of terminal nodes (i.e., unrestricted tree depth). These are rather important tuning parameters, and failing to tune them could easily lead to overfitting. There is also no consideration of transforming variables to accommodate logistic regression. The authors acknowledge these limitations in their discussion but do not elaborate on how this decision could be fatal to the conclusions of their paper (presuming my points here are valid) …
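To make this concrete, here is a rough sketch (mine, not the authors') of what those defaults look like in R and how one might tune them instead; the data frame `df` and its binary outcome `y` are simulated purely for illustration.

```r
library(e1071)
library(randomForest)

set.seed(1)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
df$y <- factor(rbinom(200, 1, plogis(df$x1 - 0.5 * df$x2)))

# Defaults as (I believe) the paper used them: cost = 1, gamma = 1/(number of
# variables); the authors additionally switched the kernel to polynomial.
fit_svm_default <- svm(y ~ ., data = df, kernel = "polynomial", probability = TRUE)

# randomForest defaults: ntree = 500, maxnodes = NULL (i.e., unrestricted depth)
fit_rf_default <- randomForest(y ~ ., data = df)

# One way to tune rather than accept the defaults: a small grid search with
# 10-fold cross-validation via e1071::tune()
tuned_svm <- tune(svm, y ~ ., data = df,
                  ranges = list(cost = 10^(-1:2), gamma = 10^(-3:0)),
                  tunecontrol = tune.control(cross = 10))
tuned_svm$best.parameters
```

Even a small grid like this changes the effective flexibility of the SVM considerably, which is why I find the untuned comparison hard to interpret.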
Next, the authors "generated artificial cohorts by replicating the HNSCC cohort 20 times, the TBI cohort 10 times and the CHIP cohort 6 times." (HNSCC, TBI and CHIP are medical datasets.) I've recently been interested in learning how to do good simulations, and this approach surprised me. I guess it's something like a bootstrap with replacement, but I have questions about that as well. Namely, does using the same data multiple times not inflate the effects of "artifacts" that might exist in the data? For example, if by chance a subset of data points has a peculiar set of characteristics that affects model outcomes in a certain way, then replicating the data and sampling those subjects many times over could lend unjustified confidence to the results. I would not have thought that simply replicating a dataset X times over is a viable way to run simulation experiments.
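To make my worry concrete, here is a small sketch (my own, with a made-up `cohort` data frame) contrasting the replication scheme as I understand it with an ordinary bootstrap:

```r
set.seed(1)
cohort <- data.frame(x = rnorm(100), y = rbinom(100, 1, 0.3))

# The paper's scheme, as I read it: stack k exact copies of the cohort and then
# draw study samples of varying size from the enlarged data set.
k <- 6
replicated <- cohort[rep(seq_len(nrow(cohort)), times = k), ]

# An ordinary bootstrap: resample n rows with replacement from the original cohort.
boot_sample <- cohort[sample(nrow(cohort), nrow(cohort), replace = TRUE), ]

# In the replicated data every original row, including any peculiar subset, appears
# exactly k times, so its idiosyncrasies are reproduced deterministically rather
# than being averaged over across resamples.
```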
Admittedly, I do not understand their approach with regard to "Reference Models". I've read that section a few times and it still seems unclear; does anyone else share this concern, or is it just me?