I am a beginner in this space, just starting to get my feet wet analyzing large datasets such as the Cancer Cell Line Encyclopedia (CCLE) and related resources for predictive markers. Please excuse any errors in the discussion below as the product of a beginner…
So, I was looking at the methods used in Barretina et al., The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012 Mar;483(7391):603-7.
The authors used elastic net regression to identify mutations or gene expression features that predicted drug response. To do so (please refer to their supplemental methods), they first performed 10-fold cross-validation to identify ‘optimal’ alpha and lambda values. They then ran a bootstrapping procedure in which these pre-determined alpha and lambda values were applied to each bootstrap resample, and they counted how frequently each feature was selected across bootstrap instances, the idea being that the ‘most predictive’ features are those selected in a high proportion of bootstrap instances (>80%). A sketch of my understanding of this is below.
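To check whether I am reading the supplemental methods correctly, here is a minimal sketch of the procedure as I understand it, written with scikit-learn rather than the glmnet package the authors actually used. The function name, the grid of L1 ratios, and the bootstrap count are my own placeholders; also note that scikit-learn’s `l1_ratio` and `alpha` correspond roughly to glmnet’s alpha and lambda, respectively.

```python
# Hypothetical sketch of the CCLE-style procedure as I understand it,
# using scikit-learn as a stand-in for glmnet.
import numpy as np
from sklearn.linear_model import ElasticNet, ElasticNetCV

def ccle_style_feature_frequencies(X, y, n_bootstraps=200, threshold=0.8, seed=0):
    rng = np.random.RandomState(seed)

    # Step 1: a single 10-fold CV to fix the 'optimal' hyperparameters up front.
    cv_model = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8, 1.0], cv=10, max_iter=10000)
    cv_model.fit(X, y)
    l1_ratio, lam = cv_model.l1_ratio_, cv_model.alpha_

    # Step 2: bootstrap with those fixed hyperparameters and count how often
    # each feature receives a non-zero coefficient.
    n_samples, n_features = X.shape
    selected = np.zeros(n_features)
    for _ in range(n_bootstraps):
        idx = rng.choice(n_samples, size=n_samples, replace=True)
        model = ElasticNet(alpha=lam, l1_ratio=l1_ratio, max_iter=10000)
        model.fit(X[idx], y[idx])
        selected += (model.coef_ != 0)

    freq = selected / n_bootstraps
    # Report features selected in at least `threshold` (e.g. 80%) of bootstraps.
    return np.where(freq >= threshold)[0], freq
```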
Does this approach make sense?
Wouldn’t a preferable approach be to recompute the alphas and lambdas within each bootstrap, rather than supplying pre-computed values from an initial 10-fold cross-validation, since the ‘optimal’ values of these parameters change depending on the resampled dataset being trained/tested in each bootstrap? Moreover, why bother with this procedure at all, as opposed to, for example, 10-fold cross-validation repeated 10 (or 100) times? A rough sketch of what I mean follows.
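This is the alternative I have in mind, again with scikit-learn purely as a hypothetical illustration (not the authors’ code): the 10-fold CV is repeated inside every bootstrap instance, so the selection frequencies also reflect uncertainty in the hyperparameters themselves.

```python
# Hypothetical sketch of re-tuning alpha/lambda on every bootstrap resample.
import numpy as np
from sklearn.linear_model import ElasticNetCV

def bootstrap_with_retuning(X, y, n_bootstraps=200, seed=0):
    rng = np.random.RandomState(seed)
    n_samples, n_features = X.shape
    selected = np.zeros(n_features)
    for _ in range(n_bootstraps):
        idx = rng.choice(n_samples, size=n_samples, replace=True)
        # 10-fold CV is re-run inside each bootstrap instance, so the
        # 'optimal' hyperparameters can differ from resample to resample.
        model = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8, 1.0], cv=10, max_iter=10000)
        model.fit(X[idx], y[idx])
        selected += (model.coef_ != 0)
    return selected / n_bootstraps  # per-feature selection frequency
```

The obvious cost is computation (a full CV grid search per bootstrap), which may be why the authors fixed the hyperparameters once, but I don’t know whether that was their reasoning.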
It seems to me that their approach is a strange combination of bootstrapping tacked onto K-fold cross-validation, and it just doesn’t ‘feel’ right to me.