Combining bootstrapping and cross-validation for predicting drug sensitivity in Barretina et al., 2012

I am a beginner in this space, just starting to get my feet wet analyzing large datasets such as the Cancer Cell Line Encyclopedia (CCLE) and related resources for predictive markers. Please excuse any errors in the discussion below as the product of a beginner…

So, I was looking at the methods used in Barretina et al., The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012 Mar;483(7391):603-7.

The authors used elastic net regression to identify mutations or gene-expression features that predicted drug response. To do so (please refer to their supplemental methods), they first performed 10-fold cross-validation to identify ‘optimal’ alpha and lambda values. They then ran a bootstrapping procedure in which, using these pre-determined alpha and lambda values, they calculated how frequently each feature of interest was selected across bootstrap instances, the idea being that the ‘most predictive’ features are the ones selected in a high proportion (>80%) of bootstrap instances.
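For concreteness, here is a minimal sketch of that two-step procedure on simulated data, using scikit-learn rather than the glmnet package the authors used (note the naming clash: sklearn’s `alpha` plays the role of glmnet’s lambda, and `l1_ratio` the role of glmnet’s alpha). This is an illustration of the workflow, not the authors’ actual pipeline.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, ElasticNetCV

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.normal(size=(n, p))
# Three truly predictive features among 50; the rest are noise.
y = 2 * X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(size=n)

# Step 1 (as in the paper): 10-fold CV once, on the full data,
# to fix the penalty settings.
cv_fit = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=10,
                      max_iter=5000, random_state=0).fit(X, y)

# Step 2: bootstrap with those settings frozen, counting how often
# each feature receives a nonzero coefficient.
n_boot = 200
counts = np.zeros(p)
for _ in range(n_boot):
    idx = rng.integers(0, n, size=n)          # resample with replacement
    fit = ElasticNet(alpha=cv_fit.alpha_, l1_ratio=cv_fit.l1_ratio_,
                     max_iter=5000).fit(X[idx], y[idx])
    counts += fit.coef_ != 0

freq = counts / n_boot
stable = np.flatnonzero(freq > 0.8)   # 'predictive' features per the >80% rule
```

The key point is that `cv_fit.alpha_` and `cv_fit.l1_ratio_` are estimated once and then treated as fixed inside the bootstrap loop.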

Does this approach make sense?

Wouldn’t a preferable approach be to recompute the alphas and lambdas within each bootstrap, instead of supplying pre-computed values from an initial 10-fold cross-validation? After all, the ‘optimal’ values of these parameters change depending on the resampled dataset being trained/tested in each bootstrap. Moreover, why bother with this procedure at all, as opposed to, for example, 10-fold cross-validation repeated 10 (or 100) times?
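The re-tuned variant is a one-line change in spirit: the 10-fold CV moves inside the bootstrap loop, so the penalties are re-estimated on every resample. A self-contained sketch on toy data (again sklearn naming, where `alpha` corresponds to glmnet’s lambda; an illustration only):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
n, p = 100, 30
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(size=n)

n_boot = 50                     # kept small here; real analyses would use more
counts = np.zeros(p)
for _ in range(n_boot):
    idx = rng.integers(0, n, size=n)
    # Re-run 10-fold CV on *this* resample, so the 'optimal' penalties
    # are re-estimated every time instead of being fixed up front.
    fit = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], n_alphas=20,
                       cv=10, max_iter=5000).fit(X[idx], y[idx])
    counts += fit.coef_ != 0

freq = counts / n_boot
```

One caveat worth knowing: a bootstrap resample contains duplicated observations, so copies of the same data point can land in both the training and held-out folds of the inner CV, which makes the selected penalties somewhat optimistic.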

It seems to me that their approach is a strange combination of bootstrapping grafted onto K-fold cross-validation that just doesn’t ‘feel’ right.


An excellent question. In my view the authors’ mixed approach is not optimal and causes some confusion. Your idea of unifying the steps is the right direction. There is real uncertainty in choosing \alpha and \lambda, and that uncertainty should be taken into account. Admitting that the \alpha, \lambda settings are arbitrary by re-estimating them in each bootstrap sample would be a step forward.

To me an even more rational approach is to think hard about the population of regression coefficient magnitudes and to use that thinking to specify a Bayesian prior on the parameters. Then stick to that prior throughout the rest of the analysis. The horseshoe prior is one that has been successful in high-dimensional feature spaces.
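For readers who haven’t seen it, the horseshoe specifies, for each coefficient \beta_j,

\beta_j \mid \lambda_j, \tau \sim N(0, \lambda_j^2 \tau^2), \qquad \lambda_j \sim \text{C}^{+}(0, 1), \qquad \tau \sim \text{C}^{+}(0, \tau_0),

where \text{C}^{+} denotes the half-Cauchy distribution. The global scale \tau pulls all coefficients toward zero, while the heavy-tailed local scales \lambda_j let genuinely strong signals escape the shrinkage. Fitting the model yields a full posterior for \tau and the \lambda_j rather than a single plugged-in penalty value.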


Thank you so much for your response, Dr Harrell. Will look into your suggestions.

The trouble with a paper like the one above is that numerous subsequent papers then blindly follow the same method, citing the prior literature as justification and baking in a possibly erroneous practice. And in most cases reviewers will not comb through supplemental methods to examine the details of how the bootstrapping was done.

I appreciate that in 2012 less was known generally about methods such as the elastic net, and some of the R packages that make implementation so easy in 2021 may not have been available. Nonetheless, I have lab mates using this same approach from Barretina et al. at present, and am trying to persuade them to ‘update’ these practices. Hopefully your feedback will help.

Appreciate your input; thanks again!


Thank you both for the discussion. I have had the same questions about a project I have been working on.

Is the idea here that the Bayesian model properly incorporates uncertainty by giving you a full posterior distribution for the horseshoe prior’s ‘global’ and ‘local’ variance parameters, whereas the elastic net treats its penalization parameters (lambda/alpha) as known?

In my experience with modest sample sizes (n ≈ 200), I’ve found 10-fold CV selection of lambda to be highly variable, which has always struck me as concerning…
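That variability is easy to reproduce. A small sketch on simulated data with n = 200 (sklearn’s `ElasticNetCV`, whose `alpha_` corresponds to glmnet’s lambda): the data stay fixed, only the random 10-fold split changes, and the selected penalty is recorded each time.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

# Same data every time; only the random 10-fold split changes.
chosen = []
for seed in range(20):
    folds = KFold(n_splits=10, shuffle=True, random_state=seed)
    fit = ElasticNetCV(l1_ratio=0.5, n_alphas=50, cv=folds,
                       max_iter=5000).fit(X, y)
    chosen.append(fit.alpha_)        # sklearn alpha ~ glmnet lambda

spread = max(chosen) / min(chosen)   # how much the selected penalty moves
```

Inspecting `spread` (or a histogram of `chosen`) shows how much the ‘optimal’ penalty can depend on nothing but the fold assignment.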


There is a good paper showing volatility in shrinkage parameter selection in the low-sample-size situation: To tune or not to tune, a case study of ridge logistic regression in small or sparse datasets.

I haven’t used the horseshoe prior myself. In the Bayesian version of ridge regression you pre-specify the variance of the prior, which is equivalent to pre-specifying the shrinkage constant. You specify it on the basis of what is known about the magnitudes of effects rather than trying to solve for it.


Just a clarification to the original post: the authors did in fact perform 10×10 CV. However, the question still remains regarding the strange way they combined bootstrapping and cross-validation.