Choosing variables for multivariable survival prediction

I am an oncologist and I am trying to grasp the concepts of variable selection for prediction of survival. This is a wonderfully educative forum, and I thought I’d ask here.

There is a dataset of approx 600 patients with survival outcomes. We are trying to see how well a set of clinical and pathological features (some continuous and some categorical) will help in prediction of individual risk of poor outcome (death/metastasis etc). There are about 15 such variables.
Traditionally for statistical analysis, we would first try out each feature for univariable prediction (log rank) and then multivariable prediction (Cox-PH) and report the variables that are independent predictors.
However, when selecting variables for multivariable prediction of outcome:
a) Is there any benefit in going through the above process and selecting a subset of variables for modelling? If some variables are not useful, will they simply get a very small coefficient and have little impact, or will they always contribute to substantial overfitting?
b) is there a recommended method for selecting variables for survival/censored outcomes?
The main reason I ask is that I feel that as dataset size changes, variables that are ‘not significant’ in multivariable analysis may become ‘significant’ or vice versa. So if the total number of variables is not huge in comparison to the dataset size, why not just leave all of them there in the model?


You should definitely buy and read Ewout W. Steyerberg’s Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. I cannot recommend this book enough. I would also recommend working with a statistician who has experience with clinical prediction models, as these are notoriously difficult to build if your aim is to create anything of value. Blunt, but true. Selection of variables should not be done using statistical methods on the same data you intend to use to develop your prediction model unless your sample size is huge. Consider it "double dipping". The textbook I recommended goes into this in detail, and others have recommended papers that detail this elsewhere on

One thing that should be eye-opening is to start by calculating the minimum sample size requirement for your model. I think you will find that, under reasonable assumptions, your 600-person sample will likely only support 3-5 predictors. Here is a good paper on sample size requirements for clinical prediction models.


Thank you @Elias_Eythorsson. Bought the book as you suggested and reading. I have tried to wrap my head around the sample size paper earlier. I do get most of the concepts, and will try to apply them in practice.
The practical disadvantage we have in India is that experienced statisticians are hard to come by. And there is unfortunately no one within my horizon who is experienced in building a clinical prediction model of any value, as you say. In fact, even in my reading of literature, I come across papers in some of the leading journals which seem to lack robust design and statistical planning.
Unfortunately funding is a perennial issue but I am hoping to connect with people here who may want to come on board as co-authors.


I worked through the sample size calculations for a logistic model here if that helps at all.


And to reiterate an important point: variable selection is a bad idea because the data do not contain sufficient information to tell you which variables to select. Focus on clinically-based model prespecification and use data reduction (unsupervised learning) if the sample size does not allow you to use all the clinically pre-specified variables as single predictors.


Thank you @f2harrell and @Elias_Eythorsson for the excellent suggestions and examples.
I have given Steyerberg’s book a readthrough. It is truly an excellent book, and understandable for the non-statistician.
My conclusions for practice based on the book and your suggestions are:

  1. Fully prespecified model is preferred (and much easier from the clinical standpoint)
  2. Select predictors from published literature, and not from the data (to avoid testimation bias)
  3. Assess number of events in the cohort (effective sample size, I suspect approx 170/600 in my cohort)
  4. Calculate number of predictors to keep EPV > 10, closer to 20, keeping in mind the df from the predictors.
  5. If the candidate predictors are more numerous, consider dimensionality reduction with PCA (or other techniques)
  6. Cox-PH with consideration of penalty (?LASSO/Elastic Net)
  7. Use bootstrapping/cross-validation for internal validation
  8. Report c-index, Brier score (?calibration ?what else)
  9. Follow up with external validation (temporal/geographical)
  10. Present with a web-based dashboard along the lines of PREDICT
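As a rough illustration of steps 3 and 4, the events-per-variable budget is simple arithmetic. This is only a sketch: the 170 events is the estimate quoted in step 3, and the predictor list (names, types, number of levels) is entirely hypothetical, included just to show how degrees of freedom add up.

```python
def max_parameters(n_events, events_per_parameter):
    """Largest number of model parameters (df) the event count supports."""
    return n_events // events_per_parameter


def total_df(predictors):
    """Sum the degrees of freedom of a predictor list.

    A continuous predictor modelled linearly uses 1 df; a categorical
    predictor with k levels uses k - 1 df.
    """
    return sum(1 if levels is None else levels - 1 for levels in predictors.values())


n_events = 170  # approximate figure from step 3, not a verified count

for epv in (10, 20):
    print(f"EPV {epv}: up to {max_parameters(n_events, epv)} parameters")

# Hypothetical predictors: None = continuous (linear), integer = number of levels.
candidate_predictors = {"age": None, "tumour_size": None, "grade": 3, "stage": 4}
print("df used by the hypothetical list:", total_df(candidate_predictors))
```

Note that four predictors already use seven degrees of freedom here, and nonlinear terms for continuous predictors would use more.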

I hope I’ve got it more or less correct.


It is better to report the optimism-corrected AUC and a calibration plot.
We can use Prof. Harrell’s rms package.
Instead of EPV/EPP criteria, the sample size can be calculated more rigorously. We can follow the good articles on this by van Smeden and Riley.
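For readers who want to see what the discrimination measure is actually computing, here is a minimal pure-Python sketch of Harrell's c-index for censored data. It is not the rms implementation; it skips pairs with tied times as a simplification, and the toy data are made up. The optimism correction itself would additionally require refitting the model in each bootstrap sample, which in R is what the `validate` function in rms does.

```python
from itertools import combinations


def harrell_c(times, events, risks):
    """Harrell's concordance index.

    times  : follow-up times
    events : 1 if the event was observed, 0 if censored
    risks  : predicted risk scores (higher = worse prognosis)

    A pair is comparable when the subject with the shorter time had an
    observed event. Pairs with tied times are skipped (simplification).
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # a is the subject with the shorter follow-up time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Made-up toy data: four patients, the third one censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.1, 0.5, 0.7]
print(harrell_c(toy_times, toy_events, toy_risks))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of risks against observed survival times.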

Thanks. We will report optimism-corrected c-index and calibration.
And I have read the Riley paper for the n-th time, and I can finally claim to have understood most of it :grinning:. I have tried the formula suggested and the sample size looks achievable. I am glad I wrote in this forum what I thought was a dumb question, but it’s been highly educative for me.
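For anyone following along, here is a minimal sketch of one of the sample size criteria, as I understand it from the Riley paper: the one that targets an expected shrinkage factor S (commonly 0.9), with n = p / ((S - 1) ln(1 - R²_CS / S)), where p is the number of candidate parameters and R²_CS is the anticipated Cox-Snell R². The paper describes further criteria that should also be checked, and the R²_CS values below are hypothetical placeholders, not estimates for this cohort.

```python
import math


def required_n(n_parameters, r2_cs, shrinkage=0.9):
    """Sample size needed for a target shrinkage factor (one criterion only)."""
    if not 0 < r2_cs < shrinkage:
        raise ValueError("r2_cs must lie between 0 and the shrinkage factor")
    return n_parameters / ((shrinkage - 1) * math.log(1 - r2_cs / shrinkage))


def supported_parameters(n, r2_cs, shrinkage=0.9):
    """Invert the formula: how many parameters a sample of size n supports."""
    if not 0 < r2_cs < shrinkage:
        raise ValueError("r2_cs must lie between 0 and the shrinkage factor")
    return math.floor(n * (shrinkage - 1) * math.log(1 - r2_cs / shrinkage))


n_patients = 600  # cohort size mentioned at the top of the thread

# Hypothetical anticipated Cox-Snell R^2 values, for illustration only.
for r2 in (0.05, 0.10):
    print(f"R2_CS = {r2}: about {supported_parameters(n_patients, r2)} parameters")
```

The answer is very sensitive to the anticipated R²_CS, so it is worth taking that value from a previously published model in a similar population rather than guessing.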