I am an oncologist trying to grasp the concepts of variable selection for predicting survival. This is a wonderfully educational forum, so I thought I’d ask here.

We have a dataset of approximately 600 patients with survival outcomes. We want to see how well a set of about 15 clinical and pathological features (some continuous, some categorical) predicts an individual patient's risk of a poor outcome (death, metastasis, etc.).

Traditionally, we would first test each feature on its own in a univariable analysis (log-rank test), then fit a multivariable model (Cox proportional hazards), and report the variables that remain independent predictors.
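
To make the univariable step concrete, here is a minimal sketch of a two-group log-rank test as I understand it, written in plain Python. The function name `logrank_test` and the toy follow-up times are made up for illustration and are not from our dataset. In practice one would use an established survival package rather than this hand-rolled version.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. events are 1 (event observed) or 0 (censored).

    Returns (chi2, p_value) on 1 degree of freedom.
    """
    # Pool both groups as (time, event, group) records; group 0 = A, 1 = B.
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]

    obs_minus_exp = 0.0  # sum over event times of (observed - expected) events in A
    variance = 0.0       # hypergeometric variance of that sum

    for t in sorted({s for s, e, _ in data if e}):
        n = sum(1 for s, _, _ in data if s >= t)                 # at risk, both groups
        n_a = sum(1 for s, _, g in data if s >= t and g == 0)    # at risk, group A
        d = sum(1 for s, e, _ in data if s == t and e)           # events at t, both groups
        d_a = sum(1 for s, e, g in data if s == t and e and g == 0)
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0:
        raise ValueError("log-rank variance is zero; groups cannot be compared")

    chi2 = obs_minus_exp ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p_value

# Hypothetical toy data: months of follow-up and event indicators per group.
chi2, p = logrank_test([5, 8, 12, 20], [1, 1, 0, 1], [10, 15, 22, 30], [1, 0, 1, 0])
print(f"chi2={chi2:.2f}, p={p:.3f}")
```

Repeating this for each of the 15 features and carrying forward only those below some p-value threshold is the screening step I am asking about.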

However, when selecting variables for multivariable prediction of outcome:

a) Is there any benefit in going through the above process and selecting a subset of variables for modelling? If some variables are not useful, will they simply receive a very small coefficient and have little impact, or will they always contribute substantially to overfitting?

b) Is there a recommended method for selecting variables for survival/censored outcomes?

The main reason I ask is that I feel that as dataset size changes, variables that are ‘not significant’ in multivariable analysis may become ‘significant’ or vice versa. So if the total number of variables is not huge in comparison to the dataset size, why not just leave all of them there in the model?
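
A rough sketch of why I suspect this, assuming the usual large-sample (Schoenfeld-type) approximation that, for a binary covariate with prevalence p, the standard error of the estimated log hazard ratio is about 1 / sqrt(d × p × (1 − p)), where d is the number of events. The function name `approx_wald_z`, the hazard ratio of 1.3 and the 50% prevalence are illustrative assumptions, not values from our data.

```python
import math

def approx_wald_z(hazard_ratio, n_events, prevalence=0.5):
    """Approximate Wald z-statistic for a binary covariate in a Cox model.

    Assumes SE(log HR) ~ 1 / sqrt(n_events * prevalence * (1 - prevalence)).
    """
    se = 1.0 / math.sqrt(n_events * prevalence * (1.0 - prevalence))
    return abs(math.log(hazard_ratio)) / se

# Same underlying effect, different numbers of observed events.
for n_events in (50, 100, 200, 400):
    z = approx_wald_z(1.3, n_events)
    p = math.erfc(z / math.sqrt(2))  # two-sided p-value from the normal approximation
    print(f"events={n_events:4d}  z={z:.2f}  p={p:.3f}")
```

Under this approximation z grows with the square root of the number of events, so the same modest effect falls below the 1.96 threshold with 100 events and above it with 400. That is what makes me doubt that 'significance' is a stable basis for choosing variables.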