Choosing variables for multivariable survival prediction

I am an oncologist and I am trying to grasp the concepts of variable selection for prediction of survival. This is a wonderfully educative forum, and I thought I’d ask here.

There is a dataset of approx 600 patients with survival outcomes. We are trying to see how well a set of clinical and pathological features (some continuous and some categorical) will help in prediction of individual risk of poor outcome (death/metastasis etc). There are about 15 such variables.
Traditionally, for statistical analysis, we would first try out each feature for univariable prediction (log-rank) and then multivariable prediction (Cox-PH) and report the variables that are independent predictors.
However, when selecting variables for multivariable prediction of outcome:
a) Is there any benefit in going through the above process and selecting a subset of variables for modelling? If there are variables that are not useful, will they just not have a very small coefficient and have little impact, or will they always contribute to substantial overfitting?
b) Is there a recommended method for selecting variables for survival/censored outcomes?
The main reason I ask is that I feel that as dataset size changes, variables that are ‘not significant’ in multivariable analysis may become ‘significant’ or vice versa. So if the total number of variables is not huge in comparison to the dataset size, why not just leave all of them there in the model?

2 Likes

You should definitely buy and read Ewout W. Steyerberg’s Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. I cannot recommend this book enough. I would also recommend working with a statistician who has experience with clinical prediction models, as these are notoriously difficult to build if your aim is to create anything of value. Blunt, but true. Selection of variables should not be done using statistical methods on the same data you intend to use to develop your prediction model unless your sample size is huge. Consider it "double dipping". The textbook I recommended goes into this in detail, and others have recommended papers that detail this elsewhere on datamethods.org.

One thing that should be eye-opening is to start by calculating the minimum sample size requirement for your model. I think you will find that, under reasonable assumptions, your 600-person sample will likely only support using 3-5 predictors. Here is a good paper on sample size requirements for clinical prediction models.
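To get a feel for the numbers, here is a minimal R sketch of one of the Riley et al. criteria (the one targeting an expected shrinkage factor of at least 0.9); the anticipated Cox-Snell R² and the 20 df are hypothetical inputs you would replace with values from prior literature:

```r
# Riley et al. criterion: smallest n such that the expected shrinkage
# factor S is at least 0.9 for p candidate model parameters (df):
#   n = p / ((S - 1) * ln(1 - R2_cs / S))
min_n_shrinkage <- function(p, R2_cs, S = 0.9) {
  p / ((S - 1) * log(1 - R2_cs / S))
}

# Hypothetical inputs: ~20 df (15 predictors after coding categories
# and splines) and an anticipated Cox-Snell R-squared of 0.10
min_n_shrinkage(p = 20, R2_cs = 0.10)  # roughly 1700 patients
```

With these made-up inputs the requirement is roughly 1,700 patients, which illustrates why a 600-person sample tends to support only a handful of parameters.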

4 Likes

Thank you @Elias_Eythorsson. I bought the book as you suggested and am reading it. I had tried to wrap my head around the sample size paper earlier. I do get most of the concepts, and will try to apply them in practice.
The practical disadvantage we have in India is that experienced statisticians are hard to come by. And there is unfortunately no one within my horizon who is experienced in building a clinical prediction model of any value, as you say. In fact, even in my reading of literature, I come across papers in some of the leading journals which seem to lack robust design and statistical planning.
Unfortunately funding is a perennial issue but I am hoping to connect with people here who may want to come on board as co-authors.

3 Likes

I worked through the sample size calculations for a logistic model here if that helps at all.

4 Likes

And to reiterate an important point: variable selection is a bad idea because the data do not contain sufficient information to tell you which variables to select. Focus on clinically-based model prespecification and use data reduction (unsupervised learning) if the sample size does not allow you to use all the clinically pre-specified variables as single predictors.
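For example, a minimal sketch of outcome-blind data reduction using Hmisc::varclus (predictor names are hypothetical):

```r
library(Hmisc)

# Cluster candidate predictors using only their correlations with
# each other -- the outcome is never consulted, so this is not
# "double dipping". d is a hypothetical data frame of predictors.
vc <- varclus(~ age + tumor_size + grade + nodes + ki67, data = d)
plot(vc)  # dendrogram: keep one representative (or a summary score) per cluster
```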

4 Likes

Thank you @f2harrell and @Elias_Eythorsson for the excellent suggestions and examples.
I have given Steyerberg’s book a readthrough. It is truly an excellent book, and understandable for the non-statistician.
My conclusions for practice based on the book and your suggestions are:

  1. Fully prespecified model is preferred (and much easier from the clinical standpoint)
  2. Select predictors from published literature, and not from the data (to avoid testimation bias)
  3. Assess number of events in the cohort (effective sample size, I suspect approx 170/600 in my cohort)
  4. Calculate the number of predictors to keep EPV > 10, closer to 20, keeping in mind the df from the predictors.
  5. If the candidate predictors are more numerous, consider dimensionality reduction with PCA (or other techniques)
  6. Cox-PH with consideration of penalty (?LASSO/Elastic Net)
  7. Use bootstrapping/cross-validation for internal validation (a sketch follows this list)
  8. Report c-index, Brier score (?calibration ?what else)
  9. Follow up with external validation (temporal/geographical)
  10. Present it with a web-based dashboard, like PREDICT
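To make steps 6-8 concrete, here is a minimal sketch of the rms workflow I have in mind (dataset, variable names and the 3-year horizon are hypothetical; the penalization question is left aside):

```r
library(rms)

# d: hypothetical data frame with follow-up time, event indicator,
# and the prespecified predictors
dd <- datadist(d); options(datadist = "dd")

f <- cph(Surv(futime, event) ~ rcs(age, 4) + grade + stage,
         data = d, x = TRUE, y = TRUE, surv = TRUE, time.inc = 3)

# Bootstrap internal validation: optimism-corrected Dxy
# (c-index = Dxy/2 + 0.5) and related indexes
validate(f, B = 300)

# Optimism-corrected calibration curve at a clinically relevant
# time point (here 3 years, an assumption)
cal <- calibrate(f, B = 300, u = 3)
plot(cal)
```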

I hope I’ve got it more or less correct.
Thanks

3 Likes

Better to report optimism-corrected AUC and a calibration plot.
We can use Prof. Harrell's rms package.
Instead of EPV/EPP criteria, sample size can be calculated more rigorously. We can follow the good articles on that by van Smeden and Riley.

1 Like

Thanks @Dr.ya.dev. We will report the optimism-corrected c-index and calibration.
And I have read the Riley paper for the n-th time, and I can finally claim to have understood most of it :grinning:. I have tried the formula suggested and the sample size looks achievable. I am glad I wrote in this forum what I thought was a dumb question, but it’s been highly educative for me.

4 Likes

First of all, I’d like to say that I am learning so much by following this discussion! Thank you all.

I am working with a dataset containing clinical and blood proteomics features (175), and the goal of the project is to predict the occurrence of acute allograft rejection after kidney transplantation. It’s a small dataset: 100 samples (75 controls, 25 cases).

Based on subject-matter knowledge, I have a pre-defined set of 6 clinical predictors to build a baseline model, so they don’t count as candidate predictors. My goal is to assess the added value of proteomic markers. I will use logistic regression as the prediction model.

In order to identify candidate markers, my plan is to perform a PCA and select the top 2 or 3 markers based on factor loadings of the first principal component. Is this a reasonable approach?

I initially planned on using lasso penalization on both the ‘clinical model’ and ‘clinical+marker model’ to increase the chances of generalizing our model to external data. However, if I follow this plan, wouldn’t I increase the number of candidate predictors, since lasso performs feature selection? Maybe it would be a better idea to use ridge regression?
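To make the distinction concrete, here is a minimal glmnet sketch (x, y and everything else hypothetical): alpha = 1 gives lasso, which zeroes out coefficients (i.e., selects), while alpha = 0 gives ridge, which shrinks all coefficients but keeps them all.

```r
library(glmnet)

# x: matrix of predictors (clinical + markers); y: 0/1 rejection
fit_lasso <- cv.glmnet(x, y, family = "binomial", alpha = 1)
fit_ridge <- cv.glmnet(x, y, family = "binomial", alpha = 0)

coef(fit_lasso, s = "lambda.min")  # many exact zeros: features dropped
coef(fit_ridge, s = "lambda.min")  # small but nonzero: nothing dropped
```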

Perhaps this is a basic question. I am still in the process of understanding all the great information on this site and the RMS book.

Thank you in advance!

Just a few remarks.

  • lasso and the like have a bad track record of identifying the right predictors
  • The PC approach you mentioned may be OK. Sparse PCs may work better, using the first k of them and not dropping any of their component variables (a sketch follows these remarks).
  • I don’t agree with “they don’t count as candidate predictors”
  • How many outcome events are there? Is it 25? That would not be enough for any of these analyses, I fear.
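For illustration, a minimal sparse-PC sketch, assuming the elasticnet package's spca function (the matrix and all settings are hypothetical):

```r
library(elasticnet)

# x: hypothetical 100 x 175 matrix of protein levels
# K = 3 sparse components, each constrained to use at most 10 proteins
sp <- spca(scale(x), K = 3, para = rep(10, 3),
           type = "predictor", sparse = "varnum")
sp$loadings  # sparse loadings; score patients on all K components
```

The point is then to score every patient on all k components and keep all of those scores in the model, rather than cherry-picking individual proteins.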

Thank you for the insights @f2harrell!

Could you please elaborate on why this set of pre-defined clinical features would count as candidate predictors? I thought that selecting them without using my dataset’s outcome vector was enough to make them “not candidate predictors”. I mean, if I don’t use lasso or any other feature selection method.

And yes, there are only 25 events. Unfortunately, at the moment, that’s all the data I have. Perhaps in the near future we can collect more.

The idea of using only 2 or 3 component variables was due to the costs associated with measuring all 175 protein levels using the method my lab is relying on. It wouldn’t be possible to do that for all future blood samples. I agree that using the PCs themselves as predictors would be a better approach.

Okay, there’s much I still need to learn. I just read about Sparse PCA. So, please disregard my insistence on the need to use just 2 or 3 component variables.

I think my remark about ‘candidate variables’ is just semantics. I typically use candidate to refer to variables that are given an opportunity to be in the model. These variables may be variables we plan to keep no matter what using subject matter knowledge or variables that we may be using a dangerous variable selection procedure on. To be more clear maybe we should say forced variables and other candidate variables.

25 events makes it very, very hard to learn anything. This is a key problem with using binary events rather than matters of degree. One thing that will help keep the difficulty of the task front and center is confidence intervals for variable importance measures such as what you can get with rms::rexVar. These will be scary wide but honest. This approach requires inclusion of all candidate variables that you considered in supervised learning.
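A minimal sketch of that, under the assumption that rexVar is fed a bootcov-resampled fit to obtain the intervals (model and variable names are hypothetical):

```r
library(rms)

# Fit containing every variable that was given a chance in
# supervised learning
f <- lrm(rejection ~ complication_d3 + age + pra + hla_mismatch,
         data = d, x = TRUE, y = TRUE)

# Bootstrap the coefficients, then compute relative explained
# variation with uncertainty intervals (expect them to be very wide)
b  <- bootcov(f, B = 300)
rv <- rexVar(b, data = d)
rv
plot(rv)
```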

Oh, now I get it. Thanks @f2harrell. Indeed, if my assumption were correct, we could include as many pre-specified variables as we wanted without spending any of the allowed budget of candidate predictors, which makes no sense.

Regarding the difficulty in learning with only 25 events, there is still hope :joy:. One of the candidate predictors represents the occurrence (or not) of a complication after the transplant procedure, assessed at day 3 post-transplant, that is a known risk for acute rejection. In fact, the blood proteins are also measured at this time point. I will keep you posted on how the experiments go. Thanks again for the help!

Are you sure you want to use post-time-zero predictors? That approach requires a landmark analysis: to qualify for the cohort the patient has to survive 3 days, predictions then start with a new time zero of day 3, and the model cannot be used pre-op.
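In code terms the landmark setup is just a filter plus a clock reset; a sketch with hypothetical names, using a Cox model for concreteness:

```r
library(survival)

# Landmark at day 3: keep only patients still event-free and in
# follow-up at day 3, then restart the clock there
lm3 <- subset(d, futime > 3)   # drop events/censoring before day 3
lm3$futime <- lm3$futime - 3   # new time zero = day 3

# Day-3 information is now a legitimate baseline covariate, but the
# resulting model only applies from day 3 onward
fit <- coxph(Surv(futime, event) ~ complication_d3 + protein_pc1,
             data = lm3)
```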

1 Like

I agree that ideally we would use only predictors available pre-transplant; however, the blood proteins are only measured at day 3 and day 7 post-transplant.

Maybe I could evaluate three models based on:

  1. Only pre-transplant predictors
  2. Pre-transplant + day 3 complication
  3. Pre-transplant + day 3 complication + protein marker selected via unsupervised learning.

What’s your view on this approach?

I am following the recommended evaluation method using bootstrap and checking ROC-AUC, Brier score and calibration.
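Roughly like this in rms, if it helps to be explicit (hypothetical names; in the validate output, B is the Brier score and the c-index equals Dxy/2 + 0.5):

```r
library(rms)

f <- lrm(rejection ~ pre_tx_score + complication_d3 + protein_pc1,
         data = d, x = TRUE, y = TRUE)

v <- validate(f, B = 300)    # optimism-corrected Dxy, R2, B (Brier), ...
cal <- calibrate(f, B = 300) # bootstrap overfitting-corrected calibration
plot(cal)
```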

It is fine to use pre-transplant with post-transplant data, but the clock has to start over and nothing learned will be useful for pre-op prediction. Also, you’ll need to have lots of post-day-3 events.

Great then. Thanks professor!

In fact most of the events occur after day 3. Once this initial modeling project is over, I’ll work on a ‘biomarker trajectory’ exploratory analysis.

Good, but note that the dataset is not adequate for the simplest question (estimating overall incidence), and you are asking questions that require at least 10x the sample size needed to answer the simplest question. I think the dataset may only be adequate for exploring things like variable clustering of biomarkers, and even then that depends on correlation coefficients being estimated well.

Actually, I think I might be falling into a selection bias problem and a badly formulated research question, and I only realized it now.

Here is the full picture:

I have a dataset with information from 223 transplant patients. 102 were never suspected of allograft rejection during the first 6 months after transplantation (no biopsy performed), and 121 were suspected and underwent at least one biopsy. Out of these 121, only 36 had biopsy-proven acute rejection at the moment of first biopsy. The majority of the biopsies occurred in the first 20 days after transplantation (~75%).

Out of all patients, we have the proteins measured for 183 of them: 75 from the ‘never suspected’ group and 25 from the rejection group. The remaining 83 had different conclusions based on the biopsy assessments by pathologists.

Using the ‘never suspected’ group as controls and only the rejection group as events is not ideal, correct? In this case, should I include all other samples that didn’t have biopsy-proven acute rejection in the control group? Sorry for all these basic questions…

Thank you so much for the help and for your time!