Model specification/variable selection for risk prediction model

Hi, I am new to this forum and looking for some advice on my analysis plans.

I would like to identify clinical features, or combinations of clinical features, that predict a high risk of a colorectal cancer diagnosis. Clinical features will be identified from primary care data and include recorded signs, symptoms and blood test results. It is likely that there will be a high number of missing values for clinical measurements. I have identified approximately 25 plausible candidate variables which could be included in the model. I am specifically interested in assessing whether the risk profiles differ by cancer subsite. I plan to do a case-control study and fit a logistic regression model. Cases will be patients diagnosed with cancer and controls will be patients with no diagnosis of cancer. I would analyse predictors of cancer overall and then by subsite.

Previous studies focusing on clinical features as predictors of cancer have used stepwise techniques (bivariate screening, forward/backward selection); however, I want to avoid these given their known flaws.

Based on prior knowledge, all of the candidate variables could plausibly be important predictors, particularly in combination. However, including all of them is likely to produce an overfitted model, because the number of events is low relative to the number of predictors.

I haven’t used LASSO regression before, but from my initial reading I think it may be a suitable alternative to stepwise selection. Are there any other suggestions?

I also anticipate that problems may arise due to multicollinearity. Do you have any suggestions on how to deal with this?

Many thanks


Thanks for joining the site, Ruth.

Case-control studies have an inherent problem that you lose the “base rate”, i.e., model intercept, so it is not possible to predict absolute risk but only relative disease odds, absent an external calibration sample of some sort. There is also a significant problem in choosing appropriate controls.
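
To make the lost-intercept point concrete, here is a minimal sketch of the usual offset correction for an unmatched case-control sample when an external estimate of the population prevalence is available. The function names and all numbers are hypothetical, and the correction assumes cases and controls were each sampled at random from their populations; it does not apply as-is to a matched design.

```python
import math

def corrected_intercept(b0_sample, n_cases, n_controls, prevalence):
    """Shift a case-control logistic intercept to an assumed population prevalence.

    Assumes unmatched sampling of cases and controls; slopes are left unchanged.
    """
    sample_odds = n_cases / n_controls
    population_odds = prevalence / (1.0 - prevalence)
    return b0_sample - math.log(sample_odds / population_odds)

def predicted_risk(intercept, linear_predictor):
    """Absolute risk from an intercept and the remaining linear predictor."""
    return 1.0 / (1.0 + math.exp(-(intercept + linear_predictor)))

# Hypothetical values: 1 case per 4 controls, external prevalence of 0.5%.
b0 = corrected_intercept(b0_sample=-0.2, n_cases=1000, n_controls=4000, prevalence=0.005)
print(predicted_risk(b0, linear_predictor=1.0))
```

Without the external prevalence there is nothing to plug into the correction, which is why only relative odds are available from the case-control data alone.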

Turning to the analysis, there is no method of feature selection, including lasso, that has a high probability of finding the truly “right” predictors. The best you can do is to get bootstrap confidence intervals or Bayesian credible intervals on the importance rankings of predictors (say in terms of explained variation in Y). The feature selection task is difficult, so these confidence intervals will be disappointingly wide, especially when collinearity is present.
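
As a rough illustration of bootstrap intervals on importance rankings, the sketch below refits a logistic model on resampled data and records each predictor's rank. Everything here is an assumption made for illustration: the data are simulated, the importance measure is the 1 d.f. Wald chi-square of each coefficient (one of several possible choices), and the helper names are made up. With matched data the resampling would need to respect the matched sets.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25, ridge=1e-6):
    """Newton-Raphson logistic fit; returns coefficients and approximate covariance."""
    Xd = np.column_stack([np.ones(len(y)), X])  # add intercept column
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(Xd @ beta, -30, 30)))
        w = p * (1.0 - p)
        # The tiny ridge only stabilises the linear solve.
        info = Xd.T @ (Xd * w[:, None]) + ridge * np.eye(Xd.shape[1])
        beta = beta + np.linalg.solve(info, Xd.T @ (y - p))
    return beta, np.linalg.inv(info)

def importance_ranks(X, y):
    """Rank predictors by Wald chi-square (rank 1 = most important)."""
    beta, cov = fit_logistic(X, y)
    wald = beta[1:] ** 2 / np.diag(cov)[1:]
    return (-wald).argsort().argsort() + 1

def bootstrap_rank_intervals(X, y, n_boot=200, seed=0):
    """Percentile intervals for each predictor's rank over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(y)
    ranks = np.empty((n_boot, X.shape[1]), dtype=int)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        ranks[b] = importance_ranks(X[idx], y[idx])
    lower, upper = np.percentile(ranks, [2.5, 97.5], axis=0)
    return lower, upper

# Simulated data: five predictors, the last two with no true effect.
rng = np.random.default_rng(42)
n, true_beta = 500, np.array([1.0, 0.5, 0.25, 0.0, 0.0])
X = rng.normal(size=(n, true_beta.size))
prob = 1.0 / (1.0 + np.exp(-(-1.0 + X @ true_beta)))
y = (rng.random(n) < prob).astype(float)

lower, upper = bootstrap_rank_intervals(X, y)
for j in range(X.shape[1]):
    print(f"x{j + 1}: 95% bootstrap interval for rank = [{lower[j]:.0f}, {upper[j]:.0f}]")
```

Wide intervals for the weaker predictors are the expected outcome, which is the point being made about how hard the selection task is.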

I think that data reduction is the best bet for your situation. Sparse principal components is promising, or you can do variable clustering using a similarity measure such as squared Spearman \rho. Variables that compete with each other can be scored into a single dimension. This brings a lot of stability to the analysis and increases interpretability. My RMS book and course notes go into detail. I also recommend Ewout Steyerberg’s Clinical Prediction Models book.
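
A minimal sketch of variable clustering on squared Spearman correlation, followed by scoring each cluster into one dimension, might look like the following. It is only an illustration under stated assumptions: the data are simulated, the average linkage and the 0.7 cut-off are arbitrary choices, the first principal component is just one way to score a cluster, and the function names are invented here. Note that the outcome is never used.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

def cluster_variables(X, threshold=0.7):
    """Cluster columns of X using 1 - (Spearman rho)^2 as the dissimilarity."""
    rho, _ = spearmanr(X)  # correlation matrix when X has more than two columns
    dissimilarity = 1.0 - rho ** 2
    np.fill_diagonal(dissimilarity, 0.0)
    tree = linkage(squareform(dissimilarity, checks=False), method="average")
    return fcluster(tree, t=threshold, criterion="distance")

def first_component_score(block):
    """Score a block of columns as its first principal component."""
    z = (block - block.mean(axis=0)) / block.std(axis=0)
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return z @ vt[0]

# Simulated data: six variables, three driven by each of two latent factors.
rng = np.random.default_rng(0)
n = 500
latent = rng.normal(size=(n, 2))
X = np.column_stack([latent[:, j // 3] + 0.5 * rng.normal(size=n) for j in range(6)])

labels = cluster_variables(X)
scores = np.column_stack(
    [first_component_score(X[:, labels == c]) for c in np.unique(labels)]
)
print(labels, scores.shape)
```

The model would then be fitted to the columns of `scores` rather than to the original variables, so fewer parameters are estimated against the outcome.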

Data reduction (unsupervised learning) reduces the number of parameters that have to be estimated against the outcome, which keeps overfitting in check without needing special penalization software.


Thank you for your reply. I have a lot of reading to do to understand the suggested methods! I didn’t mention that the vast majority of the candidate predictors are binary (record of symptom/no record of symptom). Will this be problematic if using PCA?

Regarding the choice of appropriate controls, the plan is to match on age, gender and GP practice.

I initially planned to use multinomial logistic regression. However, given that I am interested in identifying possible differences in risk profiles for colorectal cancer by subsite (proximal and distal), should I be building two separate models? My reasoning behind this is that the data reduction may result in dropping different variables.


There are data reduction methods that are designed to work with purely categorical variables, and there probably are methods for mixtures of categorical and continuous variables. See here and here for some useful pointers.

We often use regular principal components even for binary data. It’s “approximately” OK.
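
For what it is worth, ordinary principal components on 0/1 indicators can be sketched in a few lines. This is an illustration only: the eight binary "symptom" columns are simulated from two made-up latent tendencies, the columns are standardized before the decomposition (one common choice), and keeping two components is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Simulated binary indicators: columns 0-3 follow latent 0, columns 4-7 latent 1.
latent = rng.normal(size=(n, 2))
which_latent = np.repeat([0, 1], 4)
prob = 1.0 / (1.0 + np.exp(-(-1.0 + 1.5 * latent[:, which_latent])))
X = (rng.random((n, 8)) < prob).astype(float)

# Ordinary PCA via SVD of the standardized indicator matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
_, s, vt = np.linalg.svd(Z, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)  # proportion of variance per component

k = 2  # number of components kept (arbitrary here)
scores = Z @ vt[:k].T
print(np.round(explained, 3))
```

The component scores are continuous even though the inputs are binary, and they can be used as predictors in place of the individual indicators.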

My inclination is to try for one model but I don’t understand your setup well enough to offer an informed opinion about that.