Hi, I am new to this forum and looking for some advice on my analysis plans.
I would like to identify clinical features, or combinations of clinical features, that predict a high risk of a colorectal cancer diagnosis. Clinical features will be identified from primary care data and include recorded signs, symptoms and blood test results. It is likely that there will be a high number of missing values for the clinical measurements. I have identified approximately 25 plausible candidate variables which could be included in the model. I am specifically interested in assessing whether the risk profiles differ by cancer subsite. I plan to do a case-control study and fit a logistic regression model. Cases will be patients diagnosed with cancer and controls will be patients with no diagnosis of cancer. I would analyse predictors of cancer overall and then by subsite.
Previous studies focusing on clinical features as predictors of cancer have used stepwise techniques (bivariate screening, forward/backward selection). However, I want to avoid these given their known flaws.
Based on prior knowledge, all of the candidate variables could plausibly be important predictors, particularly in combination. However, including all of them is likely to produce an overfitted model, because the number of events is low relative to the number of predictors.
I haven’t used LASSO regression before, but from my initial reading I think it may be a suitable alternative to stepwise selection. Are there any other suggestions?
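To make the question concrete, here is a minimal sketch of what I understand LASSO logistic regression to be doing: the usual logistic log-likelihood plus an L1 penalty, which can shrink some coefficients exactly to zero. Everything in it is an assumption for illustration only: the data are simulated, the function names (`simulate`, `fit_lasso_logistic`) are hypothetical, the penalty `lam` is picked by hand rather than by cross-validation, and the fitting method (proximal gradient descent in plain Python) is just the simplest one I could write down. In practice I would expect to use an established package and to standardise the predictors first.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (proximal step for the L1 penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso_logistic(X, y, lam, step=0.5, n_iter=300):
    """Minimise mean negative log-likelihood + lam * sum(|coef|).

    The intercept is not penalised. Assumes predictors are roughly standardised.
    """
    n, p = len(X), len(X[0])
    intercept = 0.0
    coefs = [0.0] * p
    for _ in range(n_iter):
        grad_intercept = 0.0
        grad = [0.0] * p
        for row, label in zip(X, y):
            z = intercept + sum(c * v for c, v in zip(coefs, row))
            resid = sigmoid(z) - label
            grad_intercept += resid
            for j in range(p):
                grad[j] += resid * row[j]
        # Plain gradient step for the intercept, gradient step followed by
        # soft-thresholding for the penalised coefficients.
        intercept -= step * grad_intercept / n
        coefs = [soft_threshold(c - step * g / n, step * lam)
                 for c, g in zip(coefs, grad)]
    return intercept, coefs


def simulate(n, n_features, true_coefs, seed=0):
    """Simulated stand-in for the real data: independent standard normal predictors."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        z = sum(b * v for b, v in zip(true_coefs, row))
        y.append(1 if rng.random() < sigmoid(z) else 0)
        X.append(row)
    return X, y


# Two informative predictors and four pure-noise predictors (all hypothetical).
true_coefs = [1.5, -1.0, 0.0, 0.0, 0.0, 0.0]
X, y = simulate(400, 6, true_coefs)
intercept, coefs = fit_lasso_logistic(X, y, lam=0.1)
selected = [j for j, c in enumerate(coefs) if c != 0.0]
print("coefficients:", [round(c, 3) for c in coefs])
print("selected feature indices:", selected)
```

My understanding is that a larger `lam` drops more predictors (a large enough value zeroes all of them), so the choice of penalty would need to be tuned, presumably by cross-validation.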
I also anticipate that problems may arise due to multicollinearity. Do you have any suggestions on how to deal with this?
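As a first check for collinearity, I was thinking of something like the sketch below, which flags pairs of predictors with a high absolute Pearson correlation. The variable names, the simulated values and the 0.8 cut-off are all assumptions for illustration, not taken from my data, and I realise a pairwise screen like this would miss collinearity that involves three or more variables.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (v - mean_b) for x, v in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def highly_correlated_pairs(columns, threshold=0.8):
    """Return (name_1, name_2, r) for each pair with |r| >= threshold.

    `columns` maps a variable name to its list of values; the threshold
    is an arbitrary illustrative cut-off.
    """
    names = list(columns)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson(columns[names[i]], columns[names[j]])
            if abs(r) >= threshold:
                flagged.append((names[i], names[j], r))
    return flagged


# Simulated, hypothetical blood test variables on a standardised scale.
rng = random.Random(1)
haemoglobin = [rng.gauss(0.0, 1.0) for _ in range(300)]
# Constructed to track haemoglobin closely, so the pair should be flagged.
haematocrit = [h + rng.gauss(0.0, 0.2) for h in haemoglobin]
platelets = [rng.gauss(0.0, 1.0) for _ in range(300)]

columns = {
    "haemoglobin": haemoglobin,
    "haematocrit": haematocrit,
    "platelets": platelets,
}
for name_1, name_2, r in highly_correlated_pairs(columns):
    print(name_1, name_2, round(r, 2))
```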