Model selection process

I have about 30 variables that I think might be useful for modelling my outcome, which is a binary variable.

Before I select my model I want to preregister. What is the standard way to do this?

For example, when selecting between model A and model B, should the cross-validation (CV) etc. be done on a limited subset of the 30 variables?

My feeling is that fitting model A, doing variable selection, and then comparing it to model B after its own variable selection isn’t a good comparison. Nor is it a good comparison if I select model B, do variable selection, and then use those variables for model A.

Am I overthinking this? I see lots of material on how to choose variables once the model has been chosen. What I see less of (or can someone point me to a reference?) is how to select the best model when the choice of variables is already known.


You are definitely getting at a real issue that doesn’t have final answers.

  1. With binary data, 30 variables and two models to nest the variables in, I hope you have a lot of data!
  2. Including/excluding variables is a kind of model selection, especially if you consider interactions and non-parametric components that use those variables
  3. You are not over-thinking this. Many people spend a lot of effort on variable selection but pick a “standard” model… sometimes it works and sometimes it doesn’t.

You are more likely to get good answers if you can be more specific about your models/data. For example:

  1. Are any of the 30 variables ones that have historical data and scientific support (e.g., mechanistic knowledge) that backs up their inclusion? Just include them. You don’t need to re-invent the wheel when you have historical data and scientific support.
  2. Are any of the 30 variables ones that are the core of your analysis? Just include them. You can often substitute estimation for model selection: if a variable is “included” but its estimated effect is zero, you have your answer. You don’t need to do more math/computation to arrive at a second version of the same conclusion.
  3. CV gives you an estimate of the out-of-sample prediction error. So if you choose your model based on CV with only the variables from #1 and #2 above that you decided you must include, then you are assuming that further variables will not change your answer about the “best” model. That might be ok if, for example, model A includes a proportional hazards assumption whereas model B relaxes it, because once you know that your system breaks the proportional hazards assumption, adding a few covariates is unlikely to fix it.
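
To make point 3 concrete, here is a minimal sketch of comparing two candidate models by CV on the same fixed set of must-include variables, using identical folds for both. Everything in it is an assumption for illustration: the data are simulated, `features_a` and `features_b` are hypothetical stand-ins for "model A" and "model B" (main effects only versus main effects plus one interaction), and the logistic fit is a bare-bones gradient descent rather than anything you would use in practice.

```python
import math
import random

random.seed(0)

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clip to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

# Simulated data (an assumption, not the poster's data): three must-include
# variables and a binary outcome whose true logit has an x0*x1 interaction.
X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(300)]
y = [1 if random.random() < sigmoid(0.5 * x[0] - 0.5 * x[1] + 2.0 * x[0] * x[1]) else 0
     for x in X]

# Two hypothetical candidate models built on the SAME variables.
def features_a(x):
    return list(x)                   # model A: main effects only

def features_b(x):
    return list(x) + [x[0] * x[1]]   # model B: adds one interaction

def predict(w, r):
    return sigmoid(w[0] + sum(wj * v for wj, v in zip(w[1:], r)))

def fit_logistic(rows, labels, steps=200, lr=0.5):
    """Logistic regression by plain gradient descent; intercept is w[0]."""
    w = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for r, t in zip(rows, labels):
            err = predict(w, r) - t
            grad[0] += err
            for j, v in enumerate(r):
                grad[j + 1] += err * v
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w

def mean_log_loss(w, rows, labels):
    eps = 1e-12
    total = 0.0
    for r, t in zip(rows, labels):
        p = predict(w, r)
        total -= t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
    return total / len(labels)

# Fix the fold assignment once so both models see identical splits.
k = 5
idx = list(range(len(X)))
random.shuffle(idx)
folds = [idx[i::k] for i in range(k)]

def cv_log_loss(feature_map):
    """Average held-out log loss over the k fixed folds."""
    losses = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        w = fit_logistic([feature_map(X[i]) for i in train], [y[i] for i in train])
        losses.append(mean_log_loss(w, [feature_map(X[i]) for i in held_out],
                                    [y[i] for i in held_out]))
    return sum(losses) / k

scores = {"A": cv_log_loss(features_a), "B": cv_log_loss(features_b)}
print(scores)  # lower held-out log loss is better
```

Because the variable set and the folds are held fixed, any difference in the two scores comes from the model structure alone. The comparison still rests on the assumption stated in point 3: that adding further variables later would not change which model comes out ahead.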

I’m just throwing out some examples of how you could approach your question. Without more information on your models/data, I don’t think your questions have a straightforward answer.