Questions on modelling strategies in binary logistic regression from two papers

Sorry for the wordy post; I was trying to make my question clear.

I am searching for a proper approach to model specification and final model presentation.
I understand that, for model specification, univariable screening to select candidate factors (whether at p < 0.05, 0.20, or 0.25) may not be a good idea, while data reduction based on subject-matter knowledge is considered a superior method.

As I am studying head injury, I hope to get more insight and learn from two landmark papers in the field;
both had quite large sample sizes for developing prognostic models.

In the CRASH study, multivariable logistic regression was performed and variables that were not significant at the 5% level were excluded. The retained variables' predictive contributions were then quantified by their z scores. A core model and a CT model were then introduced (seemingly based on the retained variables ranked by z score). The internal validity of the final model was assessed with the bootstrap.

In the IMPACT study, a multivariable model with 26 potential predictors was examined. Important variables were then selected according to Nagelkerke R2 in the multivariable analyses. Three levels of models were then introduced (seemingly based on the Nagelkerke R2 ranking). The internal validity of the final model was assessed with cross-validation.

May I ask:

  1. for the internal validation, did the bootstrap and cross-validation in these papers account for the variable-selection strategy? If not, is it acceptable to perform internal validation this way, or is rigorous internal validation a necessity?

  2. I am curious about how to properly build multiple levels of prediction models like those in these two studies:
    is it a good approach to run a multivariable analysis with all considered factors, rank the variables by R2 or z score, and then select variables by ranking or by category (such as clinical examination, CT, and lab variables) to form the different levels of models?

Or should there be multiple “full models”, one fitted for each level of prediction model?
For example:

Should there be a 1st full model with candidate factors from demographics and clinical examination only, since that level of model is designed to use factors from these categories alone;

then a 2nd full model with candidate factors from demographics, clinical examination, and CT features;

and a 3rd full model with candidate factors from demographics, clinical examination, CT features, and lab variables?


Just to address part of your good questions: rigorous internal validation is a very good idea, and you need to dig into the papers to make sure that they did it correctly. All supervised learning steps must be repeated afresh for each bootstrap sample. This includes variable selection.
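To make that concrete, here is a minimal sketch, on simulated data, of optimism-corrected bootstrap validation in which the selection step is redone inside every resample. The simulated data, the crude coefficient-magnitude screen (a stand-in for z-score selection), and the choice of AUC as the performance index are all illustrative assumptions, not the CRASH procedure itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, p = 400, 8
X = rng.normal(size=(n, p))
# simulated outcome: only the first two predictors carry signal
y = (rng.random(n) < 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))).astype(int)

def select_and_fit(X, y, keep_k=3):
    """Crude screening step (rank by |coefficient|, keep the top k) followed
    by a refit -- the kind of supervised step that must be repeated afresh
    inside every bootstrap sample."""
    full = LogisticRegression(max_iter=1000).fit(X, y)
    keep = np.argsort(-np.abs(full.coef_[0]))[:keep_k]
    return keep, LogisticRegression(max_iter=1000).fit(X[:, keep], y)

# apparent (resubstitution) performance of the selected model
keep, model = select_and_fit(X, y)
apparent = roc_auc_score(y, model.predict_proba(X[:, keep])[:, 1])

B = 50
optimism = 0.0
for _ in range(B):
    idx = rng.integers(0, n, n)               # bootstrap resample
    kb, mb = select_and_fit(X[idx], y[idx])   # selection is redone here
    auc_boot = roc_auc_score(y[idx], mb.predict_proba(X[idx][:, kb])[:, 1])
    auc_orig = roc_auc_score(y, mb.predict_proba(X[:, kb])[:, 1])
    optimism += (auc_boot - auc_orig) / B     # Efron-style optimism estimate

corrected = apparent - optimism
```

If selection were instead done once on the full data and held fixed inside the loop, the estimated optimism would understate the true overfitting; repeating it inside each resample is what makes the validation honest.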

Most authors do not question whether variable selection is a good idea, and use it reflexively. It is usually a bad idea. And note that whether you use p-values, AIC, R^2, or a z-score it’s all really the same.


Thank you for your reply Professor Harrell. I found that the IMPACT study actually used a nested set of adjusted analyses after the univariate analysis, with which I have zero experience.

For developing multiple levels of prediction models from grouped candidate factors, could you offer some suggestions on how it can be done?

Is the design I mentioned above appropriate, i.e., preparing multiple “full models” from combinations of the grouped factors for further analysis? I found it quite heavy to perform, though: with three levels of variables, for each outcome I need to fit three full models.

I was advised to fit my full model, select the factors with good z scores or Wald \chi^2 or with p < 0.05, and check whether the performance loss with the simpler model is tolerable. I do not like this idea, as rigorous internal validation is hard to perform after that.

Don’t use p-values and the like for deciding on models. Model specification needs to be heavily subject-matter driven. And never do univariable screening when you can avoid it. But to your question about groups of variables, since you are grouping the variables using only subject matter knowledge, there are a lot of cool things you can do by fitting sequential models, treating each group of variables as an entity along the lines of doing chunk (composite) tests. Use the likelihood ratio \chi^2 test statistic as your measure of how much each group of variables adds to predictive information. The maximum likelihood chapter in my book goes into some detail; see for example the adequacy index and this blog.


Thank you Professor Harrell. I will start digesting. :)