Sorry for the wordy post; I was trying to make my question clear.
I am looking for a proper approach to model specification and to presentation of the final model.
I understand that, for model specification, selecting candidate factors by univariable screening (whether at p < 0.05, 0.20, or 0.25) may not be a good idea, and that data reduction based on subject-matter knowledge is considered a superior method.
As I am studying head injury, I hope to gain more insight and learn from two landmark papers in the field, the CRASH and IMPACT studies;
both had quite large sample sizes for developing prognostic models.
In the CRASH study, a multivariable logistic regression was fitted and variables that were not significant at the 5% level were excluded. The predictive contribution of each retained variable was then quantified by its z score. A core model and a CT model were then introduced (seemingly based on the retained variables and their z-score ranking). The internal validity of the final model was assessed with the bootstrap.
In the IMPACT study, a multivariable model with 26 potential predictors was examined. Important variables were then selected according to their Nagelkerke R² in multivariable analyses. Three levels of models were then introduced (seemingly based on the Nagelkerke R² ranking in the multivariable analyses). The internal validity of the final model was assessed with cross-validation.
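For reference, this is the usual definition of the Nagelkerke R² as I understand it. The notation is my own and is not taken from the IMPACT paper: n is the sample size, and ℓ_0 and ℓ_1 are the maximised log-likelihoods of the intercept-only model and the fitted model.

$$
R^2_{\text{Nagelkerke}}
= \frac{1 - \exp\!\left(\tfrac{2}{n}\,(\ell_0 - \ell_1)\right)}
       {1 - \exp\!\left(\tfrac{2}{n}\,\ell_0\right)}
$$

The numerator is the Cox-Snell R² and the denominator is its maximum attainable value, so the ratio is rescaled to lie between 0 and 1.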
May I ask:
For the internal validation, did the bootstrap and cross-validation in these studies account for the variable selection strategy? If not, is it acceptable to perform internal validation this way, or is rigorous internal validation, with the selection repeated in every resample, a necessity?
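To show what I mean by internal validation that considers the selection strategy, here is a minimal sketch of an optimism-corrected bootstrap in which the selection step is repeated inside every resample. Everything in it is an assumption for illustration and not the code or exact procedure of either study: the data are simulated, the logistic fit is a hand-rolled Newton-Raphson, the |z| > 1.96 rule stands in for "significant at the 5% level", AUC is my choice of performance measure, and all function names are hypothetical.

```python
import numpy as np

def sigmoid(eta):
    # Clip the linear predictor to avoid overflow in exp().
    return 1.0 / (1.0 + np.exp(-np.clip(eta, -30, 30)))

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson logistic fit. X must include an intercept column.
    Returns the coefficients and their standard errors."""
    k = X.shape[1]
    beta = np.zeros(k)
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        info = X.T @ (X * (p * (1 - p))[:, None]) + 1e-8 * np.eye(k)
        beta = beta + np.linalg.solve(info, X.T @ (y - p))
    p = sigmoid(X @ beta)
    info = X.T @ (X * (p * (1 - p))[:, None]) + 1e-8 * np.eye(k)
    return beta, np.sqrt(np.diag(np.linalg.inv(info)))

def select_and_fit(X, y, z_crit=1.96):
    """The whole modelling strategy: fit the full model, keep predictors
    with |z| > z_crit, then refit using only the kept predictors."""
    ones = np.ones(len(y))
    beta, se = fit_logistic(np.column_stack([ones, X]), y)
    keep = np.flatnonzero(np.abs(beta[1:] / se[1:]) > z_crit)
    beta_red, _ = fit_logistic(np.column_stack([ones, X[:, keep]]), y)
    return keep, beta_red

def predict(X, keep, beta):
    return sigmoid(np.column_stack([np.ones(X.shape[0]), X[:, keep]]) @ beta)

def auc(y, score):
    # Probability that a random event outranks a random non-event (ties count 1/2).
    diff = score[y == 1][:, None] - score[y == 0][None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def optimism_corrected_auc(X, y, n_boot=100, seed=0):
    rng = np.random.default_rng(seed)
    keep, beta = select_and_fit(X, y)
    apparent = auc(y, predict(X, keep, beta))
    optimism = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))          # resample with replacement
        kb, bb = select_and_fit(X[idx], y[idx])        # selection redone here
        auc_boot = auc(y[idx], predict(X[idx], kb, bb))  # on the resample
        auc_orig = auc(y, predict(X, kb, bb))            # on the original data
        optimism.append(auc_boot - auc_orig)
    return apparent, apparent - np.mean(optimism)

# Simulated example: 10 candidate predictors, only the first 3 matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 10))
y = rng.binomial(1, sigmoid(0.8 * X[:, 0] - 0.6 * X[:, 1] + 0.4 * X[:, 2]))
apparent, corrected = optimism_corrected_auc(X, y)
print(f"apparent AUC = {apparent:.3f}, optimism-corrected AUC = {corrected:.3f}")
```

The alternative I am asking about would call the selection once on the full data and then bootstrap only the refit of the already-selected variables, which is the version I suspect understates the optimism.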
I am also curious about how to properly build multiple levels of prediction models, as these two studies did.
Is it a good approach to fit a multivariable model with all candidate factors, rank the variables by R² or z score, and then select variables by ranking or by category (such as clinical examination, CT, and lab variables) to form the different levels of model?
Or should a separate “full model” be fitted for each level of prediction model?
That is, should there be a 1st full model whose candidate factors come from demographics and clinical examination only, since that level is designed to use only these categories,
then a 2nd full model with candidate factors from demographics, clinical examination, and CT features,
and a 3rd full model with candidate factors from demographics, clinical examination, CT features, and lab variables?
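To make the "one full model per level" option concrete, here is a minimal sketch of what I have in mind. It is only an illustration under stated assumptions: the data are simulated, the column groupings in `blocks` and the level names are hypothetical, the logistic fit is a hand-rolled Newton-Raphson, and Nagelkerke R² is computed from the log-likelihoods in the usual way. It is not the procedure of either paper.

```python
import numpy as np

def sigmoid(eta):
    # Clip the linear predictor to avoid overflow in exp().
    return 1.0 / (1.0 + np.exp(-np.clip(eta, -30, 30)))

def max_loglik(X, y, n_iter=25):
    """Fit a logistic model (intercept added here) by Newton-Raphson
    and return its maximised log-likelihood."""
    Xd = np.column_stack([np.ones(len(y)), X])
    k = Xd.shape[1]
    beta = np.zeros(k)
    for _ in range(n_iter):
        p = sigmoid(Xd @ beta)
        info = Xd.T @ (Xd * (p * (1 - p))[:, None]) + 1e-8 * np.eye(k)
        beta = beta + np.linalg.solve(info, Xd.T @ (y - p))
    p = sigmoid(Xd @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def nagelkerke_r2(ll_model, ll_null, n):
    cox_snell = 1 - np.exp(2 * (ll_null - ll_model) / n)
    return cox_snell / (1 - np.exp(2 * ll_null / n))

# Hypothetical grouping of predictor columns by category.
blocks = {"clinical": [0, 1, 2], "ct": [3, 4], "lab": [5, 6]}
# Each level has its own full model, built from pre-specified categories.
levels = {
    "core": ["clinical"],
    "core+CT": ["clinical", "ct"],
    "core+CT+lab": ["clinical", "ct", "lab"],
}

# Simulated data: one informative predictor in each category.
rng = np.random.default_rng(2)
n = 500
X = rng.normal(size=(n, 7))
y = rng.binomial(1, sigmoid(0.9 * X[:, 0] + 0.7 * X[:, 3] + 0.5 * X[:, 5]))

# Log-likelihood of the intercept-only model has a closed form.
p_bar = y.mean()
ll_null = n * (p_bar * np.log(p_bar) + (1 - p_bar) * np.log(1 - p_bar))

r2 = {}
for name, cats in levels.items():
    cols = [c for cat in cats for c in blocks[cat]]
    r2[name] = nagelkerke_r2(max_loglik(X[:, cols], y), ll_null, n)
    print(f"{name}: {len(cols)} predictors, Nagelkerke R2 = {r2[name]:.3f}")
```

In this version, which variables enter each level is decided by category before looking at the outcome, so no ranking from a single overall model is needed; whether that is preferable to ranking by R² or z score is exactly what I am unsure about.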