This is a general question about developing a clinical prediction model. In a list of candidate predictors (drawn from an extensive literature review and expert opinion, after considering availability at the point of care and cost-effectiveness), we often find that one predictor is much stronger than the others. What should be done in that scenario? My understanding is that it is always better to prespecify the variables and the model.
In general, designing a model starts by asking what we do not know, with the aim of reducing that uncertainty. The more information the model adds, the more useful it is. My clinical experience over the years has been that when strong predictors are present, multivariable models are either less useful, because the clinical picture is already evident, or decoupled from decision making. For example, to predict prognosis in pulmonary embolism, the simplified PESI score uses very strong predictors such as shock or respiratory failure. However, one does not need to study medicine to know that a patient who is asphyxiating is “high risk” compared with one who is breathing normally. A much more subtle and useful question would be how to stratify prognosis in patients who are clinically stable, or at least across graded levels of hypoxemia (treating the variables as continuous).
The other consequence of these strong predictors is that prognostic categories rarely map cleanly onto therapeutic categories. If a patient has just one of these variables, he is already high risk and requires intensification of therapy, regardless of the rest of the model. The simplified PESI in effect resolves this by letting the presence of any single binary predictor define a high-risk situation.
In fact, it is questionable whether this type of model predicts anything at all when the predictor is obviously intertwined with the endpoint, so that there is no real uncertainty. For example, the MASCC score for predicting complications uses shock as a predictor. However, the endpoint is a mixture of severe complications that also includes shock or death from septic shock, so you’re not really predicting much.
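To make the circularity concrete, here is a minimal sketch in Python. The cohort is invented purely for illustration, and it assumes the shock recorded as a predictor is the same event that the composite endpoint counts; the names `composite_endpoint` and `endpoint_rate` are hypothetical.

```python
# Hypothetical toy cohort, invented for illustration only.
# Each record: (shock_at_presentation, other_severe_complication)
cohort = [
    (True, False), (True, True),
    (False, True), (False, False), (False, False), (False, False),
]


def composite_endpoint(shock, other_complication):
    """Endpoint that counts shock itself as one of the severe complications."""
    return shock or other_complication


def endpoint_rate(records):
    """Proportion of records that meet the composite endpoint."""
    events = [composite_endpoint(s, o) for s, o in records]
    return sum(events) / len(events)


with_shock = [r for r in cohort if r[0]]
without_shock = [r for r in cohort if not r[0]]

print("endpoint rate with shock:   ", endpoint_rate(with_shock))
print("endpoint rate without shock:", endpoint_rate(without_shock))
```

Under that assumption, every patient with shock meets the endpoint by definition, so the apparent strength of the predictor reflects how the endpoint was defined and not any prognostic information.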
In model design, my view at this point is to select those subjects with true uncertainty, and not try to predict prognosis in subjects where the picture is already obvious.
Careful modeling is especially helpful even when a predictor is clinically very strong, because clinicians are often right about the strength of a predictor but wrong about the shape of its relationship with the outcome.
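One common way to let the data speak about shape is to expand the strong continuous predictor with a restricted cubic spline. The sketch below, in plain Python, builds the truncated power basis for a single value. The function name `rcs_basis`, the knot locations, and the choice of oxygen saturation as the example predictor are all assumptions made for illustration and are not taken from any particular package.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis (truncated power form) for one value x.

    Returns k - 1 columns for k distinct knots: x itself plus k - 2
    nonlinear terms. The resulting curve is linear beyond the outer knots.
    """
    t = sorted(knots)
    k = len(t)
    if k < 3:
        raise ValueError("need at least 3 knots")

    def cube_plus(u):
        # Truncated cubic: u^3 when u is positive, otherwise 0
        return u ** 3 if u > 0 else 0.0

    scale = (t[-1] - t[0]) ** 2  # keeps the nonlinear terms on the scale of x
    span = t[-1] - t[-2]
    columns = [float(x)]
    for j in range(k - 2):
        term = (
            cube_plus(x - t[j])
            - cube_plus(x - t[-2]) * (t[-1] - t[j]) / span
            + cube_plus(x - t[-1]) * (t[-2] - t[j]) / span
        )
        columns.append(term / scale)
    return columns


# Hypothetical knots for a continuous predictor such as oxygen saturation (%)
knots = [85, 90, 94, 98]
for value in (80, 88, 92, 96, 99):
    print(value, [round(c, 4) for c in rcs_basis(value, knots)])
```

These columns would go into the regression in place of the raw predictor. With four knots this costs three parameters instead of one, and that is the price of not assuming the relationship is a straight line.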
To follow on from what Frank stated, careful thinking about how to represent and model the strong variable is in order. For example, smoking history (simply as pack-years) is a strong predictor of lung cancer, and its effect is highly non-linear across the entire range of values. However, an argument could be made that the presence of any smoking history (binary, or categorical: never, ever, current) captures the generally elevated risk that a person who smokes carries. Some measures of smoking history represent the exposure directly (e.g. pack-years), while other factors reflect the individual's response to the exposure (e.g. COPD, lung function as FEV1, shortness of breath). These are all correlated with one another to some extent, but each also adds to the predictive model. Carefully consider all of them and your available degrees of freedom, and spend those degrees of freedom freely on this strong predictor.
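As a rough way to keep track of how the degrees of freedom are being spent, a prespecified plan can be tallied before any outcome data are examined. The sketch below is only an illustration: the predictor encodings, the knot counts, and the budget of 12 are hypothetical, and the right budget depends on your effective sample size.

```python
def spline_df(n_knots):
    """A restricted cubic spline with k knots uses k - 1 regression parameters."""
    return n_knots - 1


def categorical_df(n_levels):
    """A categorical predictor with c levels uses c - 1 indicator parameters."""
    return n_levels - 1


def check_budget(plan, budget):
    """Return (total parameters in the plan, parameters left in the budget)."""
    total = sum(plan.values())
    return total, budget - total


# Hypothetical prespecified plan: most parameters go to the strong predictor
plan = {
    "pack_years (spline, 5 knots)": spline_df(5),
    "smoking_status (3 levels)": categorical_df(3),
    "fev1 (spline, 3 knots)": spline_df(3),
    "copd (yes/no)": categorical_df(2),
    "shortness_of_breath (yes/no)": categorical_df(2),
}
df_budget = 12  # hypothetical; set from your own sample size considerations

total, remaining = check_budget(plan, df_budget)
for name, df in plan.items():
    print(f"{name}: {df} df")
print(f"total: {total} df, remaining in budget: {remaining}")
```

The point of writing it down this way is that giving the strong predictor more knots is a deliberate trade against the weaker predictors, decided in advance and not after looking at the fit.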