Prof. Harrell, I am building a predictive model for stroke and have a question on inclusion/exclusion criteria.
I am trying to decide whether to keep participants with previous stroke in my dataset, and therefore use previous stroke as a predictor.
One member of my research group vaguely remembers that in one of your lectures you mentioned an article with a prediction model for stroke that did not include previous stroke among the predictors, which you did not consider appropriate since it is one of the most important predictors. I apologize if their recollection is not accurate. Unfortunately, I could not locate this lecture to review the discussion, but we had extensive discussions on this.
I kindly wanted to gather your thoughts on this issue, if possible. Is it better to include participants with previous stroke, so we can use this variable as a predictor? I am afraid this may lead to survival bias, since only people who survived a stroke would be included.
Ignoring the issue of survival bias, if the sample to be analyzed includes people with previous stroke it is imperative to adjust for previous stroke as a covariate. This is best done by adjusting for the severity of previous stroke, and how long ago it happened.
Thank you very much for your quick response, Professor Harrell!
In my project (stroke event prediction using machine learning techniques), I have the option of including or not including participants with a history of stroke in my sample. My sample is very large and I have many variables.
I am trying to decide what would be the best option in terms of model quality/bias: to include these participants in the sample or not to include them?
I ask this question considering precisely the survival bias and, in addition, considering that I can include the presence of a previous stroke as a predictor variable, but I will not have the variables of severity of the previous stroke and time elapsed since the previous stroke.
You are building a predictive model, presumably for real world use. I could be misunderstanding the question, but I do not believe this is a statistical question.
Do you want the model to be usable for those patients who have a history of stroke? If so, include prior stroke as a predictor.
Do you want the model to only be usable for patients with no prior history of stroke? Then exclude prior stroke as a predictor.
In any case I recommend the TRIPOD-AI reporting guidance https://www.bmj.com/content/385/bmj-2023-078378 . Ostensibly, this is a reporting checklist but it is also an excellent resource for any predictive modeling endeavor.
Hi, Elias, thank you very much for your response and your considerations!
I believe that, even if I want to apply the predictive model in the real world, it still involves statistical issues, including biases to be avoided.
Considering the question you asked, I would like to build a predictive model using machine learning techniques that performs well in predicting this event. Including or not including participants with a previous history of stroke is not part of the objective itself, but rather a decision on inclusion/exclusion criteria that I would like to make considering what is most correct to be done in my scenario (that is, considering all the points I mentioned in my second message in this sequence).
For example, if the ideal way to avoid survival bias is to remove cases with a previous history of stroke, then I understand that I will need to build a model only for participants without a previous history of stroke. This makes me think that if I also wanted to predict stroke events for people with a prior history of stroke, then rather than simply including them in the same sample as participants without a prior history of stroke, I should build a separate predictive model specific to this population. That is, I would have one model for participants with no prior history of stroke and another for participants with a prior history of stroke.
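To make the two options in this thread concrete, here is a minimal sketch on synthetic data. It assumes a plain logistic model fitted with NumPy rather than any particular machine learning method, and every variable name, sample size, and coefficient is made up for illustration. It does not address survival bias or the missing severity and time-since-stroke variables; it only shows what "include prior stroke as a predictor" versus "restrict the cohort" looks like in code.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit a logistic regression by Newton-Raphson.

    X must already contain an intercept column. Returns the coefficient vector.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # predicted probabilities
        w = p * (1.0 - p)                     # Newton weights
        grad = X.T @ (y - p)                  # score vector
        hess = X.T @ (X * w[:, None])         # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Synthetic cohort (all values are assumptions for illustration only)
rng = np.random.default_rng(0)
n = 20000
age = rng.normal(0.0, 1.0, n)                 # standardized age
prior_stroke = rng.binomial(1, 0.1, n)        # 1 = history of stroke
true_logit = -3.0 + 0.5 * age + 1.2 * prior_stroke
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))

# Option A: keep everyone and use prior stroke as a predictor
X_pooled = np.column_stack([np.ones(n), age, prior_stroke])
beta_pooled = fit_logistic(X_pooled, y)

# Option B: restrict to participants without prior stroke
mask = prior_stroke == 0
X_restricted = np.column_stack([np.ones(mask.sum()), age[mask]])
beta_restricted = fit_logistic(X_restricted, y[mask])

print("pooled coefficients (intercept, age, prior_stroke):", beta_pooled)
print("restricted coefficients (intercept, age):", beta_restricted)
```

The model from Option B has no term for prior stroke, so it can only be applied to people without a history of stroke. The model from Option A covers both groups with a single fit.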
Applying a model includes some kind of treatment that for the most part is not being modeled directly.
I suggest focusing on two main types of problems: Resource-Constraint (prioritizing high-risk patients, assuming high risk is equivalent to high benefit) or Treatment-Harm (why not give everyone the treatment? Because of the harm).
Without a clear decision-making context, accuracy means nothing really.
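One common way to tie a risk model to a treat/do-not-treat decision is the net benefit measure from decision curve analysis. This is offered as one possible illustration of the Treatment-Harm framing, not as the method proposed in this thread. The sketch below assumes a single risk threshold above which patients would be treated; the function name and the toy numbers are hypothetical.

```python
import numpy as np

def net_benefit(y_true, risk, threshold):
    """Net benefit of treating everyone whose predicted risk >= threshold.

    Uses the standard decision-curve formula:
        TP/n - FP/n * threshold / (1 - threshold)
    where the threshold encodes how much harm from an unnecessary
    treatment is tolerated relative to the benefit of a necessary one.
    """
    y_true = np.asarray(y_true)
    treat = np.asarray(risk) >= threshold
    n = len(y_true)
    tp = np.sum(treat & (y_true == 1))   # treated and had the event
    fp = np.sum(treat & (y_true == 0))   # treated but no event
    return tp / n - fp / n * threshold / (1.0 - threshold)

# Toy data (made up): compare model-guided treatment with treating everyone
y_toy = np.array([1, 0, 1, 0])
risk_toy = np.array([0.9, 0.1, 0.4, 0.3])
nb_model = net_benefit(y_toy, risk_toy, threshold=0.2)
nb_treat_all = net_benefit(y_toy, np.ones(4), threshold=0.2)
print(nb_model, nb_treat_all)
```

A model is only useful at a given threshold if its net benefit exceeds both treating everyone and treating no one (net benefit 0), which is why the decision context has to be specified before accuracy can be interpreted.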
I'm not sure I understand how survivorship bias could possibly apply to a prediction model.
You would never use a model to predict future stroke risk in a patient who has already died, regardless of whether the previously dead patient died of a stroke or something else.
Individuals who survive a stroke are the only possible type of patient with a history of stroke to whom a prediction model could ever be applied.
Survivorship bias is therefore irrelevant and, for the question at hand, nonexistent as an applicable bias.
How you include prior stroke as a predictor is a statistical issue, but the decision to include it is not. I can guarantee that it will be a powerful predictor of the outcome, but more importantly, it is necessary to include prior stroke if you have any intention of anyone ever even considering using the prediction model. If this is not intuitive to you, I would strongly recommend collaborating with clinicians.