Does removing other events from training data amount to data snooping?

Suppose you are interested in the probability of death within X days due to cause A, which competes with cause B. Suppose you’ve convinced yourself (wrongly, in my opinion) that you will draw a cleaner comparison between those who survive cause A and those who die of cause A by removing all observed deaths due to cause B before building your probability prediction model. Then suppose you calculate some cross-validated prediction performance metrics; assume these are proper scoring rules so we don’t go off on a tangent there. Finally, suppose your performance metrics are better when you remove deaths due to cause B than when you don’t. Aside from the obvious problems (introducing selection bias and comparing two models that aren’t even built on the same data), doesn’t this practice amount to data snooping by incorporating post-prediction information (i.e., nobody in the training or validation data died of cause B, something you couldn’t possibly know when formulating your predictions on new data)?
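
For concreteness, by “probability of death within X days due to cause A” I mean the usual cause-specific cumulative incidence function (notation mine):

$$
F_A(x) = \Pr(T \le x,\ D = A) = \int_0^x S(u^-)\,\lambda_A(u)\,du,
$$

where $T$ is the time of death, $D$ the cause, $\lambda_A$ the cause-specific hazard for A, and $S(u^-)$ the probability of having survived both causes just before time $u$. Since $F_A$ depends on the cause-B hazard through $S$, deleting cause-B deaths changes the quantity being estimated, not merely the sample.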

I think it boils down to what you really want to estimate. Cause removal and competing events are complex issues that yield final results which are very hard to interpret. I think a state transition model is much more natural. Such a model makes explicit which probabilities you are estimating, and it handles absorbing states exactly. Some material about this is available here.

In what ways does the answer to the question (“Is this data snooping?”) depend on what I really want to estimate? My initial opinion is that deleting the observations with death due to cause B incorporates information that one couldn’t possibly know when using the model to predict the cumulative probability of death due to cause A within X days. I assert (admittedly without having checked with a simulation or anything) that this practice leads to inflated model evaluation scores due to the use of post-prediction information. Therefore, it’s data snooping.
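
For what it’s worth, here is a minimal sketch of the simulation I have in mind, in Python. The data-generating process, hazard rates, and variable names are all hypothetical choices of mine; the point is only to show how one could compare the two practices, not to claim a result:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
n, horizon = 20_000, 365  # X = 365 days (assumed horizon)

# One covariate driving both cause-specific hazards (assumed DGP).
x = rng.normal(size=n)
t_a = rng.exponential(1 / (0.001 * np.exp(0.5 * x)))   # latent time to death from cause A
t_b = rng.exponential(1 / (0.002 * np.exp(-0.3 * x)))  # latent time to death from cause B

# Outcome: death from cause A within the horizon; cause B can pre-empt it.
y = ((t_a <= horizon) & (t_a < t_b)).astype(int)
died_b_first = (t_b <= horizon) & (t_b < t_a)
X = x.reshape(-1, 1)

def cv_brier(X, y):
    """Cross-validated Brier score (a proper scoring rule) for a logistic model."""
    p = cross_val_predict(LogisticRegression(), X, y, cv=5, method="predict_proba")[:, 1]
    return brier_score_loss(y, p)

# (a) Keep everyone: subjects who die of B within the horizon simply have y = 0.
score_full = cv_brier(X, y)

# (b) The practice in question: delete all observed cause-B deaths first,
#     which uses information unavailable at prediction time for new patients.
keep = ~died_b_first
score_deleted = cv_brier(X[keep], y[keep])

print(f"CV Brier, full data:        {score_full:.4f}")
print(f"CV Brier, B-deaths deleted: {score_deleted:.4f}")
```

Note that even this comparison inherits the problem I mentioned above: the two scores are computed on different evaluation sets, so they aren’t directly comparable, which is part of the point.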

SIDE NOTE: This is an abstraction of a real conversation I’ve had with someone about their work. In this case, however, there are even more issues than data snooping. The purpose of the analysis is actually causal inference, the estimand being the risk difference. I’ve argued that deleting deaths due to cause B from the training data induces selection bias, makes unrealistic assumptions about the intervention of interest (i.e., that it is coupled with a removal of the risk of death due to cause B, as Young et al. argue), and fails to address indirect effects on cause A risk through effects on cause B risk. One defense I’ve received from the other party is that their model’s evaluation scores are higher when they remove deaths due to cause B than when they don’t. In response, I’ve said, first, that this is incorrect, because they’re comparing scores calculated on models fit to different data sets. Second, predictive performance doesn’t address the issue of selection bias I raised. Lastly, I’ve said that their practice of deleting deaths due to cause B is cheating anyway, as it is effectively data snooping due to the incorporation of post-prediction information.

These are all good points, but instead of “data snooping” I’d use descriptions like “creates bias” or “creates hard-to-interpret estimates.” I would much rather see formal modeling that respects the situation, e.g., state transition models.

What about the study will allow one to do causal inference? Why automatically switch to absolute risk difference just because you are interested in causal inference?

" These are all good points but instead of data snooping I’d use descriptions like creates bias, creates hard-to-interpret estimates." I’ve used this language for the most part in our conversations. This post about data snooping was my own curiosity about whether I’m right that one source of bias is the use of post-prediction information.

“I would much rather see formal modeling that respects the situation, e.g. state transition models.” 100% agree. Transition models unlock all the machinery of cumulative risk analysis, including emerging methods to separate direct and indirect effects.
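To make that concrete for anyone following along, here is a minimal sketch of estimating the cause-A cumulative incidence while treating death from cause B as a competing absorbing event rather than deleting it. I’m using the Aalen-Johansen estimator from the Python lifelines package as one convenient option; the simulated data are purely illustrative:

```python
import numpy as np
from lifelines import AalenJohansenFitter

rng = np.random.default_rng(1)
n = 5_000

# Illustrative competing-risks data: event = 0 (censored), 1 (cause A), 2 (cause B).
t_a = rng.exponential(1_000, size=n)   # latent time to death from cause A
t_b = rng.exponential(500, size=n)     # latent time to death from cause B
t_c = rng.uniform(0, 2_000, size=n)    # administrative censoring time
duration = np.minimum.reduce([t_a, t_b, t_c])
event = np.select([t_a == duration, t_b == duration], [1, 2], default=0)

# Aalen-Johansen estimate of the cumulative incidence of cause A,
# which keeps cause-B deaths in the risk set until they occur and
# treats them as an absorbing competing state afterwards.
ajf = AalenJohansenFitter()
ajf.fit(duration, event, event_of_interest=1)

# Estimated P(death from cause A within 365 days): the CIF evaluated
# at the last observed event time at or before day 365.
print(ajf.cumulative_density_.loc[:365].iloc[-1])
```

Nothing gets thrown away here: cause-B deaths exit the risk set when they occur, exactly as they would in real prospective use of the model.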

“What about the study will allow one to do causal inference?”

“Why automatically switch to absolute risk difference just because you are interested in causal inference?” The choice of estimand doesn’t feel crucial to the conversation I’m having with these researchers, since their problem is rooted in data collection and would remain even if they switched to another causal estimand, or even to another statistical modeling framework entirely, such as event history analysis (as opposed to their current, limited single-event probability prediction approach).