Missing inclusion criteria data in registry-based studies

I’ve been working with registry data, and sometimes one of the clinical baseline variables, the pre-stroke modified Rankin Scale (mRS), is missing. The mRS at 90 days is also the most common outcome measure in our field.

The usual practice in registry-based stroke studies is to exclude such cases from the analyses. However, most (>90%) of the patients with this information recorded are at level 0, 1, or 2 at baseline, meeting the widespread inclusion criterion. Also, we plan to use it as a covariate, not as the exposure of interest. Both the exposure of interest and other covariates have some low-frequency levels that would benefit from a few dozen additional patients.

The main reason for missingness is the impossibility of collecting the information initially (e.g., emergency physicians couldn’t reach the patient’s family) or uncertainty about the information provided on admission. I don’t see a clear reason why the missingness would be related to worse baseline status - in fact, more dependent patients are usually accompanied by someone, since they need closer attention.

That being said,

  1. Would it be appropriate to include the cases with missing baseline mRS and then perform multiple imputation? How do you see the potential harms of including those patients versus conditioning the analysis on the non-missing values?

  2. Does the answer to the question above depend on the percentage of missing data? If so, how would you handle <5%, 5-10%, 10-15% and >15%?

This is a place where multiple imputation really shines. For a given imputation you fetch the imputed values of the variables used in the inclusion criteria and include those patients who meet the criteria. This takes into account the uncertainty in the entry criteria. You can have quite different sample sizes (after exclusions) across imputations, but that’s OK.
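A minimal sketch of that impute-then-exclude loop in Python (the column names, the crude hot-deck imputation, and the mean-outcome “estimate” are all illustrative stand-ins; a real analysis would use a proper multiple-imputation engine such as mice in R):

```python
import numpy as np
import pandas as pd

# toy registry data; baseline_mrs is the inclusion-criterion variable
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "baseline_mrs": rng.integers(0, 6, n).astype(float),  # 0..5
    "age": rng.normal(70, 10, n),
    "outcome": rng.integers(0, 2, n),
})
# make 20 baseline mRS values missing
df.loc[rng.choice(n, 20, replace=False), "baseline_mrs"] = np.nan

M = 5  # number of imputations
estimates, sizes = [], []
for m in range(M):
    imp = df.copy()
    miss = imp["baseline_mrs"].isna()
    # crude hot-deck draw: sample observed values at random
    # (stand-in for a real imputation model)
    imp.loc[miss, "baseline_mrs"] = rng.choice(
        df.loc[~miss, "baseline_mrs"].to_numpy(), miss.sum())
    # apply the inclusion criterion to the *imputed* data: mRS 0-2
    analytic = imp[imp["baseline_mrs"] <= 2]
    sizes.append(len(analytic))
    estimates.append(analytic["outcome"].mean())  # stand-in for a model fit

# pool the point estimates across imputations (Rubin's rules also
# combine the variances, omitted here for brevity)
pooled = np.mean(estimates)
```

Note that `sizes` can differ from one imputation to the next, which is exactly the behavior described above: each imputation carries its own post-exclusion cohort.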


Wow. I would never have imagined that.

  1. Would that apply as a general rule for other missing criteria, as long as the missing proportion is not excessively high?

  2. Do you have any literature recommendation to back this approach?

  3. I guess including all patients in the imputation algorithm (regardless of baseline mRS), but only counting those with mRS 0, 1, 2, or NA in the flowchart and Table 1, would be the way to go - is that right?


Edit: I’ve just found that people use “impute-then-exclude” and “exclude-then-impute” when referring to that. It is an active area of research.

This paper by Peter Austin, D. Giardiello, and S. van Buuren (mice’s maintainer) compares both strategies across several scenarios and demonstrates that the exclude-then-impute approach biases the estimate for the variable of interest.

The paper addresses the case where the missing variable is also the exposure of interest - not exactly my situation (the variable is a covariate), but I imagine mine would be even less problematic, since we are not estimating this variable’s coefficient.

I hope someone can answer 2. Yes, it’s a general approach and would work no matter how high the proportion of missing data. That’s because it’s better than the alternative approaches, not because it’s perfect. On 3., yes to the first part. For the second part, Table 1 needs to be dynamic. If you had to pick a static Table 1, you might include all patients whose probability of meeting the entry criteria exceeds 0.8.
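The static-Table-1 idea can be sketched as follows: across the M imputations, record for each patient whether they meet the entry criteria, then keep those whose inclusion probability exceeds 0.8. (The indicator matrix here is simulated; in practice each row would come from one completed imputation.)

```python
import numpy as np

rng = np.random.default_rng(1)
M, n = 20, 10  # 20 imputations, 10 patients (toy sizes)

# meets[m, i] = True if patient i satisfies the inclusion criteria
# (e.g., imputed baseline mRS <= 2) in imputation m
meets = rng.random((M, n)) < 0.85

# per-patient probability of meeting the criteria across imputations
p_meet = meets.mean(axis=0)

# static Table 1 cohort: patients with inclusion probability > 0.8
table1_ids = np.where(p_meet > 0.8)[0]
```

Patients with fully observed criterion variables get `p_meet` of exactly 0 or 1; only the patients with imputed values fall in between, so the 0.8 cutoff only really adjudicates the missing cases.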
