I am working with a dataset that is an amalgamation of patients belonging to different cohorts. I have data from 800 patients in Cohort 12345, 350 patients in Cohort 54321, 900 patients in Cohort 5689, and so on. In the end, the dataset has around 45,000 observations from 11 different cohorts.
Data from one cohort in particular look fishy. Most of the patients in that cohort have 0s for their outcome. The 0s are excessive and do not appear to come from a normal data-generating process. They are due to operator error: something went wrong when the data were uploaded to the database.
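For reference, this is roughly how the excess of zeros can be seen per cohort. It is only a minimal sketch: the `(cohort_id, outcome)` record layout, the cohort labels `C1` and `C2`, and the toy values are all made up for illustration and are not my real data.

```python
from collections import defaultdict

def zero_fraction_by_cohort(records):
    """Return {cohort_id: share of outcomes equal to 0}.

    `records` is an iterable of (cohort_id, outcome) pairs. This layout is
    an assumption made for illustration, not the real table schema.
    """
    totals = defaultdict(int)
    zeros = defaultdict(int)
    for cohort_id, outcome in records:
        totals[cohort_id] += 1
        if outcome == 0:
            zeros[cohort_id] += 1
    return {cohort_id: zeros[cohort_id] / totals[cohort_id] for cohort_id in totals}

# Made-up toy data with hypothetical cohort labels.
toy_records = [
    ("C1", 0), ("C1", 3), ("C1", 5), ("C1", 2),
    ("C2", 0), ("C2", 0), ("C2", 0), ("C2", 4),
]

fractions = zero_fraction_by_cohort(toy_records)
for cohort_id, fraction in sorted(fractions.items()):
    print(cohort_id, fraction)
```

A cohort whose zero share is far above all the others is the kind of pattern I am describing.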
I want to exclude the data from this cohort, which would be around 1,200 observations. This will not make a big change to my overall sample size; I would still be left with around 40,000 observations. But what about selection bias? Would excluding this cohort because of its unnaturally excessive zeros introduce selection bias? I know that conditioning on a common cause of exposure and outcome opens a path that may lead to selection bias. In this case, does excluding a cohort because its data are very unreliable amount to conditioning on a common cause of A (exposure) and Y (outcome), and so lead to selection bias?
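To make the question concrete, here is a small simulation sketch of the scenario as I understand it. Everything in it is hypothetical: the effect size, cohort sizes, cohort baselines, and the index of the bad cohort are invented. The key assumption, which I cannot verify for my real data, is that the upload error overwrote outcomes with 0 regardless of each patient's exposure A and true outcome Y.

```python
import random

random.seed(0)

TRUE_EFFECT = 2.0      # hypothetical effect of exposure A on outcome Y
N_COHORTS = 11
N_PER_COHORT = 4000    # made-up size, not the real cohort sizes
BAD_COHORT = 3         # hypothetical index of the corrupted cohort
P_ZEROED = 0.9         # assumed share of outcomes overwritten with 0

rows = []  # each row: (cohort, exposure, outcome as recorded)
for cohort in range(N_COHORTS):
    baseline = 5.0 + 0.1 * cohort  # cohorts differ a little at baseline
    for _ in range(N_PER_COHORT):
        a = 1 if random.random() < 0.5 else 0
        y = baseline + TRUE_EFFECT * a + random.gauss(0.0, 1.0)
        # Upload error: depends only on the cohort, not on a or y
        # (this is the assumption being illustrated).
        if cohort == BAD_COHORT and random.random() < P_ZEROED:
            y = 0.0
        rows.append((cohort, a, y))

def mean_difference(data):
    """Crude effect estimate: mean outcome in exposed minus unexposed."""
    exposed = [y for _, a, y in data if a == 1]
    unexposed = [y for _, a, y in data if a == 0]
    return sum(exposed) / len(exposed) - sum(unexposed) / len(unexposed)

est_all = mean_difference(rows)
est_excluded = mean_difference([r for r in rows if r[0] != BAD_COHORT])

print("true effect:          ", TRUE_EFFECT)
print("all cohorts (corrupt):", round(est_all, 3))
print("bad cohort excluded:  ", round(est_excluded, 3))
```

In this toy setup the estimate with the bad cohort excluded stays close to the simulated effect, while keeping the corrupted outcomes pulls it toward zero. That only holds under the stated assumption; if the error, or membership in that cohort, were related to A or Y, the picture could be different, which is exactly what I am unsure about.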
Please advise. Thanks.
-Sudhi