Selection bias or good data management?

I am working with a dataset that is an amalgamation of patients belonging to different cohorts: 800 patients from Cohort 12345, 350 from Cohort 54321, 900 from Cohort 5689, and so on. In the end the dataset has around 45,000 observations from 11 different cohorts.

Data from one cohort in particular is fishy. Most of the patients in that cohort have 0s for their outcome. The 0s are excessive; they don't appear to come from any natural data-generating process. These 0s are due to operator error: something went wrong when the data were uploaded to the database.

I want to exclude the data from this cohort, which would be around 1,200 observations. This won't make a huge change in my overall sample size; I am still left with around 40,000 observations. But what about selection bias? Would excluding this cohort because of its unnaturally excessive zeros introduce selection bias? I know that conditioning on a common cause of exposure and outcome opens a path that may lead to selection bias. In this case, does excluding a cohort because its data are very unreliable amount to conditioning on a common cause of A (exposure) and Y (outcome), and thereby lead to selection bias?

Please advise. Thanks.

-Sudhi

Without more information on the selection process (site or patient/subject), I don’t think anyone can provide a “proper” full answer. Was the selection/inclusion process the same for all cohorts/sites? Would removing that single cohort somehow result in your entire dataset no longer being representative of the population under study (e.g., that cohort consisted of hospitalized/severely ill patients)? If exclusion of the site doesn’t affect randomization, there shouldn’t be a problem.

From what you say, it doesn’t sound like the data are “unreliable”, but rather outright wrong due to operator error. If there is no way to recover the true data, I think it would be worse than any (potential) effect due to bias if you were to include it.

Thanks Watson. The inclusion of the various cohorts is purely administrative. It's like a BBQ party where, if you want to attend, you bring 12 cans of beer: if a scientist wanted to participate in this NIH-funded study, they had to share their cohort data. So 11 scientists from all over the US decided to participate, and each shared their data, 11 x n, which resulted in around 50,000 observations from 11 cohorts. This particular cohort is mostly white women, and excluding it will not drastically change the proportion of white women.

I agree that including this data would be more harmful than any potential bias from excluding it.

It would seem that exclusion of data from one of the participating cohorts because the data are poor quality (are not credible) is tenable if there is reasonable certainty that the data are, in truth, “wrong.” The more that you know about why the data became “wrong,” the stronger will be the case to exclude the cohort. I would suggest that you talk to the PI (i.e., pick up the phone, write an email) and ask the PI whether there is an explanation for the data pattern that you observe.

A careful reading of the protocol for this cohort and the other 10 cohorts might identify some difference in eligibility to be in the cohort compared with the other cohorts or in follow-up methods. Again, the more you know about how the data were collected, stored, and managed and why the data are “anomalous,” the better.

Doing all analyses with and without the data from this cohort as a kind of “sensitivity analysis” is another strategy to be considered.
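If it helps, the with/without comparison is easy to automate. Below is a minimal sketch, assuming a pandas DataFrame with hypothetical columns `cohort`, `exposure`, and `outcome`, and using a simple OLS slope as a stand-in for whatever model you are actually fitting; the names and the toy data are illustrative only:

```python
import numpy as np
import pandas as pd

def exposure_slope(df):
    # Simple OLS slope of outcome on exposure (stand-in for the real model).
    return np.polyfit(df["exposure"], df["outcome"], 1)[0]

def sensitivity(df, suspect_cohort):
    """Fit the same estimate on the full data and with one cohort excluded."""
    full = exposure_slope(df)
    reduced = exposure_slope(df[df["cohort"] != suspect_cohort])
    return full, reduced

# Toy data: cohort "C" has its outcomes wrongly recorded as 0.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "cohort": rng.choice(["A", "B", "C"], n),
    "exposure": rng.normal(size=n),
})
df["outcome"] = 2.0 * df["exposure"] + rng.normal(size=n)
df.loc[df["cohort"] == "C", "outcome"] = 0.0  # corrupted zeros

full, reduced = sensitivity(df, "C")
print(full, reduced)  # reduced should sit near the true slope of 2
```

If the two estimates differ materially, that difference is itself worth reporting alongside the explanation for why the cohort was excluded.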

What you are doing would seem to have an analogy with individual patient (individual participant) meta-analysis of clinical trials (called IPD meta-analysis or IPD MA).

Using the individual patient meta-analysis analogy, my greater concern about selection bias would be the selection of these 11 cohorts out of however many cohorts could potentially have provided data access.

How many potentially eligible cohorts are there? How many participants are there in the cohorts that did not provide data access? Is there any information about the frequency of the outcome of interest in the non-participating cohorts?

Pertinent literature about bias in individual patient meta-analysis of clinical trials follows.

Ahmed I, Sutton AJ, Riley RD. Assessment of publication bias, selection bias, and unavailable data in meta-analyses using individual participant data: a database survey. BMJ. 2012 Jan 3;344:d7762. doi: 10.1136/bmj.d7762. PMID: 22214758.

Debray TP, Moons KG, van Valkenhoef G, Efthimiou O, Hummel N, Groenwold RH, Reitsma JB; GetReal Methods Review Group. Get real in individual participant data (IPD) meta-analysis: a review of the methodology. Res Synth Methods. 2015 Dec;6(4):293-309. doi: 10.1002/jrsm.1160. Epub 2015 Aug 19. PMID: 26287812; PMCID: PMC5042043.

Hi EpiMD5,

This issue was discussed with the Cohort PI, and unfortunately the PI has no idea how the data got corrupted, because the data came from labs and changed hands a few times before being uploaded to the database.

Sensitivity analysis is a great suggestion. Thanks, I will try that.

As for the cohorts that did not provide data access: no idea, we don't have information about what could have been. It's difficult to prove exchangeability, if that's what you are wondering. But exclusion does introduce selection bias in this case due to the measurement error.

Thanks, I will go through the material on IPD-MA.

In my experience, “weird zeroes” in a dataset often arise when a spreadsheet (almost always Excel) has been used to “manage” the data either at the data entry stage or somewhere in the chain of “cleaning,” updating, merging, or uploading.

Also, in my experience, “weird zeroes” can arise if there are changes in the representation of missing data, especially when data are “managed” using spreadsheets. You might ask the PI how missing data on outcome (which I understand is the problem in your rogue cohort) were handled and whether the representation of missing data on outcome changed during the period of data collection.

If the data collection period spanned versions of a spreadsheet, missing data may have been inadvertently “converted” to a zero value when data from two versions were aggregated.
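A quick screen for this kind of corruption is to tabulate the exact-zero rate of the outcome by cohort and flag any cohort that is far out of line with the rest. Here is a minimal sketch, assuming a pandas DataFrame with hypothetical `cohort` and `outcome` columns; the "5x the median rate" threshold is arbitrary and would need tuning to your data:

```python
import pandas as pd

def zero_rate_by_cohort(df, outcome="outcome", cohort="cohort"):
    """Fraction of exactly-zero outcome values within each cohort."""
    return df.groupby(cohort)[outcome].agg(lambda s: s.eq(0).mean())

def flag_suspect(rates, factor=5.0):
    """Flag cohorts whose zero rate exceeds `factor` times the median rate."""
    baseline = max(rates.median(), 1e-9)  # avoid a zero threshold
    return rates[rates > factor * baseline].index.tolist()

# Toy example: cohort "X" is riddled with zeros.
df = pd.DataFrame({
    "cohort": ["A"] * 10 + ["B"] * 10 + ["X"] * 10,
    "outcome": [1, 2, 0, 3, 4, 1, 2, 3, 1, 2,
                2, 0, 1, 3, 2, 4, 1, 2, 3, 1,
                0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
})
rates = zero_rate_by_cohort(df)
print(flag_suspect(rates))  # ['X']
```

A table of these rates per cohort is also a useful exhibit when you take the problem back to the PI.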

Good luck!

Ah ha, thanks. This happened to me once before when I was creating a dataset for an investigator. When I created an Excel file, the 0s at the end of the patient IDs were automatically deleted due to Excel's default formatting. There was some confusion until we identified the root cause of the problem. Excel also converts numbers to scientific notation (E). Your suggestion reminds me of that incident. I will discuss this point with my team. Thanks again.
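For what it's worth, the usual guard against this kind of ID mangling on the pandas side (the classic case being leading zeros stripped when an ID column is parsed as a number) is to force ID columns to be read as text. A minimal sketch with made-up data:

```python
import io
import pandas as pd

# A CSV as it might come out of a lab export; IDs carry leading zeros.
raw = "patient_id,outcome\n00123,4\n00456,7\n"

# Default parsing treats patient_id as a number and drops the zeros.
bad = pd.read_csv(io.StringIO(raw))
print(bad["patient_id"].tolist())   # [123, 456]

# Forcing the column to string preserves the IDs exactly as recorded.
good = pd.read_csv(io.StringIO(raw), dtype={"patient_id": str})
print(good["patient_id"].tolist())  # ['00123', '00456']
```

The same idea applies in Excel itself: formatting the ID column as Text before pasting or importing prevents both the dropped zeros and the scientific-notation conversion.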
