tl;dr I want to validate registry data. I’m trying to devise a strategy to reduce the number of medical records that need to be reviewed for the validation, using a sample size calculation based on something like a minimum effect size of interest, but I can’t work out how to formulate it. Any and all suggestions are appreciated, especially suggestions of textbooks or journal articles relevant to the problem.
The Nordic countries have cultivated individual-level, population-based registries that include outpatient visits, emergency department visits, hospital admissions, and all prescriptions. Data can be linked between registries using government-issued national identification numbers. In the case of Iceland, the available data also include all laboratory values. Epidemiological research is common in all of these countries, but unfortunately such studies often consider only International Classification of Diseases (ICD) diagnostic codes when classifying patients as either having or not having a disease, even though much more data are readily available that would likely substantially improve the accuracy of the classification.
Consider the case of heart failure (ICD code I50). Additional information that may increase the sensitivity, specificity, and positive and negative predictive values of registry-based ascertainment includes prescriptions for furosemide, torsemide, and spironolactone, or emergency department visits in which intravenous furosemide was administered or brain natriuretic peptide (BNP) was elevated; the list goes on.
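For concreteness, here is a minimal sketch of how one such enriched case definition might be encoded as a rule over linked registry data (Python; the field names, the ATC grouping, and the BNP cutoff are purely illustrative, not a proposed definition):

```python
from dataclasses import dataclass

@dataclass
class PatientRecord:
    # Hypothetical per-patient summary of linked registry data
    icd_codes: set[str]      # all recorded ICD-10 codes
    atc_codes: set[str]      # all dispensed drugs (ATC codes)
    ed_iv_furosemide: bool   # IV furosemide ever given in the ED
    max_bnp: float           # highest recorded BNP (pg/mL); 0.0 if never measured

DIURETICS = {"C03CA01", "C03CA04", "C03DA01"}  # furosemide, torsemide, spironolactone
BNP_CUTOFF = 100.0  # illustrative threshold only

def meets_hf_definition(p: PatientRecord) -> bool:
    """One candidate definition: an I50 code plus at least one corroborating signal."""
    has_code = any(code.startswith("I50") for code in p.icd_codes)
    corroboration = (
        bool(p.atc_codes & DIURETICS)
        or p.ed_iv_furosemide
        or p.max_bnp > BNP_CUTOFF
    )
    return has_code and corroboration
```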
I am considering launching a large-scale validation study of certain conditions in population-based registries. I would like to pre-specify several different case definitions, using mixes of ICD diagnostic codes, routine prescription data, and laboratory results, and report the sensitivity, specificity, and positive and negative predictive values for each definition. For some diagnoses, I have prospectively validated registries available, and I can use these to calculate the desired indices for the patients included in them. For the majority of the work, however, my co-authors or I would need to examine a random sample of patient records that either do or do not meet the case definition under study.
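Once chart review yields a 2×2 table of case-definition status against true (record-reviewed) status, computing the indices is the easy part; a self-contained sketch with exact (Clopper–Pearson) confidence intervals, using made-up counts:

```python
from scipy.stats import beta

def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Point estimate and exact binomial CI for a proportion k/n."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return k / n, (lo, hi)

def accuracy_indices(tp: int, fp: int, fn: int, tn: int) -> dict:
    # tp = meets the definition and truly has the disease, etc.
    return {
        "sensitivity": clopper_pearson(tp, tp + fn),
        "specificity": clopper_pearson(tn, tn + fp),
        "PPV":         clopper_pearson(tp, tp + fp),
        "NPV":         clopper_pearson(tn, tn + fn),
    }

print(accuracy_indices(tp=90, fp=10, fn=15, tn=185))  # illustrative counts only
```

One wrinkle I am aware of: if records are sampled conditional on case-definition status, PPV and NPV are estimated directly from the two samples, whereas sensitivity and specificity have to be back-calculated using the sampling fractions.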
Before rushing headlong into this, I need a method to predict (and minimize) the number of patient records that would need to be manually reviewed. My feeling is that calculating the minimum required sample size would involve a pilot study, reviewing some number n of records with and without the diagnosis, and defining a “minimum effect size of interest”-like quantity for the different indices. I just can’t work out where to start. Any and all suggestions are appreciated.
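What I have in mind is something like the standard precision-based calculation, treating each index as a binomial proportion: take a pilot estimate p of the index, choose the widest acceptable confidence-interval half-width d (my “minimum effect size of interest”), and solve n = z²·p(1−p)/d² for the number of records required in the relevant denominator group. A sketch with made-up pilot values:

```python
import math
from scipy.stats import norm

def n_for_proportion(p_pilot: float, half_width: float, conf: float = 0.95) -> int:
    """Records needed so a normal-approximation CI around p_pilot
    has at most the requested half-width."""
    z = norm.ppf(1 - (1 - conf) / 2)
    return math.ceil(z**2 * p_pilot * (1 - p_pilot) / half_width**2)

# E.g., if a pilot suggests PPV ~ 0.85 and I want to estimate it to within +/- 0.05:
print(n_for_proportion(0.85, 0.05))  # records meeting the definition to review
# For sensitivity the denominator is truly diseased patients, so the number of
# records reviewed must be inflated accordingly; p_pilot = 0.5 is the
# conservative choice when the pilot is uninformative.
print(n_for_proportion(0.50, 0.05))
```

Whether this is the right framing, and how to extend it across several indices and several case definitions simultaneously, is exactly where I am stuck.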