Validation of population-based registries

tl;dr I want to validate registry data. I’m trying to devise a strategy to reduce the number of medical records that need to be reviewed for the validation, using something like a “minimum effect size of interest”-style sample size calculation, but I can’t work out how to formulate it. Any and all suggestions are appreciated, especially pointers to textbooks or journal articles relevant to the problem.

The Nordic countries have cultivated individual-level population-based registries that include outpatient visits, emergency department visits, hospital admissions, and all prescriptions. Data can be linked between registries using government-issued national identification numbers. In the case of Iceland, the available data include all laboratory values. Epidemiological research is common in all of these countries, but unfortunately such studies often consider only International Classification of Diseases (ICD) diagnostic codes when classifying patients as either having or not having a disease. This is unfortunate, as much more data are readily available and would likely substantially improve the accuracy of the classification.

Consider the case of heart failure (ICD code I50). Additional information that may increase the sensitivity, specificity, and positive and negative predictive values of registry-based ascertainment includes prescriptions for furosemide, torsemide, and spironolactone; emergency department visits in which intravenous furosemide was administered or brain natriuretic peptide (BNP) was elevated; the list goes on.
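As a toy illustration only (not a real registry schema — the field names `icd_codes`, `prescriptions`, `bnp_pg_ml`, and `iv_furosemide_ed`, and the BNP cutoff, are all hypothetical), a composite case definition of this kind might be sketched as:

```python
def meets_hf_definition(patient: dict) -> bool:
    """Flag a patient as a heart-failure case when an I50 code is
    corroborated by at least one additional data source.
    All field names and the BNP threshold are illustrative."""
    has_icd = any(code.startswith("I50") for code in patient.get("icd_codes", []))
    diuretic_rx = bool({"furosemide", "torsemide", "spironolactone"}
                       & set(patient.get("prescriptions", [])))
    bnp = patient.get("bnp_pg_ml")
    elevated_bnp = bnp is not None and bnp > 100  # hypothetical cutoff
    iv_furosemide = patient.get("iv_furosemide_ed", False)
    return has_icd and (diuretic_rx or elevated_bnp or iv_furosemide)
```

Requiring corroboration would be expected to raise specificity and PPV at some cost to sensitivity; the point of the validation study is to quantify that trade-off for each pre-specified definition.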

I am considering launching a large-scale validation study of certain conditions in population-based registries. I would like to pre-specify several different case definitions, using combinations of ICD diagnostic codes, routine prescription data, and laboratory results, and report the sensitivity, specificity, and positive and negative predictive values for each strategy. For some diagnoses, validated prospective registries are already available, and I can use these to calculate the desired indices for the patients they include. However, for the majority of the work, my co-authors or I would need to examine a random sample of patient records that either do or do not meet the case definition under study.
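For reference, the four indices come directly from the 2×2 table that cross-classifies each case definition against chart review as the gold standard; a minimal helper might look like:

```python
def diagnostic_indices(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Sensitivity, specificity, PPV, and NPV from a 2x2 table:
    rows = registry case definition (positive/negative),
    columns = chart review (disease present/absent)."""
    return {
        "sensitivity": tp / (tp + fn),  # among true cases, fraction flagged
        "specificity": tn / (tn + fp),  # among non-cases, fraction not flagged
        "ppv": tp / (tp + fp),          # among flagged, fraction truly cases
        "npv": tn / (tn + fn),          # among not flagged, fraction truly non-cases
    }
```

Note that PPV and NPV depend on the prevalence of the condition in the sampled population, which matters if records are sampled separately from the definition-positive and definition-negative strata.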

Before rushing headlong into this, I need a method to predict (and minimize) the number of patient records that would need to be manually reviewed. I have a feeling that calculating the minimum required sample size would involve a pilot study, looking at n = x records with and without the diagnosis, and defining a “minimum effect size of interest”-like quantity for the different indices. I just can’t think where to start. Any and all suggestions are appreciated.

1 Like

A good starting point is to compute the sample size needed to estimate a simple proportion very precisely, whether it be a proportion of patients for which two things agreed, a sensitivity, or a specificity. To estimate a proportion with a margin of error of no more than 0.05 requires 384 patients in the worst case (true proportion near 0.5). BBR has some details.
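The 384 figure follows from the usual normal-approximation half-width formula, n = z² p(1−p) / d², evaluated at the worst case p = 0.5 (1.96² × 0.25 / 0.05² ≈ 384.2; rounding up gives 385). A small sketch, using only the standard library:

```python
import math
from statistics import NormalDist

def n_for_proportion(margin: float, p: float = 0.5, conf: float = 0.95) -> int:
    """Smallest n such that the normal-approximation confidence interval
    for a proportion has half-width at most `margin`.
    p = 0.5 is the worst case (largest binomial variance)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # ~1.96 for 95%
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

print(n_for_proportion(0.05))  # 385 (the unrounded value, ~384.2, is often quoted as 384)
```

Tightening the margin is expensive: halving it to 0.025 roughly quadruples the required sample, which is exactly why a strategy for minimizing record review matters here.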

1 Like
  • Are diagnoses qualified with meta-information? For example, is it clear if a diagnosis was merely suspected at the time of hospitalisation, or was it already established or confirmed during hospitalisation? It seems EHR systems capture such different types of diagnoses, which may be needed for reimbursement and administrative purposes.
  • Perhaps the question of validation can in this context be linked to that of completeness of registries (accurately capturing population features)? For methodology discussions, see, for example, Estimating completeness in cancer registries–comparing capture-recapture methods in a simulation study.
1 Like

I have a couple of ideas, the first being a variation on Frank’s idea.

The first would be to use the Rule of Three on a random sample of patients to assess for errors in records. The Rule of Three essentially states that if you review a random sample of N records and do not identify any errors, the 95% confidence interval for the incidence of errors in the population of records runs from 0 to approximately 3/N.

So, if you review 500 randomly sampled records, for example, and do not observe any errors in those 500 records, then the 95% confidence interval for the 0% observed error incidence in the total population record set would be 0–0.6%. Obviously, adjust the random sample size until the upper bound of that interval is an acceptable value.
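The arithmetic is simple enough to sketch, both ways around (upper bound from a sample size, and sample size from a target upper bound):

```python
import math

def rule_of_three_upper(n: int) -> float:
    """Approximate 95% upper confidence bound for the error rate when
    zero errors are found in n randomly sampled records (3/n rule)."""
    return 3 / n

def n_for_upper_bound(max_rate: float) -> int:
    """Smallest zero-error sample size whose 95% upper bound
    is at most max_rate."""
    return math.ceil(3 / max_rate)

print(rule_of_three_upper(500))  # 0.006, i.e. 0.6%
```

The 3/N bound is itself an approximation to the exact binomial result, 1 − 0.05^(1/N), and is slightly conservative for moderate N.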

The random sample could be from all patients, or a random sample for each of selected diagnostic codes, depending upon the resources that you have to allocate to the project.

The second approach would be to use something akin to Risk Based Monitoring (RBM), which would require more up front work, but would take a probabilistic approach to identifying and then reviewing records that are at a “high risk” of being problematic. The goal of RBM in clinical trials has been to get away from 100% source data validation (SDV), using a combination of approaches to identify records and/or sites that would require manual review, to reduce the scope and cost of SDV.

This would require access to the actual data files and up-front work to program the heuristics, which would then be applied to all records. The heuristics would contain patient-level and perhaps site-level (if that makes sense) logic checks on the data, looking for missing data, clinically inconsistent data, and other relevant issues; records that fail the checks would be targeted for manual review. That would be consistent with the examples you gave for CHF above, looking for associated medications and other factors that would be expected given a specific diagnosis. That process would then ideally give you some idea of the possible incidence of errors in the population source data, and you can then take whatever action would be apropos.
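A minimal sketch of such record-level heuristics, with entirely hypothetical field names, check definitions, and thresholds:

```python
# Each check is (name, predicate); the predicate returns True when the
# record FAILS the check. All fields and rules here are illustrative.
CHECKS = [
    ("missing ICD code",
     lambda r: not r.get("icd_codes")),
    ("HF code without any loop diuretic",
     lambda r: any(c.startswith("I50") for c in r.get("icd_codes", []))
               and not ({"furosemide", "torsemide"}
                        & set(r.get("prescriptions", [])))),
    ("implausible BNP value",
     lambda r: r.get("bnp_pg_ml", 0) < 0),
]

def flag_for_review(record: dict) -> list:
    """Return the names of all failed checks; a non-empty list
    marks the record for targeted manual source-data review."""
    return [name for name, failed in CHECKS if failed(record)]
```

Run over the full record set, this yields both a review worklist and, as a by-product, per-check failure rates that can feed the kind of probabilistic error-incidence estimate described above.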

At the end of the day, if you are going to do something short of 100% SDV, any approach you take will yield probabilistic inferences about the incidence of errors in the population of records, so it may come down to the resources that you are able to apply to the task.

As you note, a pilot study may be helpful in providing some rough estimate of the underlying error incidence, which may, in turn, influence your path forward. The Rule of Three approach is logically premised on the notion that the inherent error incidence is low, or at least below some maximum acceptable level. If you did a pilot study of, say, 50 randomly sampled records and did not find any errors, the upper bound of the 95% confidence interval would be 6%. On the other hand, if you identified some number of errors in that sample, that might suggest the Rule of Three approach is not viable, assuming an error incidence likely above 6% is not acceptable.