Balancing confidentiality and scientific rigor with anonymized data sets

I’ve read a few reports over the past 2 years where one research team releases “anonymized” data and, later, computer scientists test whether they can develop a model that identifies individuals from the anonymized data. A NY Times article referred to the following paper, which suggests that most processes used to anonymize data are easily foiled.

Estimating the success of re-identifications in incomplete datasets using generative models

Our paper shows how the likelihood of a specific individual to have been correctly re-identified can be estimated with high accuracy even when the anonymized dataset is heavily incomplete.

I was interested in the thoughts of researchers regarding the methods used.

I don’t know enough about it, but I’d guess it’s an almost intractable problem, e.g. in the extremes there are small numbers; I heard one researcher say they perturb the age a bit for the elderly. I think REDCap just shifts the dates by the same amount for a given individual, since the analyst only ever uses the difference between dates, which is unchanged. The ‘hacker’ would need some additional, extraneous data source, I’d think, unless they are e.g. the mother of a child who knows the date, weight, and hospital her child was born in and thus could find them in a registry database. A lot of effort …
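For what it’s worth, the date-shifting idea is easy to sketch in a few lines. This is just an illustration of the general technique (shift every date for a given participant by one random per-participant offset), not REDCap’s actual implementation; the function name and shift range here are my own assumptions.

```python
from datetime import date, timedelta
import random

def shift_participant_dates(dates, max_shift_days=364, rng=None):
    """Shift all of one participant's dates by the same random offset.

    Intervals between dates (e.g. days from enrollment to an event)
    are preserved, while absolute calendar dates are obscured.
    """
    rng = rng or random.Random()
    offset = timedelta(days=rng.randint(-max_shift_days, max_shift_days))
    return [d + offset for d in dates]

enrollment = date(2019, 3, 1)
event = date(2019, 3, 15)
shifted = shift_participant_dates([enrollment, event])
# The 14-day interval the analyst cares about is unchanged:
assert (shifted[1] - shifted[0]).days == 14
```

Because every date for a participant moves by the same offset, any analysis built on date differences is untouched, which is exactly why this is a popular de-identification step.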

I saw this on Twitter yesterday and chatted about it there.

I think there are some real concerns here about privacy of data in the real world (i.e. whatever Google, Facebook, Amazon, etc have on all of us) but I do not think the top-line finding which many people are repeating is pertinent to our world re: research teams releasing anonymized data, and I’ll explain why.

Many people have had their eyes caught by this sentence: “Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes.”

Ohmygod! They can re-identify 99.98% of Americans in any dataset using 15 demographic attributes?

But wait, that’s kind of a clever wording twist. The “in any dataset” tricks you, but it would require that the dataset include not just ANY 15 demographic variables but some very specific ones: especially, note that the datasets that they used for this exercise included date of birth and zip code. Right there, we already have a huge problem in applying these findings to something like the public-access dataset from a clinical trial, because no public-access dataset from a clinical trial should include date of birth (or any dates for that matter) nor should they include a full zip code. I’m not certain but I’d give pretty good odds that date of birth is a CRITICAL piece of this puzzle. So I don’t think this “oh my god you can re-identify 99% of Americans in any dataset with demographic data” is at all reflective of the probability of someone correctly re-identifying subjects in a public-access dataset from a clinical trial.

I would like to see this team apply their methods to a public-access dataset from a clinical trial that met full HIPAA de-identification standards (name, geographic subdivisions smaller than a state, all elements of dates smaller than a year, the obvious stuff (phone numbers, email addresses, IP addresses, SSN, MRN, etc.), vehicle identifiers). With enough scraping, they might be able to correctly re-identify a few people, but I would be shocked if it was anything like 99% (really, I’d be shocked if it was more than 10%). So while this is going to get a lot of press, I don’t think (in its current form) it has much application to whether it’s okay to release a public access dataset from a clinical trial that’s de-identified to the standard described above. If they redid this work and showed me that they could correctly re-identify a high percentage of people on a public-access dataset from a clinical trial that had the de-identification described above, I might revisit this position.

EDIT: they actually provide a little tool where you can check your probability of being correctly identified with / without selections of 7 attributes (gender, DOB, marital status, zip code, number of vehicles, work status, and house ownership). I played some with my own attributes and slight variations, and any time you take DOB out there is a HUGE reduction (like, it goes from “99%” to “0%” just by taking DOB out).
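A toy way to see why DOB dominates: count how many records are unique on a given combination of quasi-identifiers. The records and helper below are entirely made up to illustrate the uniqueness intuition; this is not the paper’s generative model.

```python
from collections import Counter

# Hypothetical records: (gender, birth_date, 3-digit zip, owns_home).
records = [
    ("F", "1984-02-11", "021", True),
    ("M", "1984-02-11", "021", True),
    ("F", "1990-07-03", "021", True),
    ("F", "1984-02-11", "100", False),
    ("M", "1975-12-30", "021", True),
    ("M", "1975-12-30", "021", True),
]

def fraction_unique(rows, keep):
    """Fraction of records whose quasi-identifier combination is unique."""
    counts = Counter(tuple(r[i] for i in keep) for r in rows)
    return sum(1 for r in rows
               if counts[tuple(r[i] for i in keep)] == 1) / len(rows)

with_dob = fraction_unique(records, keep=(0, 1, 2, 3))     # 4/6 unique
without_dob = fraction_unique(records, keep=(0, 2, 3))     # 1/6 unique
print(with_dob, without_dob)
```

Even in this six-row toy, dropping the high-cardinality birth date collapses uniqueness from 4 in 6 records to 1 in 6, which matches the pattern the tool shows when you remove DOB.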


Thanks. I couldn’t find any obvious flaws, but I figured I was missing something.

I thought one could use a Bayesian model to figure something like this out. It sounds as if the “15 variable model” is much less impressive when most of the information is contained in 1 or 2 variables. You don’t need “machine learning” in that scenario.

Brief clarification: it may be a little tricky to say that “most of” the information is contained in 1 or 2 variables, because they probably need the date of birth in combination with the other variables to uniquely identify people (e.g. “just” having the DOB doesn’t help unless you have the other info as well). My principal issue with the way this is being reported is that people are interpreting this basically as “Nothing is safe! Any de-identified dataset can be re-identified!” when from what I can tell a large chunk of that ability to re-identify rests on the dataset including things that realistically should not be in a dataset that is properly de-identified.

Before we (the medical research community) decide that we have to give up on ever putting public-access datasets out there, I’d like to see this done on an example of a clinical trial dataset with the HIPAA specified identifiers all taken out.

Anyone interested in the formal modelling of these issues might find these links useful. This would be a great area for statisticians and computer scientists to collaborate.

Computer scientists use the term Differential Privacy in their research. Much of the work is done from an information theory perspective.

A good lay person description is here:

The author of the blog post (a cryptographer) gave a very interesting example of a model involving medicine (Warfarin dosing).

The lesson of this research is simply that there are interesting tradeoffs between effectiveness and the privacy protection given by any DP-based system — these tradeoffs depend to a great degree on specific decisions made by the system designers.
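To make that tradeoff concrete, here is a minimal sketch of the Laplace mechanism, the standard building block of differential privacy: add noise scaled to sensitivity/epsilon to a query answer, so that smaller epsilon (stronger privacy) means noisier answers. The function names and inverse-transform sampler are my own illustration, not code from the blog post.

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise via inverse transform sampling."""
    u = rng.random() - 0.5          # uniform on [-0.5, 0.5)
    u = max(u, -0.5 + 1e-12)        # avoid log(0) at the boundary
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count, epsilon, rng=None):
    """Differentially private count: a counting query has sensitivity 1,
    so the noise scale is 1/epsilon."""
    rng = rng or random.Random()
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)
# Smaller epsilon gives a stronger privacy guarantee but a noisier answer:
print(private_count(1000, epsilon=1.0, rng=rng))
print(private_count(1000, epsilon=0.01, rng=rng))
```

Averaged over many releases the noisy counts are unbiased, but any single release at small epsilon can be far from the truth; that is exactly the effectiveness-versus-privacy tradeoff the post describes.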

Here is a paper describing some fundamental results from a theoretical computer science perspective:

A more mathematically formal presentation of the issues by the prior author is found in the following text (PDF available through link):

Algorithmic Foundations of Differential Privacy (link)

Wikipedia has a good reference list to start exploring:

The way computer scientists look at the problem is through the linkage of different data sets held by independent parties.