Is it possible for data to be both anonymous and useful?
The hoarding of data by companies is usually accompanied by the assurance that all the obvious identifiers (name, address, Social Security number) have been deleted – i.e. that the data has been “anonymized”. The goal is to retain the usefulness of the data without endangering the privacy of the individuals it pertains to.
But is that possible? Ars Technica’s Nate Anderson shows with an example that even without the obvious identifiers, data can still be tied to an individual. Case in point: the release of “anonymized” data on the hospital visits of state employees by the Massachusetts Group Insurance Commission in the mid-1990s.
At the time, Latanya Sweeney, a graduate student in computer science, decided to test the claim of anonymity. She bought the voter rolls of the city where the then-Governor of Massachusetts lived (which included the name, address, ZIP code, birth date and sex of every voter) and combined them with the released data. By eliminating every record that conflicted with what the voter rolls said about him, she managed to find out which hospital records were his.
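The matching described above is often called a linkage attack, and a minimal sketch of the idea might look like the following. Everything in it is an assumption made for illustration: the records, names, ZIP codes and field names are invented, and the choice of ZIP code, birth date and sex as the joining fields simply mirrors the voter roll columns mentioned above. It is not a reconstruction of Sweeney's actual procedure.

```python
from collections import defaultdict

# Hypothetical "anonymized" hospital records: names are removed,
# but ZIP code, birth date and sex are still present.
hospital_records = [
    {"zip": "00001", "birth_date": "1950-03-12", "sex": "M", "diagnosis": "A"},
    {"zip": "00001", "birth_date": "1962-11-02", "sex": "F", "diagnosis": "B"},
    {"zip": "00002", "birth_date": "1950-03-12", "sex": "M", "diagnosis": "C"},
]

# Hypothetical public voter roll: the same three fields, plus a name.
voter_roll = [
    {"name": "Voter One", "zip": "00001", "birth_date": "1950-03-12", "sex": "M"},
    {"name": "Voter Two", "zip": "00001", "birth_date": "1962-11-02", "sex": "F"},
    {"name": "Voter Three", "zip": "00002", "birth_date": "1971-06-30", "sex": "M"},
]

# Fields shared by both data sets (an assumption for this sketch).
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def key(record):
    """Return the combination of shared fields used to join the data sets."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(records, voters):
    """Attach a name to every record that matches exactly one voter."""
    voters_by_key = defaultdict(list)
    for voter in voters:
        voters_by_key[key(voter)].append(voter["name"])

    matches = []
    for record in records:
        names = voters_by_key.get(key(record), [])
        # Only an unambiguous match counts as a re-identification.
        if len(names) == 1:
            matches.append((names[0], record["diagnosis"]))
    return matches


reidentified = link(hospital_records, voter_roll)
print(reidentified)
```

In this toy data, two of the three hospital records match exactly one voter and so get a name attached; the third matches nobody and stays anonymous. The more fields two data sets share, the fewer people fit any given combination, which is why removing only the obvious identifiers offers little protection.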
Over the years, there have been other examples of failed anonymization: AOL, Netflix, etc. It seems that most data can become “personal” if combined with enough other relevant data. If that proves to be true, it raises some good questions: if all the data collected in various databases around the world were combined, what could it say about me? And could it be used for malicious purposes?
And from the companies’ and researchers’ point of view: can the information still be useful if stripped of all the potentially “personal” elements?