Improperly anonymized taxi logs reveal drivers’ identity, movements
Software developer Vijay Pandurangan has demonstrated that the data anonymization efforts of governments and businesses are sometimes worryingly inadequate: he easily deanonymized data detailing 173 million individual trips made by New York City taxi drivers.
The data was provided to Chris Whong, an “urbanist, mapmaker, data junkie”, following a Freedom of Information request, and he made it available to the public.
“Each trip record includes the pickup and dropoff location and time, anonymized hack licence number and medallion number (i.e. the taxi’s unique id number), and other metadata,” explained Pandurangan.
Government officials did, to their credit, try to anonymize the personally identifiable information (driver’s licence number and taxi number), but unfortunately they did it poorly: they used the MD5 algorithm to hash it.
“A cryptographically secure hashing function, like MD5, is a one-way function: it always turns the same input to the same output, but given the output, it’s pretty hard to figure out what the input was as long as you don’t know anything about what the input might look like. The problem, however, is that in this case we know a lot about what the inputs look like,” Pandurangan pointed out.
He knew that NYC taxi licence numbers are 6-digit numbers or 7-digit numbers starting with a 5, and that medallion numbers have to conform to a few specific patterns.
He then simply calculated all the possible hashes for both kinds of number (in less than two minutes!), and used that list to recover the original numbers. With that information in hand, he used online resources to look up the identities of the medallion owners.
He was then effectively in possession of information that showed the daily movements of those individuals.
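The exhaustive search described above can be sketched in a few lines of Python. This is an illustration only, not Pandurangan's actual code: the function names are hypothetical, and it assumes the numbers were hashed as plain ASCII digit strings and that 6-digit licence numbers may carry leading zeros. Only the licence number format given in the article is covered; the medallion patterns are not spelled out there, so they are left out.

```python
import hashlib


def md5_hex(value: str) -> str:
    """Return the lowercase hex MD5 digest of an ASCII string."""
    return hashlib.md5(value.encode("ascii")).hexdigest()


def candidate_licences():
    """Yield every licence number matching the format described in the article."""
    # Assumption: 6-digit numbers may have leading zeros.
    for n in range(1_000_000):
        yield f"{n:06d}"
    # 7-digit numbers starting with a 5.
    for n in range(5_000_000, 6_000_000):
        yield str(n)


def recover(hashed_values):
    """Map each hash in hashed_values back to the licence number that produced it."""
    targets = {h.lower() for h in hashed_values}
    found = {}
    for candidate in candidate_licences():
        digest = md5_hex(candidate)
        if digest in targets:
            found[digest] = candidate
            if len(found) == len(targets):
                break  # every target has been reversed
    return found


# Stand-ins for hashed values as they might appear in the published file.
published = [md5_hex("123456"), md5_hex("5000001")]
recovered = recover(published)
for digest, licence in recovered.items():
    print(digest, "->", licence)
```

With only about two million candidates to try, the whole input space can be hashed very quickly on ordinary hardware, which is why knowing the input format defeats the hash.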
“This anonymization is so poor that anyone could, with less than 2 hours’ work, figure which driver drove every single trip in this entire dataset. It would even be easy to calculate drivers’ gross income, or infer where they live,” he noted, adding that using hash functions to anonymize data is not a good solution, a fact that has been proven over and over again.
He offered two alternative solutions for this particular example: assigning a random number to each hack licence number and medallion number once, and re-using it throughout the dump file, or creating a secret AES key, and encrypting each value individually.
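The first of these suggestions, a random number assigned once per identifier and re-used throughout the file, might look roughly like the following sketch. The class and variable names are hypothetical, the sample trips are made up for illustration, and the 8-byte token length is an arbitrary choice. The approach only protects drivers if the mapping is kept secret or discarded after the file has been produced.

```python
import secrets


class Pseudonymizer:
    """Assign each real identifier a random token, re-used on every later lookup."""

    def __init__(self, token_bytes: int = 8):
        self._token_bytes = token_bytes
        self._by_real_id = {}   # real identifier -> random token (must stay secret)
        self._issued = set()    # tokens already handed out, to avoid collisions

    def pseudonym(self, real_id: str) -> str:
        token = self._by_real_id.get(real_id)
        if token is None:
            # The token comes from a CSPRNG and has no relation to the input,
            # so enumerating possible inputs reveals nothing.
            token = secrets.token_hex(self._token_bytes)
            while token in self._issued:
                token = secrets.token_hex(self._token_bytes)
            self._issued.add(token)
            self._by_real_id[real_id] = token
        return token


# Made-up sample records: (hack licence number, other trip data).
hack_ids = Pseudonymizer()
trips = [("123456", "pickup A"), ("5000001", "pickup B"), ("123456", "pickup C")]
anonymized = [(hack_ids.pseudonym(licence), place) for licence, place in trips]
for row in anonymized:
    print(row)
```

Trips by the same driver still share a token, so the data stays useful for analysis, but the token itself cannot be reversed by enumerating licence numbers.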