The cost of privacy: destruction of data-mining utility in anonymized data publishing

14 years 5 months ago


Re-identification is a major privacy threat to public datasets containing individual records. Many privacy protection algorithms rely on generalization and suppression of "quasi-identifier" attributes such as ZIP code and birthdate. Their objective is usually syntactic sanitization: for example, k-anonymity requires that each "quasi-identifier" tuple appear in at least k records, while ℓ-diversity requires that the distribution of sensitive attributes for each quasi-identifier have high entropy. The utility of sanitized data is also measured syntactically, by the number of generalization steps applied or the number of records with the same quasi-identifier. In this paper, we ask whether generalization and suppression of quasi-identifiers offer any benefits over trivial sanitization which simply separates quasi-identifiers from sensitive attributes. Previous work showed that k-anonymous databases can be useful for data mining, but k-anonymization does not guarantee any ...
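The two syntactic conditions named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The toy records, the attribute names (`zip`, `birth_year`, `diagnosis`), and the function names are all hypothetical. The ℓ-diversity check assumes one common formalization, entropy ℓ-diversity, in which the entropy of the sensitive attribute in every quasi-identifier group must be at least log ℓ; the abstract itself only says "high entropy".

```python
from collections import Counter, defaultdict
from math import log


def is_k_anonymous(records, quasi_ids, k):
    """True if every quasi-identifier tuple appears in at least k records."""
    counts = Counter(tuple(r[a] for a in quasi_ids) for r in records)
    return all(c >= k for c in counts.values())


def min_sensitive_entropy(records, quasi_ids, sensitive):
    """Smallest entropy (in nats) of the sensitive attribute over all groups."""
    groups = defaultdict(Counter)
    for r in records:
        groups[tuple(r[a] for a in quasi_ids)][r[sensitive]] += 1

    def entropy(counter):
        total = sum(counter.values())
        return -sum((c / total) * log(c / total) for c in counter.values())

    return min(entropy(c) for c in groups.values())


def is_entropy_l_diverse(records, quasi_ids, sensitive, l):
    """Assumed formalization: every group's entropy must be at least log(l)."""
    # Small tolerance guards against floating-point rounding at the boundary.
    return min_sensitive_entropy(records, quasi_ids, sensitive) >= log(l) - 1e-12


# Hypothetical, already-generalized toy table (not data from the paper).
QI = ["zip", "birth_year"]
records = [
    {"zip": "787**", "birth_year": "197*", "diagnosis": "flu"},
    {"zip": "787**", "birth_year": "197*", "diagnosis": "asthma"},
    {"zip": "787**", "birth_year": "198*", "diagnosis": "flu"},
    {"zip": "787**", "birth_year": "198*", "diagnosis": "flu"},
]

if __name__ == "__main__":
    print(is_k_anonymous(records, QI, 2))
    print(is_entropy_l_diverse(records, QI, "diagnosis", 2))
```

In this toy table each quasi-identifier tuple covers two records, so the table is 2-anonymous, yet the second group has only one diagnosis value and therefore fails the entropy check for ℓ = 2. This shows that the two conditions are distinct syntactic properties of the published table.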

Justin Brickell, Vitaly Shmatikov
