際際滷

際際滷Share a Scribd company logo
Large Scale Threats to Data Anonymity Arvind Narayanan Joint work with Vitaly Shmatikov Kamalika Chaudhuri
Anonymity is not cryptography Small keyspace  random guessing succeeds with probability 1/N  Natural upper bound on N   the race is over ! Guess-and-verify paradigm Even quadratic algorithms sometimes feasible! Conventional wisdom relied on computational infeasibility
The curse of dimensionality Too much entropy per record How high is high?  Try 35,540! k-anonymity breaks down Nearest neigbhor too far Cinematch beats baseline by 1%! Projection to low dimensions loses most of the info
Auxiliary information Auxiliary information about people very easy to obtain Unlinkability of user traces  unaffordable luxury Yet linking across databases often disastrous Future privacy   linkage of profile to identify makes virtual identities impossible
Two fallacies Identifying vs. non-identifying attributes All  attributes are quasi-identifiers! Simply removing record labels is  not  sufficient Perturbation makes attackers task harder Note superficial similarity with LPN But non-cryptographic! Reality: re-identification algorithms easily made noise resilient
Interactive protocols Severe computational limits Query-execute-analyze cycle Utility required may be non-statistical Database may even be non-relational Privacy for queries Data aggregator not trusted Algorithms in distributed setting not well developed yet
Sad realization #2 Privacy usually an afterthought (not important until it affects you) Video privacy act example Privacy vs. utility: Collect/release the data, ask questions later
Sweeney  linking (exact match) (Anonymous) (Non-anonymous) Hardly secret Probably not secret
Collaborative filtering: profiles Each of N users has a preference vector, or a  preference   profile One attribute for each item Goal: mine this database to predict preferences for new items Can we release an anonymized database of preference vectors?
Movielens  fuzzy match Hypothetical investigation Frankowski, Cosley, Sen, Terveen, Riedl.  Anonymized database of movie ratings Attacker knows small number of approximate preferences Nearest neighbor stats confirmed
Netflix  fuzzy match with noise Nearest neighbor graph Real attack, Narayanan & Shmatikov ~ 4 movies  ->  unique re-identification know either ratings or dates approximately one of the data points can be completely wrong Found a couple of our friends Found a couple of users from IMDb
Distance to nearest neighbor
Netflixs take on privacy Even if, for example, you knew  all  your own ratings and their dates you  probably  couldnt identify them reliably in the data because only a small sample was included (less than  one-tenth  of our complete dataset) and that data was subject to  perturbation . Of course, since you know all  your own  ratings that really isnt a privacy problem is it?  -- Netflix Prize FAQ
Example of deanonymization
Netflix  contributions Scoring tolerates large amount of noise  i     M   M  [ e - 留 |r i  - r i |  +  c   e - 硫 |d i  - d i |  +     ] / log #i Verifying deanonymization in absence of oracle [score(max)  score(max2)] / std.dev(score) Extract user relationships
Netflix customers with distance < 0.15 Could edges reflect real-life relationships? Ratings and dates were ignored
Recommenders: stronger attacks Do recommendation systems  inherently  leak profile? No data release! Theoretical attacks known Textbook systems Deployed, complex systems
Social networks Graph of interactions between people Think of phone call graphs Different type of profile Non  relational data
Backstrom, Dwork, Kleinberg Active and passive attacks Re-identify nodes touched by malicious edges Easy to find graph-structured patterns in large database
Narayanan, Chaudhuri Tolerates noise Several attacks where a user can re-identify own node Subgraph isomorphism with several hundred nodes Heuristics involving node labels User knows own degree exactly Modern phones store all calls Who deletes email anymore?
Finding yourself N instances of graph isomorphism Use isomorphism-invariant signatures
油
Propagation of node re-identification Surprisingly small number of seeds (6-12) Large fraction of nodes compromised Works even when large fraction (say 80%) of nodes are honest
油
油
Propagation  implementation Social phishing Buddy zoo Skype worm Online addressbook service Competing social network
Author identification Basically, a solved problem However, most studies use a small set of authors Not clear how well sample size required scales Combine with typing pattern profiling Possibly deanonymize among millions/billions of users Example: oppressive country
Genome anonymity Rich social network ~10^8 bits entropy per record Labeled sample compromises privacy of blood relatives Crossover happens in precise, elegant way Work on admixing populations Story of deanonymization of sperm donor Ease of obtaining auxiliary data or anonymous samples
Genome and DNA databases Hapmap  entire genome  Family tree services 1/800 births from anonymous sperm donor
Hapmaps take on privacy The samples are anonymous with regard to individual identity. Samples cannot be connected to individuals, and no personal information is linked to any sample.  As an additional safeguard, more samples were collected from each population than were used, so no one knows whether any particular person's DNA is included in the study.
Trait油油油油油油油油油油油油油油油油油油 Genes油油油油油油油 Chromosome location Hair/iris color油油油油油油油油油油油油 ASIP油油油油油油油油油油油油油油油  20 q11.2 Hair/iris color油油油油油油油油油油油油 DCT油油油油油油油油油油油油油油油 13 q32 Green/blue iris油油油油油油油油油油油 EYCL1油油油油油油 油油油油  19 p13.1-q13.11 Brown/blue iris油油油油油油油油油油 EYCL3油油油油油油油油油油油  15 q11-q15  * Height 油油油油油油油油油油油油油油油油油油油油油油油油 GH1油油油油油油油油油油油油油油油  17 q22-q24 Height (Laron)油油油油油油油油油油油油 GHR油油油油油油油油油油油油油油油 油  5 p13-p12 Brown/blond hair油油油油油油油 HCL1油油 油油油油油油油油油油  19 p13.1-q13.11 Brown/blond hair油油油油油油油 HCL3 油油油油油油油油油油油油  15 q11-q15 油 * Brown/red hair油油油油油油油油油油油 HCL2 油油油油油油油油油油油 油油 4 q28-q31 Hair/iris color油油油油油油油油油油油油 HPS1油油油油油油油油油油油油油油 10 q23.1-23.3 Hair/iris color油油油油油油油油油油油油 HPS2油油油油油油油油油油油油油油 10 q24.32 Skin&hair color油油油油油油油油油 MC1R油油油油油油油油油油油油  16 q24.3 Height (Marfan)油油油油油油油油油 MFS 油油油油油油油油油油油油油油  15 q21.1 Hair/iris color油油油油油油油油油油油油 MITF油油油油油油油油油油油油油油 油 3 p12.3-14.1 Hair/iris color油油油油油油油油油油油油 MYO5A油油油油油油油油油 15 q21 Ocular albinism油油油油油油油油油油 OA1油油油油油油油油油油油油油油油  X p22.3 油  Ocular albinism油油油油油油油油油油 OA2油油油油油油油油油油油油油油油  X p11.4-p11.23 OcculoCut.Albinism油油油 OCA2油油油油油油油油油油油油油  15 q11.2-q12 油  Hair/iris color油油油油油油油油油油油油 PMOC油油油油油油油油油油油油 油 2 p23.3 Hair/iris color油油油油油油油油油油油油 RAB27A油油油油油油油油 15 q15-21.1 Hair/iris color油油油油油油油油油油油油 SILV油油油油油油油油油油油油油油油 12 q13-q14  Skin color油油油油油油油油油油油油油油油油油油油 SLC24A5油油油油油油油  15 q21.1  A111T dark to light skin Short Stature油油油油油油油油油油油油油油油 SS油油油油油油油油油油油油油油油油油油油  X&Y p Hair/iris color油油油油油油油油油油油油 TYR油油油油油油油油油油油油油油油 11 q14-q21 Hair/iris color油油油油油油油油油油油油 TYRP1油油油油油油油油油油油 油 9 p23
Genotype  phenotype mappings The medical community finds  genotype  ->  phenotype mappings  Mappings being generated at an explosive rate But also: [Sweeney02]:  Inferring genotype from clinical phenotype through a knowledge based algorithm focuses on pathological phenotypes
Underlying social network
Big picture Attacks against  a wide spectrum of rich, high-dimensional datasets Can  we win the battle? Using technology alone? What if we dont? Is part of it already lost?
Thanks for coming.
Current work Sweeney  exact match Movielens  fuzzy match Netflix  fuzzy match with noise AOL BDK07  match on non-relational data NC07  non-relational data with noise Amazon  fuzzy match with noise on utility oracle Genome  match based on multiple databases Genome  phenotype/genotype mapping
Future work Author identification Combine with typing pattern profiling Oppressive country example Genome reidentification based on observables Underlying social network SAT solver  generic matching

More Related Content

Anonymity

  • 1. Large Scale Threats to Data Anonymity Arvind Narayanan Joint work with Vitaly Shmatikov Kamalika Chaudhuri
  • 2. Anonymity is not cryptography Small keyspace random guessing succeeds with probability 1/N Natural upper bound on N the race is over ! Guess-and-verify paradigm Even quadratic algorithms sometimes feasible! Conventional wisdom relied on computational infeasibility
  • 3. The curse of dimensionality Too much entropy per record How high is high? Try 35,540! k-anonymity breaks down Nearest neigbhor too far Cinematch beats baseline by 1%! Projection to low dimensions loses most of the info
  • 4. Auxiliary information Auxiliary information about people very easy to obtain Unlinkability of user traces unaffordable luxury Yet linking across databases often disastrous Future privacy linkage of profile to identify makes virtual identities impossible
  • 5. Two fallacies Identifying vs. non-identifying attributes All attributes are quasi-identifiers! Simply removing record labels is not sufficient Perturbation makes attackers task harder Note superficial similarity with LPN But non-cryptographic! Reality: re-identification algorithms easily made noise resilient
  • 6. Interactive protocols Severe computational limits Query-execute-analyze cycle Utility required may be non-statistical Database may even be non-relational Privacy for queries Data aggregator not trusted Algorithms in distributed setting not well developed yet
  • 7. Sad realization #2 Privacy usually an afterthought (not important until it affects you) Video privacy act example Privacy vs. utility: Collect/release the data, ask questions later
  • 8. Sweeney linking (exact match) (Anonymous) (Non-anonymous) Hardly secret Probably not secret
  • 9. Collaborative filtering: profiles Each of N users has a preference vector, or a preference profile One attribute for each item Goal: mine this database to predict preferences for new items Can we release an anonymized database of preference vectors?
  • 10. Movielens fuzzy match Hypothetical investigation Frankowski, Cosley, Sen, Terveen, Riedl. Anonymized database of movie ratings Attacker knows small number of approximate preferences Nearest neighbor stats confirmed
  • 11. Netflix fuzzy match with noise Nearest neighbor graph Real attack, Narayanan & Shmatikov ~ 4 movies -> unique re-identification know either ratings or dates approximately one of the data points can be completely wrong Found a couple of our friends Found a couple of users from IMDb
  • 13. Netflixs take on privacy Even if, for example, you knew all your own ratings and their dates you probably couldnt identify them reliably in the data because only a small sample was included (less than one-tenth of our complete dataset) and that data was subject to perturbation . Of course, since you know all your own ratings that really isnt a privacy problem is it? -- Netflix Prize FAQ
  • 15. Netflix contributions Scoring tolerates large amount of noise i M M [ e - 留 |r i - r i | + c e - 硫 |d i - d i | + ] / log #i Verifying deanonymization in absence of oracle [score(max) score(max2)] / std.dev(score) Extract user relationships
  • 16. Netflix customers with distance < 0.15 Could edges reflect real-life relationships? Ratings and dates were ignored
  • 17. Recommenders: stronger attacks Do recommendation systems inherently leak profile? No data release! Theoretical attacks known Textbook systems Deployed, complex systems
  • 18. Social networks Graph of interactions between people Think of phone call graphs Different type of profile Non relational data
  • 19. Backstrom, Dwork, Kleinberg Active and passive attacks Re-identify nodes touched by malicious edges Easy to find graph-structured patterns in large database
  • 20. Narayanan, Chaudhuri Tolerates noise Several attacks where a user can re-identify own node Subgraph isomorphism with several hundred nodes Heuristics involving node labels User knows own degree exactly Modern phones store all calls Who deletes email anymore?
  • 21. Finding yourself N instances of graph isomorphism Use isomorphism-invariant signatures
  • 22.
  • 23. Propagation of node re-identification Surprisingly small number of seeds (6-12) Large fraction of nodes compromised Works even when large fraction (say 80%) of nodes are honest
  • 24.
  • 25.
  • 26. Propagation implementation Social phishing Buddy zoo Skype worm Online addressbook service Competing social network
  • 27. Author identification Basically, a solved problem However, most studies use a small set of authors Not clear how well sample size required scales Combine with typing pattern profiling Possibly deanonymize among millions/billions of users Example: oppressive country
  • 28. Genome anonymity Rich social network ~10^8 bits entropy per record Labeled sample compromises privacy of blood relatives Crossover happens in precise, elegant way Work on admixing populations Story of deanonymization of sperm donor Ease of obtaining auxiliary data or anonymous samples
  • 29. Genome and DNA databases Hapmap entire genome Family tree services 1/800 births from anonymous sperm donor
  • 30. Hapmaps take on privacy The samples are anonymous with regard to individual identity. Samples cannot be connected to individuals, and no personal information is linked to any sample. As an additional safeguard, more samples were collected from each population than were used, so no one knows whether any particular person's DNA is included in the study.
  • 31. Trait油油油油油油油油油油油油油油油油油油 Genes油油油油油油油 Chromosome location Hair/iris color油油油油油油油油油油油油 ASIP油油油油油油油油油油油油油油油 20 q11.2 Hair/iris color油油油油油油油油油油油油 DCT油油油油油油油油油油油油油油油 13 q32 Green/blue iris油油油油油油油油油油油 EYCL1油油油油油油 油油油油 19 p13.1-q13.11 Brown/blue iris油油油油油油油油油油 EYCL3油油油油油油油油油油油 15 q11-q15 * Height 油油油油油油油油油油油油油油油油油油油油油油油油 GH1油油油油油油油油油油油油油油油 17 q22-q24 Height (Laron)油油油油油油油油油油油油 GHR油油油油油油油油油油油油油油油 油 5 p13-p12 Brown/blond hair油油油油油油油 HCL1油油 油油油油油油油油油油 19 p13.1-q13.11 Brown/blond hair油油油油油油油 HCL3 油油油油油油油油油油油油 15 q11-q15 油 * Brown/red hair油油油油油油油油油油油 HCL2 油油油油油油油油油油油 油油 4 q28-q31 Hair/iris color油油油油油油油油油油油油 HPS1油油油油油油油油油油油油油油 10 q23.1-23.3 Hair/iris color油油油油油油油油油油油油 HPS2油油油油油油油油油油油油油油 10 q24.32 Skin&hair color油油油油油油油油油 MC1R油油油油油油油油油油油油 16 q24.3 Height (Marfan)油油油油油油油油油 MFS 油油油油油油油油油油油油油油 15 q21.1 Hair/iris color油油油油油油油油油油油油 MITF油油油油油油油油油油油油油油 油 3 p12.3-14.1 Hair/iris color油油油油油油油油油油油油 MYO5A油油油油油油油油油 15 q21 Ocular albinism油油油油油油油油油油 OA1油油油油油油油油油油油油油油油 X p22.3 油 Ocular albinism油油油油油油油油油油 OA2油油油油油油油油油油油油油油油 X p11.4-p11.23 OcculoCut.Albinism油油油 OCA2油油油油油油油油油油油油油 15 q11.2-q12 油 Hair/iris color油油油油油油油油油油油油 PMOC油油油油油油油油油油油油 油 2 p23.3 Hair/iris color油油油油油油油油油油油油 RAB27A油油油油油油油油 15 q15-21.1 Hair/iris color油油油油油油油油油油油油 SILV油油油油油油油油油油油油油油油 12 q13-q14 Skin color油油油油油油油油油油油油油油油油油油油 SLC24A5油油油油油油油 15 q21.1 A111T dark to light skin Short Stature油油油油油油油油油油油油油油油 SS油油油油油油油油油油油油油油油油油油油 X&Y p Hair/iris color油油油油油油油油油油油油 TYR油油油油油油油油油油油油油油油 11 q14-q21 Hair/iris color油油油油油油油油油油油油 TYRP1油油油油油油油油油油油 油 9 p23
  • 32. Genotype phenotype mappings The medical community finds genotype -> phenotype mappings Mappings being generated at an explosive rate But also: [Sweeney02]: Inferring genotype from clinical phenotype through a knowledge based algorithm focuses on pathological phenotypes
  • 34. Big picture Attacks against a wide spectrum of rich, high-dimensional datasets Can we win the battle? Using technology alone? What if we dont? Is part of it already lost?
  • 36. Current work Sweeney exact match Movielens fuzzy match Netflix fuzzy match with noise AOL BDK07 match on non-relational data NC07 non-relational data with noise Amazon fuzzy match with noise on utility oracle Genome match based on multiple databases Genome phenotype/genotype mapping
  • 37. Future work Author identification Combine with typing pattern profiling Oppressive country example Genome reidentification based on observables Underlying social network SAT solver generic matching