Large Scale Threats to Data Anonymity. Arvind Narayanan. Joint work with Vitaly Shmatikov and Kamalika Chaudhuri.

Anonymity is not cryptography. Small "keyspace": random guessing succeeds with probability 1/N. Natural upper bound on N: the race is over! Guess-and-verify paradigm. Even quadratic algorithms are sometimes feasible! Conventional wisdom relied on computational infeasibility.

The curse of dimensionality. Too much entropy per record. How high is high? Try 35,540! k-anonymity breaks down. Nearest neighbor too far. Cinematch beats baseline by 1%! Projection to low dimensions loses most of the info.
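
To make the "k-anonymity breaks down" point concrete, here is a minimal Python sketch. The records and column names are made up for illustration, and `k_anonymity` is a hypothetical helper, not something from the talk. It measures the size of the smallest group of records that agree on a chosen set of attributes.

```python
from collections import Counter

def k_anonymity(records, attributes):
    """Size of the smallest group of records sharing the same
    values on the given attributes (the dataset's k)."""
    groups = Counter(tuple(r[a] for a in attributes) for r in records)
    return min(groups.values())

# Toy records, invented for this example.
records = [
    {"zip": "02139", "sex": "F", "birth_year": 1970, "movie_a": 5, "movie_b": 1},
    {"zip": "02139", "sex": "F", "birth_year": 1970, "movie_a": 2, "movie_b": 4},
    {"zip": "02139", "sex": "M", "birth_year": 1980, "movie_a": 3, "movie_b": 3},
    {"zip": "02139", "sex": "M", "birth_year": 1980, "movie_a": 4, "movie_b": 5},
]

# With three coarse attributes, every record hides in a group of two.
k_low = k_anonymity(records, ["zip", "sex", "birth_year"])

# Treat the ratings as attributes too and every record becomes unique.
k_high = k_anonymity(records, ["zip", "sex", "birth_year", "movie_a", "movie_b"])

print(k_low, k_high)
```

Even two extra columns are enough to isolate every toy record; with thousands of sparse dimensions per record the same effect is far stronger.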

Auxiliary information. Auxiliary information about people is very easy to obtain. Unlinkability of user traces: an unaffordable luxury. Yet linking across databases is often disastrous. Future privacy: linkage of "profile" to identity makes virtual identities impossible.

Two fallacies. Identifying vs. non-identifying attributes. All attributes are quasi-identifiers! Simply removing record labels is not sufficient. Perturbation makes the attacker's task harder. Note superficial similarity with LPN (learning parity with noise). But non-cryptographic! Reality: re-identification algorithms are easily made noise resilient.

Interactive protocols. Severe computational limits. Query-execute-analyze cycle. Utility required may be non-statistical. Database may even be non-relational. Privacy for queries. Data aggregator not trusted. Algorithms in distributed setting not well developed yet.

Sad realization #2. Privacy is usually an afterthought (not important until it affects you). Video privacy act example. Privacy vs. utility: collect/release the data, ask questions later.

Sweeney: linking (exact match). Figure labels: (Anonymous), (Non-anonymous), "Hardly secret", "Probably not secret".
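
Exact-match linking amounts to a join between an "anonymous" table and a non-anonymous one on the attributes they share. The sketch below is only an illustration of that idea: the tables, the column names and the `link_exact` helper are all hypothetical, and only unambiguous matches are reported.

```python
def link_exact(anonymous, identified, keys):
    """Join two tables on shared attributes and return (name, record)
    pairs for anonymous records that match exactly one identified row."""
    index = {}
    for row in identified:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    links = []
    for row in anonymous:
        matches = index.get(tuple(row[k] for k in keys), [])
        if len(matches) == 1:  # ambiguous matches are not re-identifications
            links.append((matches[0]["name"], row))
    return links

# Invented "anonymous" table: no names, but shared attributes remain.
medical = [
    {"zip": "02138", "birth_date": "1945-07-31", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "birth_date": "1962-01-15", "sex": "M", "diagnosis": "asthma"},
]

# Invented non-anonymous table carrying the same attributes plus a name.
voters = [
    {"name": "Alice Example", "zip": "02138", "birth_date": "1945-07-31", "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_date": "1962-01-15", "sex": "M"},
    {"name": "Carl Example", "zip": "02139", "birth_date": "1962-01-15", "sex": "M"},
]

links = link_exact(medical, voters, ["zip", "birth_date", "sex"])
print([(name, rec["diagnosis"]) for name, rec in links])
```

The second medical record matches two voters and is left unlinked, which is the only protection exact matching offers: safety in a crowd that shares every joined attribute.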

Collaborative filtering: profiles. Each of N users has a preference vector, or preference profile. One attribute for each item. Goal: mine this database to predict preferences for new items. Can we release an anonymized database of preference vectors?

Movielens: fuzzy match. Hypothetical investigation. Frankowski, Cosley, Sen, Terveen, Riedl. Anonymized database of movie ratings. Attacker knows a small number of approximate preferences. Nearest neighbor stats confirmed.

Netflix: fuzzy match with noise. Nearest neighbor graph. Real attack, Narayanan & Shmatikov. ~4 movies -> unique re-identification. Know either ratings or dates approximately. One of the data points can be completely wrong. Found a couple of our friends. Found a couple of users from IMDb.

Netflix's take on privacy. "Even if, for example, you knew all your own ratings and their dates you probably couldn't identify them reliably in the data because only a small sample was included (less than one-tenth of our complete dataset) and that data was subject to perturbation. Of course, since you know all your own ratings that really isn't a privacy problem is it?" -- Netflix Prize FAQ

Netflix: contributions. Scoring tolerates large amount of noise: $\sum_{i \in M \cap M'} \left[ e^{-\alpha |r_i - r_i'|} + c\, e^{-\beta |d_i - d_i'|} + \Gamma \right] / \log \#i$. Verifying deanonymization in absence of oracle: [score(max) - score(max2)] / std.dev(score). Extract user relationships.
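
The scoring rule and the oracle-free confidence check on this slide can be sketched in Python. This is a minimal sketch, not the authors' implementation. It assumes that `#i` is the number of subscribers who rated movie i, that dates are plain day numbers, and that the values chosen for alpha, beta, c and gamma are placeholders; the toy database is invented.

```python
import math
from statistics import pstdev

def score(aux, record, support, alpha=1.0, beta=0.1, c=1.0, gamma=0.0):
    """Sum, over movies present in both the auxiliary info and the record,
    of [exp(-alpha*|r - r'|) + c*exp(-beta*|d - d'|) + gamma] / log(#i).
    support[i] must be greater than 1 so that the logarithm is positive."""
    total = 0.0
    for movie, (r_aux, d_aux) in aux.items():
        if movie not in record:
            continue
        r, d = record[movie]
        term = (math.exp(-alpha * abs(r - r_aux))
                + c * math.exp(-beta * abs(d - d_aux))
                + gamma)
        total += term / math.log(support[movie])
    return total

def eccentricity(scores):
    """[score(max) - score(max2)] / std.dev(score)."""
    ranked = sorted(scores, reverse=True)
    sd = pstdev(scores)
    return (ranked[0] - ranked[1]) / sd if sd > 0 else 0.0

# Invented data: movie -> (rating, day number).
database = {
    "u1": {"m1": (5, 100), "m2": (1, 103), "m3": (4, 250)},
    "u2": {"m1": (3, 400), "m2": (4, 20), "m4": (2, 300)},
    "u3": {"m2": (2, 500), "m3": (1, 10), "m4": (5, 90)},
}
support = {"m1": 1000, "m2": 50, "m3": 20, "m4": 500}

# Noisy auxiliary knowledge about the person behind record u1.
aux = {"m1": (4, 102), "m2": (1, 110), "m3": (5, 240)}

scores = {u: score(aux, rec, support) for u, rec in database.items()}
best = max(scores, key=scores.get)
ecc = eccentricity(list(scores.values()))
print(best, round(ecc, 2))
```

Dividing by `log(support)` gives rarely rated movies more weight, and a large eccentricity suggests the top match stands well apart from the runner-up, which is what lets the attacker judge a match without an oracle.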

Netflix customers with distance < 0.15. Could edges reflect real-life relationships? Ratings and dates were ignored.

Recommenders: stronger attacks. Do recommendation systems inherently leak profiles? No data release! Theoretical attacks known. Textbook systems. Deployed, complex systems.

Social networks. Graph of interactions between people. Think of phone call graphs. Different type of profile. Non-relational data.

Backstrom, Dwork, Kleinberg. Active and passive attacks. Re-identify nodes touched by malicious edges. Easy to find graph-structured patterns in large database.

Narayanan, Chaudhuri. Tolerates noise. Several attacks where a user can re-identify own node. Subgraph isomorphism with several hundred nodes. Heuristics involving node labels. User knows own degree exactly. Modern phones store all calls. Who deletes email anymore?

Finding yourself. N instances of graph isomorphism. Use isomorphism-invariant signatures.
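
One simple isomorphism-invariant signature is a node's degree together with the sorted degrees of its neighbors: relabeling the graph cannot change it. The sketch below uses that signature as an assumed example; the slides do not say which signatures were used, and the toy graph, `signature` and `candidates` are hypothetical. It also assumes the user knows their contacts' degrees, which goes beyond knowing their own degree.

```python
def signature(adj, node):
    """Degree of the node plus the sorted degrees of its neighbors.
    Invariant under any relabeling of the graph."""
    return (len(adj[node]), tuple(sorted(len(adj[n]) for n in adj[node])))

def candidates(adj, target_sig):
    """Nodes of the anonymized graph whose signature matches the target."""
    return [v for v in adj if signature(adj, v) == target_sig]

# Invented anonymized graph with edges 0-1, 0-2, 0-3, 1-2, 3-4.
adj = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1},
    3: {0, 4},
    4: {3},
}

# A user with two contacts, of degree 1 and degree 3, is pinned down uniquely.
unique = candidates(adj, (2, (1, 3)))

# A user with two contacts of degree 2 and 3 is still ambiguous.
ambiguous = candidates(adj, (2, (2, 3)))

print(unique, ambiguous)
```

Filtering by signature replaces most of the N isomorphism tests with a cheap lookup; the expensive check is only needed for the few nodes that survive the filter.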

Propagation of node re-identification. Surprisingly small number of seeds (6-12). Large fraction of nodes compromised. Works even when a large fraction (say 80%) of nodes are honest.
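
The slide does not spell out the propagation mechanism, so the following is only a generic seed-and-extend sketch under a stated assumption: the attacker holds an auxiliary graph with known identities and a few seed nodes already matched. An unmatched node is matched when exactly one unused identity is adjacent to the largest number of its already-matched neighbors. The graphs, the names and the `min_common` threshold are all invented.

```python
def propagate(anon, aux, seeds, min_common=2):
    """anon, aux: dict mapping node -> set of neighbors.
    seeds: dict mapping anonymized node -> identity in aux.
    Returns the mapping extended as far as the heuristic allows."""
    mapping = dict(seeds)
    used = set(mapping.values())
    changed = True
    while changed:
        changed = False
        for a in anon:
            if a in mapping:
                continue
            # Identities of a's neighbors that are already re-identified.
            images = {mapping[n] for n in anon[a] if n in mapping}
            # For each unused identity, count how many of those it knows.
            counts = {x: len(images & aux[x]) for x in aux if x not in used}
            if not counts:
                continue
            ranked = sorted(counts.values(), reverse=True)
            runner_up = ranked[1] if len(ranked) > 1 else 0
            # Accept only a clear winner with enough shared neighbors.
            if ranked[0] >= min_common and ranked[0] > runner_up:
                best = max(counts, key=counts.get)
                mapping[a] = best
                used.add(best)
                changed = True
    return mapping

# Invented auxiliary graph with known identities.
aux = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "carol", "dave"},
    "carol": {"alice", "bob", "dave", "erin"},
    "dave": {"bob", "carol", "erin"},
    "erin": {"carol", "dave"},
}
# The same graph after anonymization (labels replaced by numbers).
anon = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4, 5}, 4: {2, 3, 5}, 5: {3, 4}}

mapping = propagate(anon, aux, seeds={1: "alice", 2: "bob"})
print(mapping)
```

Each newly matched node becomes evidence for its neighbors, which is why a handful of seeds can cascade through a large part of the graph.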

Propagation: implementation. Social phishing. Buddy zoo. Skype worm. Online addressbook service. Competing social network.

Author identification. Basically, a solved problem. However, most studies use a small set of authors. Not clear how well the required sample size scales. Combine with typing pattern profiling. Possibly deanonymize among millions/billions of users. Example: oppressive country.

Genome anonymity. Rich social network. ~10^8 bits entropy per record. Labeled sample compromises privacy of blood relatives. Crossover happens in precise, elegant way. Work on admixing populations. Story of deanonymization of sperm donor. Ease of obtaining auxiliary data or anonymous samples.

Genome and DNA databases. Hapmap: entire genome. "Family tree" services. 1/800 births from "anonymous" sperm donor.

Hapmap's take on privacy. The samples are anonymous with regard to individual identity. Samples cannot be connected to individuals, and no personal information is linked to any sample. As an additional safeguard, more samples were collected from each population than were used, so no one knows whether any particular person's DNA is included in the study.

Trait                     Genes     Chromosome location
Hair/iris color           ASIP      20 q11.2
Hair/iris color           DCT       13 q32
Green/blue iris           EYCL1     19 p13.1-q13.11
Brown/blue iris           EYCL3     15 q11-q15 *
Height                    GH1       17 q22-q24
Height (Laron)            GHR       5 p13-p12
Brown/blond hair          HCL1      19 p13.1-q13.11
Brown/blond hair          HCL3      15 q11-q15 *
Brown/red hair            HCL2      4 q28-q31
Hair/iris color           HPS1      10 q23.1-23.3
Hair/iris color           HPS2      10 q24.32
Skin & hair color         MC1R      16 q24.3
Height (Marfan)           MFS       15 q21.1
Hair/iris color           MITF      3 p12.3-14.1
Hair/iris color           MYO5A     15 q21
Ocular albinism           OA1       X p22.3
Ocular albinism           OA2       X p11.4-p11.23
Oculocutaneous albinism   OCA2      15 q11.2-q12
Hair/iris color           PMOC      2 p23.3
Hair/iris color           RAB27A    15 q15-21.1
Hair/iris color           SILV      12 q13-q14
Skin color                SLC24A5   15 q21.1 (A111T, dark to light skin)
Short stature             SS        X&Y p
Hair/iris color           TYR       11 q14-q21
Hair/iris color           TYRP1     9 p23

Genotype-phenotype mappings. The medical community finds genotype -> phenotype mappings. Mappings being generated "at an explosive rate". But also: [Sweeney02]: Inferring genotype from clinical phenotype through a knowledge based algorithm; focuses on pathological phenotypes.

Big picture. Attacks against a wide spectrum of rich, high-dimensional datasets. Can we win the battle? Using technology alone? What if we don't? Is part of it already lost?

Current work. Sweeney: exact match. Movielens: fuzzy match. Netflix: fuzzy match with noise. AOL. BDK07: match on non-relational data. NC07: non-relational data with noise. Amazon: fuzzy match with noise on utility oracle. Genome: match based on multiple databases. Genome: phenotype/genotype mapping.

Future work. Author identification. Combine with typing pattern profiling. Oppressive country example. Genome reidentification based on observables. Underlying social network. SAT solver: generic matching.
