Lions, zebras and Big Data Anonymization
Data anonymization is the process applied to data to prevent the identification of individuals, making it possible to share and analyze the data securely.
Disclaimer: The material shared here is personal research and does not represent any organization's policies.
Prof. Khaled El Emam worked on anonymizing the Heritage Health Prize data.
Are we safe?
Yes, we are safe, as long as the lions prefer eating fat, juicy zebras to eating us.
The safari rules
1. If you are the [lion], you just need to be faster than the slowest zebra.
2. If you are the [zebra], you need to be able to escape from all the lions.
3. If you are the safari visitors, [get close] to the lions & zebras without getting hurt.
Enjoyment vs. Security (trade-off determined by risk appetite)
Max enjoyment: Live with the lions for a week
Max security: Stay at home and watch National Geographic
How can we apply this to Data Anonymization?
The data anonymization rules
1. If you are the [lion], you just need to hack through the weakest link.
2. If you are the [zebra], you need to be protected from all the hackers.
3. If you are anonymizing the data,
Analytical Usefulness vs. Security (trade-off determined by risk appetite)
Max usefulness: Raw data
Max security: Lock up the data, don't do any analysis
Donald's Matrix: Known Knowns, Known Unknowns, Unknown Unknowns, Unknown Knowns
*Important to know
Known Knowns: what we should already know
Who has official access?
Resources we have?
Value of data?
What are the identifiers?
Sharing the data?
Different data policies?
Laws, Standards & Regulations.
Known Unknowns: minimize risk, find out more
Will people abuse their access rights?
What are the damages if data gets compromised?
Motivations of hackers?
Unknown Unknowns: be prepared
?
Unknown Knowns
Most users do not care.
Not all data that can be shared should be shared.
Data policies need updating.
Laws, Standards & Regulations.
What are the techniques for Data Anonymization?
Hard Methods: strong security, difficult to analyze, dangerous if cracked
Hashing
Encryption
Soft Methods: flexible security strength, easier to analyze, anonymized
Lv1: Remove (---), Reduce (Mr. S), Reclassify (40+ yrs), Mask (1234****)
Lv2: Black box, Sampling, Add noise / fake data, Shuffle
Lv3: Breaking big data machine learning
For best results, use a combination of techniques.
Lv1: RRRM: Quick and dirty
Remove: ID S12345739Y -> ----
Reduce: Mr. Smith -> Mr. S; St 21, XY Road, Bedok -> Bedok
Reclassify: 43 yrs old -> 40+; $1,029,199 income -> $1 million+
Mask: 12345678 -> 1234****
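The four RRRM steps above can be sketched in a few lines of Python. This is only an illustration: the field names (`id`, `title`, `surname`, `age`, `number`) and the helper functions are hypothetical, and the 10-year age band and 4 visible characters are assumptions chosen to reproduce the slide's examples.

```python
def remove(value: str) -> str:
    """Remove: drop the value entirely and leave a placeholder."""
    return "----"

def reduce_name(title: str, surname: str) -> str:
    """Reduce: keep the title and only the first letter of the surname."""
    return f"{title} {surname[0]}"

def reclassify_age(age: int, width: int = 10) -> str:
    """Reclassify: replace an exact age with the lower bound of its band."""
    return f"{(age // width) * width}+"

def mask(value: str, visible: int = 4) -> str:
    """Mask: keep the first `visible` characters and star out the rest."""
    return value[:visible] + "*" * max(len(value) - visible, 0)

# Hypothetical record using the example values from the slide.
record = {"id": "S12345739Y", "title": "Mr.", "surname": "Smith",
          "age": 43, "number": "12345678"}

anonymized = {
    "id": remove(record["id"]),
    "name": reduce_name(record["title"], record["surname"]),
    "age": reclassify_age(record["age"]),
    "number": mask(record["number"]),
}
print(anonymized)
```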
But these techniques are not good enough
"There are lots of smokers in the health records, but once you
narrow it down to an anonymous male black smoker born in
1965 who presented at the emergency room with aching
joints, it's actually pretty simple to merge the "anonymous"
record with a different "anonymised" database and out pops
the near-certain identity of the patient." ~ Cory Doctorow,
The Guardian
Multi-variable identification
Big Data is a double-edged sword
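A minimal sketch of the multi-variable identification described in the quote: an "anonymous" record is joined to a second, named dataset on a handful of shared attributes. Both datasets, the attribute names, and the `link` helper are invented for illustration and do not come from any real source.

```python
# Two hypothetical datasets, invented for illustration only.
health_records = [
    {"sex": "M", "birth_year": 1965, "smoker": True, "complaint": "aching joints"},
    {"sex": "F", "birth_year": 1965, "smoker": True, "complaint": "cough"},
    {"sex": "M", "birth_year": 1980, "smoker": False, "complaint": "fever"},
]
public_register = [
    {"name": "Person A", "sex": "M", "birth_year": 1965, "smoker": True},
    {"name": "Person B", "sex": "F", "birth_year": 1965, "smoker": True},
    {"name": "Person C", "sex": "M", "birth_year": 1980, "smoker": False},
    {"name": "Person D", "sex": "M", "birth_year": 1980, "smoker": False},
]

# Attributes that are harmless alone but identifying in combination.
QUASI_IDENTIFIERS = ("sex", "birth_year", "smoker")

def link(records, register, keys=QUASI_IDENTIFIERS):
    """Pair each record with the names of register entries sharing all keys."""
    linked = []
    for rec in records:
        names = [p["name"] for p in register
                 if all(p[k] == rec[k] for k in keys)]
        linked.append((rec, names))
    return linked

results = link(health_records, public_register)
for rec, names in results:
    status = "re-identified" if len(names) == 1 else "ambiguous"
    print(rec["complaint"], names, status)
```

A record is exposed whenever exactly one named entry matches; the third record stays ambiguous only because two people share its attribute combination.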
Lv2: Black Box (no data visibility for the data scientist)
Requests -> Black box (algorithm, software, system or people; in-house or 3rd party) -> Summarized results
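One way to picture the black box is as an object that keeps the raw records private and answers only summary requests. The sketch below is an assumption-laden illustration: the `BlackBox` class, its `average` method, and the `min_group_size` rule (refusing to answer for very small groups) are hypothetical and not part of the original slides.

```python
from statistics import mean

class BlackBox:
    """Holds raw records privately and answers only summary requests."""

    def __init__(self, records, min_group_size=2):
        self._records = list(records)
        # Assumption: refuse to summarize groups smaller than this,
        # so a "summary" cannot describe a single individual.
        self._min_group_size = min_group_size

    def average(self, field, **filters):
        """Mean of `field` over matching records, or None if the group is too small."""
        values = [r[field] for r in self._records
                  if all(r.get(k) == v for k, v in filters.items())]
        if len(values) < self._min_group_size:
            return None
        return mean(values)

# Hypothetical data; the analyst only ever sees the returned summaries.
box = BlackBox([
    {"age": 45, "area": "Bedok", "spend": 30.0},
    {"age": 52, "area": "Bedok", "spend": 50.0},
    {"age": 31, "area": "Other", "spend": 20.0},
])
print(box.average("spend", area="Bedok"))
print(box.average("spend", area="Other"))
```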
Lv2: Sampling (lowers accuracy)
Probability: Simple Random, Systematic, Stratified, Probability Proportional to Size, Cluster
Nonprobability (try not to use these): Convenience, Quota, Purposive
All data -> Data collected -> Sample
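Three of the probability sampling methods listed above can be sketched with the Python standard library. The function names, the `gender` stratum key, and the synthetic rows are hypothetical; a fixed seed is used only so the sketch is repeatable.

```python
import random
from collections import defaultdict

def simple_random_sample(rows, n, seed=0):
    """Simple random: every row has the same chance of being picked."""
    return random.Random(seed).sample(rows, n)

def systematic_sample(rows, step, start=0):
    """Systematic: take every `step`-th row, beginning at `start`."""
    return rows[start::step]

def stratified_sample(rows, key, fraction, seed=0):
    """Stratified: sample the same fraction from each group defined by `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Synthetic rows: 25 "F" and 75 "M".
rows = [{"id": i, "gender": "F" if i % 4 == 0 else "M"} for i in range(100)]

print(len(simple_random_sample(rows, 10)))
print([r["id"] for r in systematic_sample(rows, 10)])
print(len(stratified_sample(rows, "gender", 0.2)))
```

Stratified sampling keeps the group proportions of the full data, which simple random sampling only does on average.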
Lv2 Noise, fake & shuffle within data clusters
Lv2: Add noise / fake data (lowers accuracy)
Before:
Name: Adam Smith
Visit1: 14/04/13
Visit2: 21/05/13
Visit3: 01/06/13
After (noise: +5 days; fake, male Scottish name):
Name: David Hume
Visit1: 19/04/13
Visit2: 26/05/13
Visit3: 06/06/13
Group visits by the same person together and apply the same amount of noise.
Before:
Name: David Abram
Visit1: 01/02/13
Visit2: 11/02/13
After (noise: -5 days):
Name: David Abram
Visit1: 27/01/13
Visit2: 06/02/13
Affects daily/monthly pattern.
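The date noise above can be sketched as follows: all visits of one person are shifted by the same number of days, so the gaps between visits survive. The function names are hypothetical, and the random offset range of up to 5 days is an assumption taken from the slide's examples.

```python
import random
from datetime import datetime, timedelta

DATE_FORMAT = "%d/%m/%y"  # day/month/two-digit year, as on the slides

def shift_visits(visits, days):
    """Shift every visit date of one person by the same number of days."""
    shifted = []
    for visit in visits:
        moved = datetime.strptime(visit, DATE_FORMAT) + timedelta(days=days)
        shifted.append(moved.strftime(DATE_FORMAT))
    return shifted

def add_noise(people, max_days=5, seed=0):
    """Pick one non-zero random offset per person and apply it to all their visits."""
    rng = random.Random(seed)
    offsets = [d for d in range(-max_days, max_days + 1) if d != 0]
    return {name: shift_visits(visits, rng.choice(offsets))
            for name, visits in people.items()}

print(shift_visits(["14/04/13", "21/05/13", "01/06/13"], 5))
print(shift_visits(["01/02/13", "11/02/13"], -5))
```

Because the whole group moves together, interval-based analysis still works, while findings tied to specific days of the week or month are affected.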
Lv2: Shuffle (may break data relationships but retains trend)
Before:
Name: Adam Smith
Purchase1: Cabbage
Purchase2: Tomato
Name: David Abram
Purchase1: Bread
Purchase2: Sushi
Name: Emma Goldman
Purchase1: Female Hygiene
Purchase2: Strawberry
After shuffle:
Name: Adam Smith (purchases from David)
Purchase1: Bread
Purchase2: Sushi
Name: David Abram (purchases from Adam)
Purchase1: Cabbage
Purchase2: Tomato
Name: Emma Goldman (unchanged: different gender, cannot shuffle with Adam/David)
Purchase1: Female Hygiene
Purchase2: Strawberry
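A minimal sketch of shuffling within data clusters, assuming each record carries an explicit `gender` field to define the cluster (the slides only imply it). The function name and record layout are hypothetical.

```python
import random
from collections import defaultdict

def shuffle_within_clusters(records, cluster_key, field, seed=0):
    """Shuffle `field` among records that share the same `cluster_key` value."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for idx, rec in enumerate(records):
        clusters[rec[cluster_key]].append(idx)

    shuffled = [dict(rec) for rec in records]  # leave the input untouched
    for indices in clusters.values():
        values = [records[i][field] for i in indices]
        rng.shuffle(values)
        for i, value in zip(indices, values):
            shuffled[i][field] = value
    return shuffled

records = [
    {"name": "Adam Smith", "gender": "M", "purchases": ["Cabbage", "Tomato"]},
    {"name": "David Abram", "gender": "M", "purchases": ["Bread", "Sushi"]},
    {"name": "Emma Goldman", "gender": "F", "purchases": ["Female Hygiene", "Strawberry"]},
]
result = shuffle_within_clusters(records, "gender", "purchases")
for rec in result:
    print(rec["name"], rec["purchases"])
```

Emma is alone in her cluster, so her purchases cannot move; with only two records in a cluster a random shuffle may also leave them in place, which is one reason larger clusters are preferable.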
Are we safe? RRRM
Before:
ID: S1235930X
Name: Adam Smith
Age: 45
Postal: 428102
Visit1: 14/04/13
Visit2: 21/05/13
Visit3: 01/06/13
Purchase1: Cabbage
Purchase2: Tomato
After:
ID: -----
Name: Mr. S
Age: 40+ yrs
Postal: 428***
Visit1: 14/04/13
Visit2: 21/05/13
Visit3: 01/06/13
Purchase1: Cabbage
Purchase2: Tomato
Are we safe? Noise / Fake
Before:
ID: S1235930X
Name: Adam Smith
Age: 45
Postal: 428102
Visit1: 14/04/13
Visit2: 21/05/13
Visit3: 01/06/13
Purchase1: Cabbage
Purchase2: Tomato
After:
ID: -----
Name: Mr. H
Age: 40+ yrs
Postal: 428***
Visit1: 19/04/13
Visit2: 26/05/13
Visit3: 06/06/13
Purchase1: Cabbage
Purchase2: Tomato
Are we safe? Shuffle
Before:
ID: S1235930X
Name: Adam Smith
Age: 45
Postal: 428102
Visit1: 14/04/13
Visit2: 21/05/13
Visit3: 01/06/13
Purchase1: Cabbage
Purchase2: Tomato
After:
ID: -----
Name: Mr. H
Age: 40+ yrs
Postal: 428***
Visit1: 19/04/13
Visit2: 26/05/13
Visit3: 06/06/13
Purchase1: Bread
Purchase2: Sushi
Encrypted
Are we safe? Before vs. after
Before:
ID: S1235930X
Name: Adam Smith
Age: 45
Postal: 428102
Visit1: 14/04/13
Visit2: 21/05/13
Visit3: 01/06/13
Purchase1: Cabbage
Purchase2: Tomato
After:
ID: -----
Name: Mr. H
Age: 40+ yrs
Postal: 428***
Visit1: 19/04/13
Visit2: 26/05/13
Visit3: 06/06/13
Purchase1: Bread
Purchase2: Sushi
Not really safe - Netflix case study
Prof. Arvind Narayanan
Sparse data
Even the most prolific Netflix users have rated only a tiny fraction of Netflix's enormous library. Thus most columns, each of which represents a particular movie, are empty. Therefore, the chance of two or more users giving the same ratings to the same set of movies is quite small; a user's set of movie ratings can almost uniquely identify that user.
Credit: Prof. Arvind Narayanan
Best match: David; 2nd best match: Adam
Best match: Alice; 2nd best match: Lisa
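The best match versus second-best match comparison can be sketched with a much simplified scoring scheme: count the ratings an attacker already knows that also appear in each candidate's record, then measure how far the best score stands out from the runner-up in units of standard deviation (the "eccentricity" idea from the reading list). The ratings, the exact-match scoring, and the threshold of 1.5 are all assumptions for illustration; the published attack uses weighted, fuzzy matching.

```python
from statistics import pstdev

def similarity(known, candidate):
    """Count the (movie, rating) pairs the attacker knows that the candidate also has."""
    return sum(1 for movie, rating in known.items()
               if candidate.get(movie) == rating)

def best_match(known, dataset, threshold=1.5):
    """Return (best, second best, eccentricity, whether the match is accepted)."""
    scores = sorted(((similarity(known, ratings), user)
                     for user, ratings in dataset.items()), reverse=True)
    (best, best_user), (second, second_user) = scores[0], scores[1]
    spread = pstdev([score for score, _ in scores])
    eccentricity = (best - second) / spread if spread else 0.0
    return best_user, second_user, eccentricity, eccentricity >= threshold

# Hypothetical sparse ratings: most users have rated only a few movies.
dataset = {
    "David": {"m1": 5, "m2": 1, "m3": 4, "m4": 2},
    "Adam": {"m1": 5, "m5": 3},
    "Emma": {"m6": 2, "m7": 4},
    "Lisa": {"m2": 3, "m8": 5},
}
# What the attacker learned elsewhere, e.g. from public reviews.
known = {"m1": 5, "m2": 1, "m3": 4}

print(best_match(known, dataset))
```

If the best match does not stand out clearly from the second best, the attacker cannot tell the candidates apart, so the match is rejected.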
Lv3: Breaking Big Data Machine Learning
Lv3 Noise, fake & shuffle across data clusters
Lv3: Add trend-breaking noise / fake data
Before:
Name: David Abram
Visit1: 01/02/13 (Bought item A,B,C)
Visit2: 11/02/13 (Bought item D,E)
After (reorder visits, add noise to dates, add fake purchase X):
Name: David Abram
Visit1: 26/01/13 (Bought item A,B,C,X)
Visit2: 05/02/13 (Bought item D,E)
Findings related to the fake purchase X and to the sequence of visits will be ignored.
Lv3: Trend-breaking shuffle
Before:
Name: Adam Smith
Purchase1: Cabbage
Purchase2: Tomato
Name: David Abram
Purchase1: Bread
Purchase2: Sushi
Name: Emma Goldman
Purchase1: Female Hygiene
Purchase2: Strawberry
After shuffle:
Name: Adam Smith
Purchase1: Bread
Purchase2: Sushi
Name: Emma Goldman
Purchase1: Cabbage
Purchase2: Tomato
Name: David Abram
Purchase1: Female Hygiene
Purchase2: Strawberry
Gender-related findings will be ignored.
Analytical Usefulness vs. Security (trade-off determined by risk appetite)
Max usefulness: Raw data
Max security: Lock up the data, don't do any analysis
Yes, we are safe, as long as the lions prefer eating fat, juicy zebras to eating us.
Lions, zebras and Big Data Anonymization
Security vs. Analytical Usefulness: point of stupidity
Donald's Matrix: Known Knowns, Known Unknowns, Unknown Unknowns, Unknown Knowns
thiakx@gmail.com
Linkedin: Kai Xin, Thia
Interesting reads
Anonymizing Health Data
Data protection in the EU: the certainty of uncertainty
Robust De-anonymization of Large Sparse Datasets
Eccentricity Explained
A new way to protect privacy in large-scale genome-wide association studies
Why 'Anonymous' Data Sometimes Isn't
Has Big Data Made Anonymity Impossible?
Anonymous Netflix Prize data not so anonymous after all
A Data Broker Offers a Peek Behind the Curtain

Editor's Notes

  1. "There are known knowns. These are things we know that we know. There are known unknowns. These are things that we know we don't know. But there are also unknown unknowns. These are things we don't know we don't know." - Donald Rumsfeld. *He missed out the unknown knowns. These are things we forget or intentionally refuse to acknowledge that we know.
  2. These are generic examples; these paragraphs should be customized for specific domains: healthcare, cloud, IT, banks, etc. We also need this content to let management understand what they don't know they don't know, so they can maybe feel less scared.
  3. *Beware of data with multiple, related records in a time series.
  4. http://www.theguardian.com/technology/blog/2013/jun/05/data-protection-eu-anonymous
  5. Sparse data: Even the most prolific Netflix users have rated only a tiny fraction of Netflix's enormous library. Thus most columns, each of which represents a particular movie, are empty. Therefore, the chance of two or more users giving the same ratings to the same set of movies is quite small; a user's set of movie ratings can almost uniquely identify that user.
  6. Especially for obscure movies: http://www.cs.utexas.edu/~shmat/netflix-faq.html
  7. http://33bits.org/2008/10/03/eccentricity-explained/
  8. Ultimately, it is our responsibility as people who handle data to learn how to protect the data from the lions, the people who want to watch the world burn. It is important for us to have the skills to ensure that data analytics can continue in a spirit of sharing, learning and gaining insights from one another, and not be obstructed by fear of the bad guys.
  9. It is the responsibility of data scientists to learn data anonymization.