On a recent safari trip to Tanzania, East Africa, I observed that lions are not at all interested in attacking human visitors. What is the secret behind the safari's (nonexistent) security measures for its visitors? How do we determine the optimum trade-off between enjoying the safari and staying safe? How can we quantify the risk? And ultimately, how can we apply these lessons, together with data anonymization techniques, to Big Data?
Lions, zebras and Big Data Anonymization
2. Data anonymization is the process applied to data to prevent the identification of individuals, making it possible to share and analyze data securely.
3. Disclaimer: The material shared here is personal research and does not represent any organization's policies.
Prof. Khaled El Emam worked on anonymizing the Heritage Health Prize data
5. Yes, we are safe, as long as the lions prefer eating fat, juicy zebras to eating us.
6. The safari rules
1. If you are the [lion], you just need to be faster than the slowest zebra.
2. If you are the [zebra], you need to be able to escape from all the lions.
3. If you are the safari visitors, [...] to the lions & zebras without getting hurt.
9. The data anonymization rules
1. If you are the [hacker], you just need to hack through the weakest link.
2. If you are the [...], you need to be protected from all the hackers.
3. If you are anonymizing the data,
12. Known Knowns
Most users do not care.
Not all data that can be shared
should be shared.
Data policies need updating.
Laws, Standards & Regulations.
Will people abuse their access
rights?
What is the damage if data is compromised?
Motivations of hackers?
Known Unknowns
Unknown Unknowns
Unknown knowns
?
Minimize risk, find out more
Be prepared
What we should already know
Who has official access?
Resources we have?
Value of data?
What are the identifiers?
Sharing the data?
Different data policies?
Laws, Standards & Regulations.
14. Hard Methods (more difficult to analyze)
Hashing
Encryption
Soft Methods (easier to analyze)
Lv1: Remove (---), Reduce (Mr. S), Reclassify (40+yrs), Mask (1234****)
Lv2: Black box, Sampling, Add noise / fake data, Shuffle
Lv3: Breaking big data machine learning
15. Hard Methods (strong security, difficult to analyze, dangerous if cracked)
Hashing
Encryption
Soft Methods (flexible security strength, easier to analyze, anonymized)
Lv1: Remove (---), Reduce (Mr. S), Reclassify (40+yrs), Mask (1234****)
Lv2: Black box, Sampling, Add noise / fake data, Shuffle
Lv3: Breaking big data machine learning
For best results, use a combination of techniques
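As an illustration of the hashing "hard method", here is a minimal Python sketch. It assumes a keyed hash (HMAC-SHA256) from the standard library rather than any tool named in the slides. The key value and the function name `pseudonymize` are hypothetical. A key is used because identifiers drawn from a small, guessable space can otherwise be recovered by hashing every candidate, which is one reading of "dangerous if cracked".

```python
import hashlib
import hmac

# Hypothetical secret key for illustration only; a real key would be
# generated randomly and stored outside the data set.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest (HMAC) of an identifier as hex text."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# The same input always maps to the same token, so records can still be joined.
token_a = pseudonymize("S12345739Y")
token_b = pseudonymize("S12345739Y")
# A different input maps to a different token.
token_c = pseudonymize("S1235930X")
```

The output is stable, so joins and counts still work, but the original values are not visible to the analyst, which is why hashed fields are harder to analyze than the soft methods.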
16. Lv1: RRRM: Quick and dirty
Remove: ID S12345739Y -> ----
Reduce: Mr. Smith -> Mr. S; St 21, XY Road, Bedok -> Bedok
Reclassify: 43 yrs old -> 40+; $1,029,199 income -> $1million+
Mask: 12345678 -> 1234****
But these techniques are not good enough
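The four RRRM steps can be sketched in a few lines of Python. The function names, the ten-year age band, and the choice to keep four visible characters are assumptions made to reproduce the slide's examples, not a prescribed implementation.

```python
def remove(value: str) -> str:
    """Remove: drop the value entirely and leave a placeholder."""
    return "----"


def reduce_name(full_name: str) -> str:
    """Reduce: keep the title and the first letter of the surname."""
    title, surname = full_name.split(maxsplit=1)
    return f"{title} {surname[0]}"


def reduce_address(address: str) -> str:
    """Reduce: keep only the last comma-separated part (the area)."""
    return address.split(",")[-1].strip()


def reclassify_age(age: int, width: int = 10) -> str:
    """Reclassify: replace an exact age with the lower bound of its band."""
    return f"{(age // width) * width}+"


def mask(value: str, visible: int = 4) -> str:
    """Mask: keep the first `visible` characters and star out the rest."""
    return value[:visible] + "*" * (len(value) - visible)


anonymized = {
    "id": remove("S12345739Y"),
    "name": reduce_name("Mr. Smith"),
    "area": reduce_address("St 21, XY Road, Bedok"),
    "age": reclassify_age(43),
    "phone": mask("12345678"),
}
```

Each field is treated in isolation here, which is exactly why the slide calls this level quick and dirty: combinations of the remaining fields can still single a person out.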
17. "There are lots of smokers in the health records, but once you
narrow it down to an anonymous male black smoker born in
1965 who presented at the emergency room with aching
joints, it's actually pretty simple to merge the "anonymous"
record with a different "anonymised" database and out pops
the near-certain identity of the patient." ~ Cory Doctorow,
The Guardian
Multi-variable identification
Big Data is a double-edged sword
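One way to see multi-variable identification at work is to count how many records share each combination of attributes. The Python sketch below uses made-up records and hypothetical field names; a record whose combination occurs only once is unique on those attributes and is therefore a candidate for linking with another database.

```python
from collections import Counter

# Made-up records for illustration; field names are hypothetical.
records = [
    {"sex": "M", "birth_year": 1965, "smoker": True, "complaint": "aching joints"},
    {"sex": "M", "birth_year": 1965, "smoker": True, "complaint": "cough"},
    {"sex": "F", "birth_year": 1965, "smoker": True, "complaint": "cough"},
    {"sex": "M", "birth_year": 1970, "smoker": True, "complaint": "cough"},
    {"sex": "F", "birth_year": 1965, "smoker": True, "complaint": "cough"},
]


def group_sizes(rows, keys):
    """Count how many rows share each combination of the given attributes."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def unique_rows(rows, keys):
    """Return the rows whose attribute combination appears exactly once."""
    sizes = group_sizes(rows, keys)
    return [row for row in rows if sizes[tuple(row[k] for k in keys)] == 1]


# One attribute alone identifies nobody: every record is a smoker.
one_attr = unique_rows(records, ["smoker"])
# Four attributes together single out three of the five records.
all_attrs = unique_rows(records, ["sex", "birth_year", "smoker", "complaint"])
```

The smallest group size is a rough measure of risk: the more attributes are released together, the smaller the groups become.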
18. Lv2: Black Box (No data visibility for data scientist)
Requests -> Black box (Algorithm, Software, System or People; In-house or 3rd Party) -> Summarized Results
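A black box can be as simple as a function that accepts a request and returns only a summary. The Python sketch below is one possible shape, assuming made-up rows, a hypothetical `average_spend` request, and a hypothetical minimum group size below which no answer is given.

```python
from statistics import mean

# Made-up rows held inside the black box; the analyst never sees them.
_PRIVATE_ROWS = [
    {"age": 45, "spend": 120.0},
    {"age": 43, "spend": 80.0},
    {"age": 47, "spend": 100.0},
    {"age": 23, "spend": 60.0},
]

# Hypothetical threshold: refuse to summarize groups smaller than this.
MIN_GROUP_SIZE = 3


def average_spend(min_age: int, max_age: int):
    """Answer a request with a summarized result, or None if the group is too small."""
    matches = [r["spend"] for r in _PRIVATE_ROWS if min_age <= r["age"] <= max_age]
    if len(matches) < MIN_GROUP_SIZE:
        return None
    return mean(matches)


forties = average_spend(40, 49)   # three matching rows, so a summary is returned
twenties = average_spend(20, 29)  # one matching row, so the request is refused
```

Refusing small groups matters because an average over a single row is just that person's value.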
19. Lv2: Sampling (lowers accuracy)
Probability: Simple Random, Systematic, Stratified, Probability Proportional to Size, Cluster
Nonprobability (try not to use these): Convenience, Quota, Purposive
20. Lv2: Sampling (lowers accuracy)
All data
Data Collected
Sample
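Three of the probability sampling methods listed above can be sketched with Python's standard `random` module. The function names, the fixed seeds, and the made-up population are assumptions for illustration.

```python
import random


def simple_random_sample(rows, n, seed=0):
    """Simple random: every row has the same chance of being picked."""
    return random.Random(seed).sample(rows, n)


def systematic_sample(rows, step, start=0):
    """Systematic: take every `step`-th row beginning at `start`."""
    return rows[start::step]


def stratified_sample(rows, key, fraction, seed=0):
    """Stratified: sample the same fraction from each group (stratum)."""
    rng = random.Random(seed)
    strata = {}
    for row in rows:
        strata.setdefault(key(row), []).append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Made-up population: 80 rows in group A and 20 rows in group B.
population = [{"id": i, "group": "A" if i < 80 else "B"} for i in range(100)]

srs = simple_random_sample(population, 10)
systematic = systematic_sample(population, step=10)
stratified = stratified_sample(population, key=lambda r: r["group"], fraction=0.1)
```

Stratified sampling keeps the 80/20 split of the population in the sample, which simple random sampling only does on average.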
22. Lv2: Add noise / fake data (lowers accuracy)
Group visits by the same person together and apply the same amount of noise
Example 1 (noise: +5 days; fake, male Scottish name)
Name: Adam Smith -> David Hume
Visit1: 14/04/13 -> 19/04/13
Visit2: 21/05/13 -> 26/05/13
Visit3: 01/06/13 -> 06/06/13
Example 2 (noise: -5 days)
Name: David Abram -> David Abram
Visit1: 01/02/13 -> 27/01/13
Visit2: 11/02/13 -> 06/02/13
Affects daily/monthly pattern
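The date noise on this slide can be sketched as follows. The sketch assumes dates in `dd/mm/yy` text form, a hypothetical maximum shift of five days, and one random offset per person so that all of a person's visits move together. Replacing names with fake ones is left out.

```python
import random
from datetime import datetime, timedelta

DATE_FORMAT = "%d/%m/%y"


def shift_visits(visits, offset_days):
    """Shift every visit date by the same number of days."""
    shifted = []
    for text in visits:
        moved = datetime.strptime(text, DATE_FORMAT) + timedelta(days=offset_days)
        shifted.append(moved.strftime(DATE_FORMAT))
    return shifted


def add_noise(visits_by_person, max_days=5, seed=0):
    """Draw one random offset per person and apply it to all of their visits."""
    rng = random.Random(seed)
    noisy = {}
    for person, visits in visits_by_person.items():
        offset = rng.randint(-max_days, max_days)
        noisy[person] = shift_visits(visits, offset)
    return noisy


# The two examples from the slide, with the offsets fixed by hand.
plus_five = shift_visits(["14/04/13", "21/05/13", "01/06/13"], 5)
minus_five = shift_visits(["01/02/13", "11/02/13"], -5)
```

Because the offset is shared within a person, the gap between visits is preserved, while day-of-week and month-boundary patterns are disturbed.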
23. Lv2: Shuffle (may break data relationships but retains trend)
Before
Name: Adam Smith
Purchase1: Cabbage
Purchase2: Tomato
Name: David Abram
Purchase1: Bread
Purchase2: Sushi
Name: Emma Goldman
Purchase1: Female Hygiene
Purchase2: Strawberry
After shuffle
Name: Adam Smith (purchases from David)
Purchase1: Bread
Purchase2: Sushi
Name: David Abram (purchases from Adam)
Purchase1: Cabbage
Purchase2: Tomato
Name: Emma Goldman (different gender, cannot shuffle with Adam/David)
Purchase1: Female Hygiene
Purchase2: Strawberry
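A shuffle that respects a grouping such as gender can be sketched like this. The `gender` field is an assumption added for the example (the slide only implies it), and the function name and seed are hypothetical.

```python
import random


def shuffle_within_groups(records, group_key, field, seed=0):
    """Shuffle one field among records that share the same group value."""
    rng = random.Random(seed)
    groups = {}
    for index, record in enumerate(records):
        groups.setdefault(record[group_key], []).append(index)
    result = [dict(record) for record in records]  # leave the input untouched
    for indices in groups.values():
        values = [records[i][field] for i in indices]
        rng.shuffle(values)
        for i, value in zip(indices, values):
            result[i][field] = value
    return result


customers = [
    {"name": "Adam Smith", "gender": "M", "purchases": ["Cabbage", "Tomato"]},
    {"name": "David Abram", "gender": "M", "purchases": ["Bread", "Sushi"]},
    {"name": "Emma Goldman", "gender": "F", "purchases": ["Female Hygiene", "Strawberry"]},
]

shuffled = shuffle_within_groups(customers, "gender", "purchases")
```

Emma is alone in her group, so her basket cannot move. With only two records in the other group, a random shuffle may also leave them unchanged, which is a reminder that small groups give little protection.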
24. Are we safe? RRRM
Before
ID: S1235930X
Name: Adam Smith
Age: 45
Postal: 428102
Visit1: 14/04/13
Visit2: 21/05/13
Visit3: 01/06/13
Purchase1: Cabbage
Purchase2: Tomato
After
ID: -----
Name: Mr. S
Age: 40+yrs
Postal: 428***
Visit1: 14/04/13
Visit2: 21/05/13
Visit3: 01/06/13
Purchase1: Cabbage
Purchase2: Tomato
25. Are we safe? Noise / Fake
Before
ID: S1235930X
Name: Adam Smith
Age: 45
Postal: 428102
Visit1: 14/04/13
Visit2: 21/05/13
Visit3: 01/06/13
Purchase1: Cabbage
Purchase2: Tomato
After
ID: -----
Name: Mr. H
Age: 40+yrs
Postal: 428***
Visit1: 15/04/13
Visit2: 26/05/13
Visit3: 06/06/13
Purchase1: Cabbage
Purchase2: Tomato
26. Are we safe? Shuffle
Before
ID: S1235930X
Name: Adam Smith
Age: 45
Postal: 428102
Visit1: 14/04/13
Visit2: 21/05/13
Visit3: 01/06/13
Purchase1: Cabbage
Purchase2: Tomato
After
ID: -----
Name: Mr. H
Age: 40+yrs
Postal: 428***
Visit1: 15/04/13
Visit2: 26/05/13
Visit3: 06/06/13
Purchase1: Bread
Purchase2: Sushi
27. Encrypted
Are we safe? Before vs. After
Before
ID: S1235930X
Name: Adam Smith
Age: 45
Postal: 428102
Visit1: 14/04/13
Visit2: 21/05/13
Visit3: 01/06/13
Purchase1: Cabbage
Purchase2: Tomato
After
ID: -----
Name: Mr. H
Age: 40+yrs
Postal: 428***
Visit1: 15/04/13
Visit2: 26/05/13
Visit3: 06/06/13
Purchase1: Bread
Purchase2: Sushi
28. Not really safe - Netflix case study
Prof. Arvind Narayanan
29. Not really safe - Netflix case study
Sparse data
Even the most prolific Netflix users have rated only a tiny fraction of Netflix's enormous library. Thus most columns, each of which represents a particular movie, are empty. Therefore, the chance of two or more users giving the same ratings to the same set of movies is quite small, so a user's set of movie ratings can almost uniquely identify that user.
30. Credit: Prof. Arvind Narayanan
Best match:
David
2nd Best match:
Adam
Best match:
Alice
2nd Best match:
Lisa
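The idea of a best match and a second-best match can be sketched as a scoring loop. This is a simplified illustration with made-up ratings and a plain count of agreeing ratings as the score. It is not the algorithm from the work credited on this slide.

```python
def similarity(aux, record):
    """Count the movies where the known rating equals the record's rating."""
    return sum(1 for movie, rating in aux.items() if record.get(movie) == rating)


def rank_matches(aux, dataset):
    """Score every anonymized record against what is known, best first."""
    scores = {user: similarity(aux, ratings) for user, ratings in dataset.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up anonymized ratings; most movies are unrated by most users (sparse).
anonymized = {
    "user_1": {"Movie A": 5, "Movie B": 2, "Movie C": 4},
    "user_2": {"Movie A": 5, "Movie D": 3},
    "user_3": {"Movie E": 1},
}

# What an attacker is assumed to know about the target from another source.
known_about_target = {"Movie A": 5, "Movie B": 2, "Movie C": 4}

ranking = rank_matches(known_about_target, anonymized)
best, second = ranking[0], ranking[1]
gap = best[1] - second[1]  # a large gap suggests the best match is not a coincidence
```

When the data is sparse, the best match tends to stand far ahead of the runner-up, which is what makes the identification convincing.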
33. Lv3: Add trend breaking noise / fake data
Before
Name: David Abram
Visit1: 01/02/13 (Bought item A,B,C)
Visit2: 11/02/13 (Bought item D,E)
After (reorder visits, add noise to date, add fake purchase X)
Name: David Abram
Visit1: 26/01/13 (Bought item A,B,C,X)
Visit2: 05/02/13 (Bought item D,E)
Findings related to fake purchase X and to the sequence of visits will be ignored
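A minimal sketch of this step, assuming visits are held as `(date, items)` pairs and that one fake item is added to one randomly chosen visit. The function name, the seven-day limit, and the seed are hypothetical, and reordering of visits is not shown.

```python
import random
from datetime import datetime, timedelta

DATE_FORMAT = "%d/%m/%y"


def break_trends(visits, fake_items, max_days=7, seed=0):
    """Shift all visit dates by one random offset and add one fake purchase."""
    rng = random.Random(seed)
    offset = rng.randint(-max_days, max_days)
    target = rng.randrange(len(visits))  # the visit that receives the fake item
    noisy = []
    for index, (date_text, items) in enumerate(visits):
        moved = datetime.strptime(date_text, DATE_FORMAT) + timedelta(days=offset)
        new_items = list(items)  # copy so the original basket is not modified
        if index == target:
            new_items.append(rng.choice(fake_items))
        noisy.append((moved.strftime(DATE_FORMAT), new_items))
    return noisy


original_visits = [
    ("01/02/13", ["A", "B", "C"]),
    ("11/02/13", ["D", "E"]),
]
noisy_visits = break_trends(original_visits, fake_items=["X"])
```

Anyone analyzing the result has to know that item X is fake and discard findings that depend on it.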
34. Lv3: Trend breaking shuffle
Before
Name: Adam Smith
Purchase1: Cabbage
Purchase2: Tomato
Name: David Abram
Purchase1: Bread
Purchase2: Sushi
Name: Emma Goldman
Purchase1: Female Hygiene
Purchase2: Strawberry
After shuffle
Name: Adam Smith
Purchase1: Bread
Purchase2: Sushi
Name: Emma Goldman
Purchase1: Cabbage
Purchase2: Tomato
Name: David Abram
Purchase1: Female Hygiene
Purchase2: Strawberry
Gender related findings will be ignored
39. Interesting reads
Anonymizing Health Data
Data protection in the EU: the certainty of uncertainty
Robust De-anonymization of Large Sparse Datasets
Eccentricity Explained
A new way to protect privacy in large-scale genome-wide
association studies
Why 'Anonymous' Data Sometimes Isn't
Has Big Data Made Anonymity Impossible?
Anonymous Netflix Prize data not so anonymous after all
A Data Broker Offers a Peek Behind the Curtain
Editor's Notes
There are known knowns. These are things we know that we know. There are known unknowns. These are things that we know we don't know. But there are also unknown unknowns. These are things we don't know we don't know. - Donald Rumsfeld *He missed out the unknown knowns. These are things we forget or intentionally refuse to acknowledge that we know.
These are generic examples; these paragraphs should be customized for specific domains such as healthcare, cloud, IT and banks. We also need this content to let management understand what they don't know they don't know, so that they may feel less scared.
*Beware of data with multiple, related records in a time series
Sparse data: Even the most prolific Netflix users have rated only a tiny fraction of Netflix's enormous library. Thus most columns, each of which represents a particular movie, are empty. Therefore, the chance of two or more users giving the same ratings to the same set of movies is quite small, so a user's set of movie ratings can almost uniquely identify that user.
This is especially true for obscure movies: http://www.cs.utexas.edu/~shmat/netflix-faq.html
Ultimately, it is our responsibility as people who handle data to learn how to protect the data from the lions, the people who want to watch the world burn. It is important for us to have the skills to ensure that data analytics can continue in a spirit of sharing, learning and gaining insights from one another, and is not obstructed by fear of the bad guys.
It is the responsibility of data scientists to learn data anonymization.