際際滷

際際滷Share a Scribd company logo
Hash
A probabilistic approach for big data
Luca Mastrostefano
Who am I?
 Product manager of MyMemory at Translated
 IT background
 Algorithms lover
Luca Mastrostefano
luca@translated.net
Syllabus
Problem Use case
Fast and exact search Databases - Search
Stream filter Translated - MyMemory
Counting unique items in a stream ClickMeter - IPs analysis
Probabilistic search Memopal - Search for similar files
Search algorithms
Databases - Fast and exact search
Static, extendible and linear hash indexes
Use case
Sometimes also a logarithmic complexity is
too expensive.
B+
tree index
Images from Data Management - Maurizio Lenzerini
Select/Insert  LogF
(# items)
Search - Hash index
Static hash index
Images from Data Management - Maurizio Lenzerini
Select/Insert  2 + (# overflow pages)
Directories
Images from Data Management - Maurizio Lenzerini
Dynamic hash index - Extendible
Select/Insert 
2 + (# overflow pages)
# overflow pages almost constant
Intuition:
 Avoid the directories to save one memory access.
 Split one bucket per time: it fits real-time environments!
Dynamic hash index - Linear
Select/Insert 
1 + (# overflow pages)
# overflow pages almost constant
4x in case of billions of entries
Select/Insert  Log
VS
B+
tree index
Indexes comparison - Secondary memory accesses
Linear hash index
Select/Insert  const
1 access  7 ms4 accesses  30 ms
Stream filter: x  U ?
Translated - MyMemory
Bloom filter
Use case
The delay introduced by the secondary
memory does not fit an environment in which
milliseconds matter.
Stream filter - Na誰ve approach
60+ GB
Hash index (1,5B items)
Network delay
5% item  Dataset
Stream filter - Bloom filter
Bloom filter - Insert
0 0 0 0 0 0 0 0 0 0 0 0 0 0
n1
...
nn
n items to insert
h1
h2
h3
k hash functions
Bit array of length m
Bloom filter - Insert
0 1 0 0 0 0 0 0 1 0 0 0 1 0
h1
h...
hk
n1
Bloom filter - Insert
0 1 1 0 0 1 0 0 1 0 0 1 1 0
h1
h...
hk
nn
Bloom filter - Search
0 1 1 0 0 1 0 0 1 0 0 1 1 0
n
a
b
...
h1
h...
hk
Items to search for
Same hash
functions
Fixed bit array
Bloom filter - Search [No false negative]
0 1 1 0 0 1 0 0 1 0 0 1 1 0
h1
h...
hk
a DOES NOT belong to the
set
a
n
b
...
Bloom filter - Search [True positive]
0 1 1 0 0 1 0 0 1 0 0 1 1 0
h1
h...
hk
n MAY belong to the set
n
b
...
Bloom filter - Search [Possible false positive]
0 1 1 0 0 1 0 0 1 0 0 1 1 0
h1
h...
hk
b
...
b MAY belong to the set
Bloom filter - Analysis
n items to insert
k hash
functions
m bits
0 1 1 0 0 1 0 0 1 0 0 1 1 0
z
...
h1
h2
h3
b
...
h1
h...
hk
The probability of a false
positive is:
P =
Bloom filter - Implementation
n items to insertk hash
functions
m bits
 Optimal number of hash function:
 Optimal number of bit m for the
desired probability p of false positive:
Bloom filter - Results
7 hash functions
2 GB (14B bit)
60+ GB VS
Na誰ve approach Bloom filter
1% of false positive
Bloom filter - Results [MyMemory]
~5% of connections
60+ GB
Hash index (1,5B items)

2 GB
bloom filter
Counting unique items in a
stream
ClickMeter - Number of unique IPs per link
Flajolet - Martin for unique hash counting
Use case
Counting unique elements could be really
costly in terms of memory.
Counting unique items - Na誰ve approach
500 MB per link
(4B bits array)
... 1 1 0 0 1 0 0 1 0 0 1 1 ...
5 PB with 10M links
0.0.0.0 255.255.255.255
Counting unique items -
Flajolet-Martin
Flajolet-Martin
...0 1 0 1 0 1 0 1 0 0 1 0 0 0
P(n trailing zeros) = ?
Flajolet-Martin
...0 1 0 1 0 1 0 1 0 0 1 0 0 0
P(n trailing zeros) = (遜)^n
# seen hashes  ?
 x x x x x x x x 0 0 0
Flajolet-Martin
...0 1 0 1 0 1 0 1 0 0 1 0 0 0
P(n trailing zeros) = (遜)^n
# seen hashes  2^n
 x x x x x x x x 0 0 1
 x x x x x x x x 0 1 0
 x x x x x x x x 0 1 1
 x x x x x x x x 1 0 0
 x x x x x x x x 1 0 1
 x x x x x x x x 1 1 0
 x x x x x x x x 1 1 1
Flajolet-Martin
0Hash ...010011011
Element Hash function Hashed value Max number of trailing zeros
x1
1Hash ...100101010x2
1Hash ...010011011x1
...
Hash ...010000000xn
log2
(n)
Flajolet-Martin
0Hash1
...010011011
Element Hash functions Hashed value Max number of trailing zeros
x1
3Hash..
...111001000
0Hashk
...110100001
...
...
Flajolet-Martin - Results
VS
Na誰ve approach Flajolet-Martin
500 MB per link
5 PB with 10M links
1,5 KB per link
15 GB with 10M links
2% of error
Probabilistic search
Memopal - Search for similar files
Local sensitive hashing & min hashing
Use case
The difference between a petabyte and a
gigabyte index is worth an approximation.
Search - Na誰ve approach
2 B files
1 PB of index
Slow search
Search - Min hash
Day was departing, and the
embrowned air
Released the animals that are
on earth
From their fatigues; and I the
only one
Made myself ready to sustain
the war,
Both of the way and likewise
of the woe,
Which memory that errs not
shall retrace.
Similarity
Midway upon the journey of
our life
I found myself within a forest
dark,
For the straightforward
pathway had been lost.
Ah me! how hard a thing it is
to say
What was this forest savage,
rough, and stern,
Which in the very thought
renews the fear.
Are they similar?
Jaccard =
Number of substrings in common
Total number of unique substrings
Document 1 Document 2
Similarity
Substrings => Shingles of length S
Storage  S * Doc_length * #Docs
Complexity  Doc_length * #Docs
Set of shingles =
...
Midway upon the,
upon the journey,
the journey of,
...
Midway upon the journey of our life
Similarity
Fingerprint => 32 bit hash of a shingle
Storage  4 byte * Doc_length * #Docs
Complexity  Doc_length * #Docs
Set of shingles =

 100101101 ,
 011010000,
 110010011 ,
Similarity
We need to find a signature Sig(D) of
length K so that
if Sig(D1
) ~ Sig(D2
) then D1
~ D2
Storage  4 byte * K * #Docs
Complexity  K * #Docs
With K << Doc_length
MinHash - Signature creation
Doc1
10101
01100
10010
00111
Take a random permutation
of the fingerprints.
Generate the fingerprints
of the documents.
Define minhash(Hn
, Doci
) = First fingerprint of Doci
hashed with
Hn
Sig(Doci
) of length K = [minhashi
, minhash2
, , minhashn
]
Doc1
00111
01100
10101
10010
Minhash of this
permutation
Hn
MinHash
Signature(Doc1
)
 100101101 
 011010000
 110010011 
 011100011 
 100100001 

Sig(Doc) is a set of K min-hashing fingerprints:
Signature(Docn
)
 100001101 
 101010110
 110010011 
 010100101 
 100100001
MinHash
If Sig(D1
) ~ Sig(D2
) then Doc1
~ Doc2
P(X = 1) = Jaccard(Doc1
, Doc2
)
 X / K  Jaccard(Doc1
, Doc2
)
 100101101 
 011010000
 110010011 
 011100011 
 100100001 

 100001101 
 101010110
 110010011 
 010100101 
 100100001 

Signature(Doc1
) Signature(Doc2
)X
1
0
1
0
1
MinHash - Implementation
1. Generate the fingerprints of the document
2. Define K hash functions: h1
, h2
, ....
, hk
.
3. Define Sig(Doc) = [h1
(Doc), h2
(Doc), ..., hk
(Doc)]
4. Define O = { i / hi
(Doc1
) = hi
(Doc2
) }
5. Sim(Doc1
, Doc2
) =  Jaccard(Doc1
, Doc2
)
| O |
K
Storage  4 byte * K * #Docs
Complexity  K * #Docs
With K << Doc_length
Local Sensitive Hashing
Signature(Doc) =
 100101101 
 011010000
 110010011 



Divide the signature Sig(Doc) into B bands of R rows each, such that B*R = K:
band 1
band 2
band ...
band B
} R fingerprints
 Threshold  (1/B)^(1/R)
Local Sensitive Hashing - Analysis
Probability of a document having at least band in common: 1 - (1 - jR
)B
Jaccard of documents
Probability of
becoming a
candidate
S-curve
R
B
 Threshold  (1/B)^(1/R)
 True Positive
 True Negative
 False Positive
 False Negative
Local Sensitive Hashing - Analysis
Probability of a document having at least band in common: 1 - (1 - jR
)B
Jaccard of documents
Probability of
becoming a
candidate
S-curve
R
B
Probabilistic search - Results
Storage  Shingle_length * Doc_length * #Docs
Complexity  Doc_length * #Docs
From:
To:
Storage  4 byte * K * #Docs
Complexity  K * #Docs * p(candidate)
With K << Doc_length and p(candidate) << 1
Probabilistic search - Results
VS
Na誰ve approach Min hash + LSH
2 B files
1 PB of index
Slow search
2 B files
1,5 TB of index
Fast search & update
Thank you
P(|questions| > 0) = 1 - [1 - p(question)]|audience|
Any questions?
Ad

Recommended

Data Governance with JSON Schema
Data Governance with JSON Schema
MongoDB
NOSQL: il rinascimento dei database?
NOSQL: il rinascimento dei database?
Paolo Bernardi
Xz file-format-1.0.4
Xz file-format-1.0.4
Ben Pope
easyXDM
easyXDM
willing8310
Aggregation computation over distributed data streams(the final version)
Aggregation computation over distributed data streams(the final version)
Yueshen Xu
Probabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. Cardinality
Andrii Gakhov
Aggregation computation over distributed data streams
Aggregation computation over distributed data streams
Yueshen Xu
Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014
Raja Chiky
Probabilistic data structure
Probabilistic data structure
Thinh Dang
New zealand bloom filter
New zealand bloom filter
xlight
Unit 5 Streams2.pptx
Unit 5 Streams2.pptx
SonaliAjankar
Hash Functions FTW
Hash Functions FTW
sunnygleason
Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"
NUS-ISS
Bloom Filters: An Introduction
Bloom Filters: An Introduction
IRJET Journal
Probabilistic data structures
Probabilistic data structures
shrinivasvasala
FS-Mod5.pptx
FS-Mod5.pptx
PurushottamPurshi
Probabilistic Data Structures and Approximate Solutions Oleksandr Pryymak
Probabilistic Data Structures and Approximate Solutions Oleksandr Pryymak
PyData
streamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptx
GopiNathVelivela
Data monsters probablistic data structures
Data monsters probablistic data structures
GreenM
RecSplit Minimal Perfect Hashing
RecSplit Minimal Perfect Hashing
Thomas Mueller
hashing in data structures and its applications
hashing in data structures and its applications
manjeshbngowda
Probabilistic Data Structures and Approximate Solutions
Probabilistic Data Structures and Approximate Solutions
Oleksandr Pryymak
3 - Finding similar items
3 - Finding similar items
Viet-Trung TRAN
Hashing
Hashing
amoldkul
Data Mining Lecture_6.pptx
Data Mining Lecture_6.pptx
Subrata Kumer Paul
Bloom filters
Bloom filters
Devesh Maru
On Improving the Performance of Data Leak Prevention using White-list Approach
On Improving the Performance of Data Leak Prevention using White-list Approach
Patrick Nguyen
Bloom filter
Bloom filter
feng lee
Key Requirements to Successfully Implement Generative AI in Edge DevicesOpt...
Key Requirements to Successfully Implement Generative AI in Edge DevicesOpt...
Edge AI and Vision Alliance
A Constitutional Quagmire - Ethical Minefields of AI, Cyber, and Privacy.pdf
A Constitutional Quagmire - Ethical Minefields of AI, Cyber, and Privacy.pdf
Priyanka Aash

More Related Content

Similar to Hash - A probabilistic approach for big data (20)

Probabilistic data structure
Probabilistic data structure
Thinh Dang
New zealand bloom filter
New zealand bloom filter
xlight
Unit 5 Streams2.pptx
Unit 5 Streams2.pptx
SonaliAjankar
Hash Functions FTW
Hash Functions FTW
sunnygleason
Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"
NUS-ISS
Bloom Filters: An Introduction
Bloom Filters: An Introduction
IRJET Journal
Probabilistic data structures
Probabilistic data structures
shrinivasvasala
FS-Mod5.pptx
FS-Mod5.pptx
PurushottamPurshi
Probabilistic Data Structures and Approximate Solutions Oleksandr Pryymak
Probabilistic Data Structures and Approximate Solutions Oleksandr Pryymak
PyData
streamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptx
GopiNathVelivela
Data monsters probablistic data structures
Data monsters probablistic data structures
GreenM
RecSplit Minimal Perfect Hashing
RecSplit Minimal Perfect Hashing
Thomas Mueller
hashing in data structures and its applications
hashing in data structures and its applications
manjeshbngowda
Probabilistic Data Structures and Approximate Solutions
Probabilistic Data Structures and Approximate Solutions
Oleksandr Pryymak
3 - Finding similar items
3 - Finding similar items
Viet-Trung TRAN
Hashing
Hashing
amoldkul
Data Mining Lecture_6.pptx
Data Mining Lecture_6.pptx
Subrata Kumer Paul
Bloom filters
Bloom filters
Devesh Maru
On Improving the Performance of Data Leak Prevention using White-list Approach
On Improving the Performance of Data Leak Prevention using White-list Approach
Patrick Nguyen
Bloom filter
Bloom filter
feng lee
Probabilistic data structure
Probabilistic data structure
Thinh Dang
New zealand bloom filter
New zealand bloom filter
xlight
Unit 5 Streams2.pptx
Unit 5 Streams2.pptx
SonaliAjankar
Hash Functions FTW
Hash Functions FTW
sunnygleason
Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"
NUS-ISS
Bloom Filters: An Introduction
Bloom Filters: An Introduction
IRJET Journal
Probabilistic data structures
Probabilistic data structures
shrinivasvasala
Probabilistic Data Structures and Approximate Solutions Oleksandr Pryymak
Probabilistic Data Structures and Approximate Solutions Oleksandr Pryymak
PyData
streamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptx
GopiNathVelivela
Data monsters probablistic data structures
Data monsters probablistic data structures
GreenM
RecSplit Minimal Perfect Hashing
RecSplit Minimal Perfect Hashing
Thomas Mueller
hashing in data structures and its applications
hashing in data structures and its applications
manjeshbngowda
Probabilistic Data Structures and Approximate Solutions
Probabilistic Data Structures and Approximate Solutions
Oleksandr Pryymak
3 - Finding similar items
3 - Finding similar items
Viet-Trung TRAN
Hashing
Hashing
amoldkul
Bloom filters
Bloom filters
Devesh Maru
On Improving the Performance of Data Leak Prevention using White-list Approach
On Improving the Performance of Data Leak Prevention using White-list Approach
Patrick Nguyen
Bloom filter
Bloom filter
feng lee

Recently uploaded (20)

Key Requirements to Successfully Implement Generative AI in Edge DevicesOpt...
Key Requirements to Successfully Implement Generative AI in Edge DevicesOpt...
Edge AI and Vision Alliance
A Constitutional Quagmire - Ethical Minefields of AI, Cyber, and Privacy.pdf
A Constitutional Quagmire - Ethical Minefields of AI, Cyber, and Privacy.pdf
Priyanka Aash
From Manual to Auto Searching- FME in the Driver's Seat
From Manual to Auto Searching- FME in the Driver's Seat
Safe Software
The Future of Technology: 2025-2125 by Saikat Basu.pdf
The Future of Technology: 2025-2125 by Saikat Basu.pdf
Saikat Basu
FIDO Seminar: Perspectives on Passkeys & Consumer Adoption.pptx
FIDO Seminar: Perspectives on Passkeys & Consumer Adoption.pptx
FIDO Alliance
"How to survive Black Friday: preparing e-commerce for a peak season", Yurii ...
"How to survive Black Friday: preparing e-commerce for a peak season", Yurii ...
Fwdays
OpenPOWER Foundation & Open-Source Core Innovations
OpenPOWER Foundation & Open-Source Core Innovations
IBM
War_And_Cyber_3_Years_Of_Struggle_And_Lessons_For_Global_Security.pdf
War_And_Cyber_3_Years_Of_Struggle_And_Lessons_For_Global_Security.pdf
biswajitbanerjee38
Crypto Super 500 - 14th Report - June2025.pdf
Crypto Super 500 - 14th Report - June2025.pdf
Stephen Perrenod
The Future of Data, AI, and AR: Innovation Inspired by You.pdf
The Future of Data, AI, and AR: Innovation Inspired by You.pdf
Safe Software
Techniques for Automatic Device Identification and Network Assignment.pdf
Techniques for Automatic Device Identification and Network Assignment.pdf
Priyanka Aash
FIDO Seminar: Authentication for a Billion Consumers - Amazon.pptx
FIDO Seminar: Authentication for a Billion Consumers - Amazon.pptx
FIDO Alliance
Security Tips for Enterprise Azure Solutions
Security Tips for Enterprise Azure Solutions
Michele Leroux Bustamante
Coordinated Disclosure for ML - What's Different and What's the Same.pdf
Coordinated Disclosure for ML - What's Different and What's the Same.pdf
Priyanka Aash
FIDO Seminar: New Data: Passkey Adoption in the Workforce.pptx
FIDO Seminar: New Data: Passkey Adoption in the Workforce.pptx
FIDO Alliance
FIDO Seminar: Evolving Landscape of Post-Quantum Cryptography.pptx
FIDO Seminar: Evolving Landscape of Post-Quantum Cryptography.pptx
FIDO Alliance
OWASP Barcelona 2025 Threat Model Library
OWASP Barcelona 2025 Threat Model Library
PetraVukmirovic
You are not excused! How to avoid security blind spots on the way to production
You are not excused! How to avoid security blind spots on the way to production
Michele Leroux Bustamante
Oh, the Possibilities - Balancing Innovation and Risk with Generative AI.pdf
Oh, the Possibilities - Balancing Innovation and Risk with Generative AI.pdf
Priyanka Aash
Securing Account Lifecycles in the Age of Deepfakes.pptx
Securing Account Lifecycles in the Age of Deepfakes.pptx
FIDO Alliance
Key Requirements to Successfully Implement Generative AI in Edge DevicesOpt...
Key Requirements to Successfully Implement Generative AI in Edge DevicesOpt...
Edge AI and Vision Alliance
A Constitutional Quagmire - Ethical Minefields of AI, Cyber, and Privacy.pdf
A Constitutional Quagmire - Ethical Minefields of AI, Cyber, and Privacy.pdf
Priyanka Aash
From Manual to Auto Searching- FME in the Driver's Seat
From Manual to Auto Searching- FME in the Driver's Seat
Safe Software
The Future of Technology: 2025-2125 by Saikat Basu.pdf
The Future of Technology: 2025-2125 by Saikat Basu.pdf
Saikat Basu
FIDO Seminar: Perspectives on Passkeys & Consumer Adoption.pptx
FIDO Seminar: Perspectives on Passkeys & Consumer Adoption.pptx
FIDO Alliance
"How to survive Black Friday: preparing e-commerce for a peak season", Yurii ...
"How to survive Black Friday: preparing e-commerce for a peak season", Yurii ...
Fwdays
OpenPOWER Foundation & Open-Source Core Innovations
OpenPOWER Foundation & Open-Source Core Innovations
IBM
War_And_Cyber_3_Years_Of_Struggle_And_Lessons_For_Global_Security.pdf
War_And_Cyber_3_Years_Of_Struggle_And_Lessons_For_Global_Security.pdf
biswajitbanerjee38
Crypto Super 500 - 14th Report - June2025.pdf
Crypto Super 500 - 14th Report - June2025.pdf
Stephen Perrenod
The Future of Data, AI, and AR: Innovation Inspired by You.pdf
The Future of Data, AI, and AR: Innovation Inspired by You.pdf
Safe Software
Techniques for Automatic Device Identification and Network Assignment.pdf
Techniques for Automatic Device Identification and Network Assignment.pdf
Priyanka Aash
FIDO Seminar: Authentication for a Billion Consumers - Amazon.pptx
FIDO Seminar: Authentication for a Billion Consumers - Amazon.pptx
FIDO Alliance
Security Tips for Enterprise Azure Solutions
Security Tips for Enterprise Azure Solutions
Michele Leroux Bustamante
Coordinated Disclosure for ML - What's Different and What's the Same.pdf
Coordinated Disclosure for ML - What's Different and What's the Same.pdf
Priyanka Aash
FIDO Seminar: New Data: Passkey Adoption in the Workforce.pptx
FIDO Seminar: New Data: Passkey Adoption in the Workforce.pptx
FIDO Alliance
FIDO Seminar: Evolving Landscape of Post-Quantum Cryptography.pptx
FIDO Seminar: Evolving Landscape of Post-Quantum Cryptography.pptx
FIDO Alliance
OWASP Barcelona 2025 Threat Model Library
OWASP Barcelona 2025 Threat Model Library
PetraVukmirovic
You are not excused! How to avoid security blind spots on the way to production
You are not excused! How to avoid security blind spots on the way to production
Michele Leroux Bustamante
Oh, the Possibilities - Balancing Innovation and Risk with Generative AI.pdf
Oh, the Possibilities - Balancing Innovation and Risk with Generative AI.pdf
Priyanka Aash
Securing Account Lifecycles in the Age of Deepfakes.pptx
Securing Account Lifecycles in the Age of Deepfakes.pptx
FIDO Alliance
Ad

Hash - A probabilistic approach for big data

  • 1. Hash A probabilistic approach for big data Luca Mastrostefano
  • 2. Who am I? Product manager of MyMemory at Translated IT background Algorithms lover Luca Mastrostefano luca@translated.net
  • 3. Syllabus Problem Use case Fast and exact search Databases - Search Stream filter Translated - MyMemory Counting unique items in a stream ClickMeter - IPs analysis Probabilistic search Memopal - Search for similar files
  • 4. Search algorithms Databases - Fast and exact search Static, extendible and linear hash indexes
  • 5. Use case Sometimes also a logarithmic complexity is too expensive.
  • 6. B+ tree index Images from Data Management - Maurizio Lenzerini Select/Insert LogF (# items)
  • 8. Static hash index Images from Data Management - Maurizio Lenzerini Select/Insert 2 + (# overflow pages) Directories
  • 9. Images from Data Management - Maurizio Lenzerini Dynamic hash index - Extendible Select/Insert 2 + (# overflow pages) # overflow pages almost constant
  • 10. Intuition: Avoid the directories to save one memory access. Split one bucket per time: it fits real-time environments! Dynamic hash index - Linear Select/Insert 1 + (# overflow pages) # overflow pages almost constant
  • 11. 4x in case of billions of entries Select/Insert Log VS B+ tree index Indexes comparison - Secondary memory accesses Linear hash index Select/Insert const 1 access 7 ms4 accesses 30 ms
  • 12. Stream filter: x U ? Translated - MyMemory Bloom filter
  • 13. Use case The delay introduced by the secondary memory does not fit an environment in which milliseconds matter.
  • 14. Stream filter - Na誰ve approach 60+ GB Hash index (1,5B items) Network delay 5% item Dataset
  • 15. Stream filter - Bloom filter
  • 16. Bloom filter - Insert 0 0 0 0 0 0 0 0 0 0 0 0 0 0 n1 ... nn n items to insert h1 h2 h3 k hash functions Bit array of length m
  • 17. Bloom filter - Insert 0 1 0 0 0 0 0 0 1 0 0 0 1 0 h1 h... hk n1
  • 18. Bloom filter - Insert 0 1 1 0 0 1 0 0 1 0 0 1 1 0 h1 h... hk nn
  • 19. Bloom filter - Search 0 1 1 0 0 1 0 0 1 0 0 1 1 0 n a b ... h1 h... hk Items to search for Same hash functions Fixed bit array
  • 20. Bloom filter - Search [No false negative] 0 1 1 0 0 1 0 0 1 0 0 1 1 0 h1 h... hk a DOES NOT belong to the set a n b ...
  • 21. Bloom filter - Search [True positive] 0 1 1 0 0 1 0 0 1 0 0 1 1 0 h1 h... hk n MAY belong to the set n b ...
  • 22. Bloom filter - Search [Possible false positive] 0 1 1 0 0 1 0 0 1 0 0 1 1 0 h1 h... hk b ... b MAY belong to the set
  • 23. Bloom filter - Analysis n items to insert k hash functions m bits 0 1 1 0 0 1 0 0 1 0 0 1 1 0 z ... h1 h2 h3 b ... h1 h... hk The probability of a false positive is: P =
  • 24. Bloom filter - Implementation n items to insertk hash functions m bits Optimal number of hash function: Optimal number of bit m for the desired probability p of false positive:
  • 25. Bloom filter - Results 7 hash functions 2 GB (14B bit) 60+ GB VS Na誰ve approach Bloom filter 1% of false positive
  • 26. Bloom filter - Results [MyMemory] ~5% of connections 60+ GB Hash index (1,5B items) 2 GB bloom filter
  • 27. Counting unique items in a stream ClickMeter - Number of unique IPs per link Flajolet - Martin for unique hash counting
  • 28. Use case Counting unique elements could be really costly in terms of memory.
  • 29. Counting unique items - Na誰ve approach 500 MB per link (4B bits array) ... 1 1 0 0 1 0 0 1 0 0 1 1 ... 5 PB with 10M links 0.0.0.0 255.255.255.255
  • 30. Counting unique items - Flajolet-Martin
  • 31. Flajolet-Martin ...0 1 0 1 0 1 0 1 0 0 1 0 0 0 P(n trailing zeros) = ?
  • 32. Flajolet-Martin ...0 1 0 1 0 1 0 1 0 0 1 0 0 0 P(n trailing zeros) = (遜)^n # seen hashes ?
  • 33. x x x x x x x x 0 0 0 Flajolet-Martin ...0 1 0 1 0 1 0 1 0 0 1 0 0 0 P(n trailing zeros) = (遜)^n # seen hashes 2^n x x x x x x x x 0 0 1 x x x x x x x x 0 1 0 x x x x x x x x 0 1 1 x x x x x x x x 1 0 0 x x x x x x x x 1 0 1 x x x x x x x x 1 1 0 x x x x x x x x 1 1 1
  • 34. Flajolet-Martin 0Hash ...010011011 Element Hash function Hashed value Max number of trailing zeros x1 1Hash ...100101010x2 1Hash ...010011011x1 ... Hash ...010000000xn log2 (n)
  • 35. Flajolet-Martin 0Hash1 ...010011011 Element Hash functions Hashed value Max number of trailing zeros x1 3Hash.. ...111001000 0Hashk ...110100001 ... ...
  • 36. Flajolet-Martin - Results VS Na誰ve approach Flajolet-Martin 500 MB per link 5 PB with 10M links 1,5 KB per link 15 GB with 10M links 2% of error
  • 37. Probabilistic search Memopal - Search for similar files Local sensitive hashing & min hashing
  • 38. Use case The difference between a petabyte and a gigabyte index is worth an approximation.
  • 39. Search - Na誰ve approach 2 B files 1 PB of index Slow search
  • 40. Search - Min hash
  • 41. Day was departing, and the embrowned air Released the animals that are on earth From their fatigues; and I the only one Made myself ready to sustain the war, Both of the way and likewise of the woe, Which memory that errs not shall retrace. Similarity Midway upon the journey of our life I found myself within a forest dark, For the straightforward pathway had been lost. Ah me! how hard a thing it is to say What was this forest savage, rough, and stern, Which in the very thought renews the fear. Are they similar? Jaccard = Number of substrings in common Total number of unique substrings Document 1 Document 2
  • 42. Similarity Substrings => Shingles of length S Storage S * Doc_length * #Docs Complexity Doc_length * #Docs Set of shingles = ... Midway upon the, upon the journey, the journey of, ... Midway upon the journey of our life
  • 43. Similarity Fingerprint => 32 bit hash of a shingle Storage 4 byte * Doc_length * #Docs Complexity Doc_length * #Docs Set of shingles = 100101101 , 011010000, 110010011 ,
  • 44. Similarity We need to find a signature Sig(D) of length K so that if Sig(D1 ) ~ Sig(D2 ) then D1 ~ D2 Storage 4 byte * K * #Docs Complexity K * #Docs With K << Doc_length
  • 45. MinHash - Signature creation Doc1 10101 01100 10010 00111 Take a random permutation of the fingerprints. Generate the fingerprints of the documents. Define minhash(Hn , Doci ) = First fingerprint of Doci hashed with Hn Sig(Doci ) of length K = [minhashi , minhash2 , , minhashn ] Doc1 00111 01100 10101 10010 Minhash of this permutation Hn
  • 46. MinHash Signature(Doc1 ) 100101101 011010000 110010011 011100011 100100001 Sig(Doc) is a set of K min-hashing fingerprints: Signature(Docn ) 100001101 101010110 110010011 010100101 100100001
  • 47. MinHash If Sig(D1 ) ~ Sig(D2 ) then Doc1 ~ Doc2 P(X = 1) = Jaccard(Doc1 , Doc2 ) X / K Jaccard(Doc1 , Doc2 ) 100101101 011010000 110010011 011100011 100100001 100001101 101010110 110010011 010100101 100100001 Signature(Doc1 ) Signature(Doc2 )X 1 0 1 0 1
  • 48. MinHash - Implementation 1. Generate the fingerprints of the document 2. Define K hash functions: h1 , h2 , .... , hk . 3. Define Sig(Doc) = [h1 (Doc), h2 (Doc), ..., hk (Doc)] 4. Define O = { i / hi (Doc1 ) = hi (Doc2 ) } 5. Sim(Doc1 , Doc2 ) = Jaccard(Doc1 , Doc2 ) | O | K Storage 4 byte * K * #Docs Complexity K * #Docs With K << Doc_length
  • 49. Local Sensitive Hashing Signature(Doc) = 100101101 011010000 110010011 Divide the signature Sig(Doc) into B bands of R rows each, such that B*R = K: band 1 band 2 band ... band B } R fingerprints
  • 50. Threshold (1/B)^(1/R) Local Sensitive Hashing - Analysis Probability of a document having at least band in common: 1 - (1 - jR )B Jaccard of documents Probability of becoming a candidate S-curve R B
  • 51. Threshold (1/B)^(1/R) True Positive True Negative False Positive False Negative Local Sensitive Hashing - Analysis Probability of a document having at least band in common: 1 - (1 - jR )B Jaccard of documents Probability of becoming a candidate S-curve R B
  • 52. Probabilistic search - Results Storage Shingle_length * Doc_length * #Docs Complexity Doc_length * #Docs From: To: Storage 4 byte * K * #Docs Complexity K * #Docs * p(candidate) With K << Doc_length and p(candidate) << 1
  • 53. Probabilistic search - Results VS Na誰ve approach Min hash + LSH 2 B files 1 PB of index Slow search 2 B files 1,5 TB of index Fast search & update
  • 55. P(|questions| > 0) = 1 - [1 - p(question)]|audience| Any questions?