
Hash
A probabilistic approach for big data
Luca Mastrostefano

Who am I?
● Product manager of MyMemory at Translated
● IT background
● Algorithms lover
Luca Mastrostefano
luca@translated.net

Syllabus
Problem                              Use case
Fast and exact search                Databases - Search
Stream filter                        Translated - MyMemory
Counting unique items in a stream    ClickMeter - IPs analysis
Probabilistic search                 Memopal - Search for similar files

Search algorithms
Databases - Fast and exact search
Static, extendible and linear hash indexes

Use case
Sometimes even logarithmic complexity is
too expensive.

B+ tree index
Select/Insert ≅ log_F(# items)
Images from Data Management - Maurizio Lenzerini

Static hash index
Select/Insert ≅ 2 + (# overflow pages)
Directories

Dynamic hash index - Extendible
Select/Insert ≅ 2 + (# overflow pages)
# overflow pages almost constant

Dynamic hash index - Linear
Intuition:
● Avoid the directory to save one memory access.
● Split one bucket at a time: it fits real-time environments!
Select/Insert ≅ 1 + (# overflow pages)
# overflow pages almost constant
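
The linear scheme above can be sketched in memory as follows. This is a minimal illustration, not the layout of any real database: the class name `LinearHashIndex`, the `max_load` trigger and the use of Python lists as stand-ins for disk buckets are all assumptions made for the example.

```python
class LinearHashIndex:
    """In-memory sketch of linear hashing; each list stands in for a disk bucket."""

    def __init__(self, initial_buckets=4, max_load=2.0):
        self.n0 = initial_buckets    # number of buckets at level 0
        self.level = 0               # completed doubling rounds
        self.split = 0               # next bucket to split in this round
        self.max_load = max_load     # average items per bucket that triggers a split
        self.buckets = [[] for _ in range(initial_buckets)]
        self.size = 0

    def _address(self, key):
        # No directory: the bucket number is computed directly from the hash.
        h = hash(key)
        idx = h % (self.n0 << self.level)
        if idx < self.split:         # this bucket was already split in this round
            idx = h % (self.n0 << (self.level + 1))
        return idx

    def get(self, key, default=None):
        for k, v in self.buckets[self._address(key)]:
            if k == key:
                return v
        return default

    def insert(self, key, value):
        bucket = self.buckets[self._address(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))
        self.size += 1
        if self.size / len(self.buckets) > self.max_load:
            self._split_one()

    def _split_one(self):
        # Split exactly one bucket: the one pointed to by self.split.
        old = self.buckets[self.split]
        self.buckets[self.split] = []
        self.buckets.append([])
        self.split += 1
        if self.split == self.n0 << self.level:   # round finished: table has doubled
            self.level += 1
            self.split = 0
        for k, v in old:                           # redistribute with the finer hash
            self.buckets[self._address(k)].append((k, v))


index = LinearHashIndex()
for i in range(1000):
    index.insert(f"key-{i}", i)
```

Each insert splits at most one bucket, so the cost of growing the table is spread over many inserts instead of being paid in one large reorganisation.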

Indexes comparison - Secondary memory accesses
B+ tree index: Select/Insert ≊ Log, 4 accesses ≊ 30 ms
VS
Linear hash index: Select/Insert ≊ const, 1 access ≊ 7 ms
Difference: ≊ 4x in case of billions of entries
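
The access counts in this comparison can be roughly reproduced with a few lines of arithmetic. The fan-out of 200 is an assumed value chosen for illustration and does not come from the slides; the 7 ms per access and the 1.5 billion items do.

```python
import math

def btree_accesses(num_items, fan_out):
    """Approximate page reads for one B+ tree lookup: ceil(log_F(# items))."""
    return math.ceil(math.log(num_items) / math.log(fan_out))

MS_PER_ACCESS = 7            # from the slide: 1 access ≊ 7 ms
items = 1_500_000_000        # order of magnitude used later in the deck
fan_out = 200                # assumed fan-out F, not stated in the slides

btree = btree_accesses(items, fan_out)
linear_hash = 1              # 1 + (# overflow pages), assuming no overflow page

print(f"B+ tree:     {btree} accesses, about {btree * MS_PER_ACCESS} ms")
print(f"Linear hash: {linear_hash} access, about {linear_hash * MS_PER_ACCESS} ms")
```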

Stream filter: x ∈ U ?
Translated - MyMemory
Bloom filter

Use case
The delay introduced by the secondary
memory does not fit an environment in which
milliseconds matter.

Stream filter - Naïve approach
Stream of items → network delay → Hash index (1.5B items, 60+ GB)
Only 5% of the items ∈ Dataset

Bloom filter - Insert
Bit array of length m: 0 0 0 0 0 0 0 0 0 0 0 0 0 0
n items to insert: n1, ..., nn
k hash functions: h1, h2, h3

After inserting n1 with h1, ..., hk:
0 1 0 0 0 0 0 0 1 0 0 0 1 0

After inserting nn with h1, ..., hk:
0 1 1 0 0 1 0 0 1 0 0 1 1 0

Bloom filter - Search
Fixed bit array: 0 1 1 0 0 1 0 0 1 0 0 1 1 0
Items to search for: n, a, b, ...
Same hash functions: h1, ..., hk

Bloom filter - Search [No false negative]
0 1 1 0 0 1 0 0 1 0 0 1 1 0
Item searched: a, hashed with h1, ..., hk
At least one of the bits h1(a), ..., hk(a) is 0:
“a” DOES NOT belong to the set

Bloom filter - Search [True positive]
0 1 1 0 0 1 0 0 1 0 0 1 1 0
Item searched: n, hashed with h1, ..., hk
All the bits h1(n), ..., hk(n) are 1:
“n” MAY belong to the set

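Bloom filter - Search [Possible false positive]
0 1 1 0 0 1 0 0 1 0 0 1 1 0
Item searched: b, hashed with h1, ..., hk
All the bits h1(b), ..., hk(b) are 1, but they may have been set by other items:
“b” MAY belong to the set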
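
The insert and search steps above can be sketched as a small class. This is an illustration under stated assumptions, not the MyMemory implementation: the name `BloomFilter`, the use of SHA-256 and the trick of deriving the k positions from two base hashes (double hashing) are choices made for the example.

```python
import hashlib

class BloomFilter:
    """Bit array of length m probed by k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)   # m bits, all initially 0

    def _positions(self, item):
        # Assumption: the k hash functions are simulated as h1 + i * h2 (mod m),
        # with h1 and h2 taken from a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False: the item was certainly never added (no false negatives).
        # True: the item MAY have been added (false positives are possible).
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bloom = BloomFilter(m=10_000, k=7)
for i in range(100):
    bloom.add(f"segment-{i}")
```

In a stream filter the expensive lookup is issued only when `might_contain` returns True; every item for which it returns False is discarded without touching the index.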

Bloom filter - Analysis
n items to insert, k hash functions, m bits
0 1 1 0 0 1 0 0 1 0 0 1 1 0
The probability of a false positive is:
P ≈ (1 - e^(-k*n/m))^k

Bloom filter - Implementation
n items to insert, k hash functions, m bits
● Optimal number of hash functions: k = (m/n) * ln 2
● Optimal number of bits m for the desired probability p of false positive: m = -n * ln p / (ln 2)^2
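
As a quick check of these sizing rules, the sketch below applies the standard Bloom filter formulas to 1.5 billion items and a 1% false positive target. The function names are hypothetical; the formulas are the usual textbook approximations.

```python
import math

def false_positive_rate(n, k, m):
    """Approximate false positive probability: (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k

def optimal_bits(n, p):
    """Bits needed for n items and target false positive probability p."""
    return math.ceil(-n * math.log(p) / math.log(2) ** 2)

def optimal_hashes(n, m):
    """Number of hash functions that minimises the false positive rate."""
    return max(1, round(m / n * math.log(2)))

n = 1_500_000_000                 # 1.5B items
m = optimal_bits(n, 0.01)         # target: 1% of false positives
k = optimal_hashes(n, m)
rate = false_positive_rate(n, k, m)

print(f"m = {m / 1e9:.1f}B bits ({m / 8 / 1e9:.1f} GB), k = {k}, P = {rate:.4f}")
```

With these inputs the array comes out at roughly 14 billion bits and 7 hash functions.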

Bloom filter - Results
Naïve approach: 60+ GB
VS
Bloom filter: 2 GB (14B bit), 7 hash functions, 1% of false positive

Bloom filter - Results [MyMemory]
Stream of items → Bloom filter (2 GB) → ~5% of connections → Hash index (1.5B items, 60+ GB)

Counting unique items in a
stream
ClickMeter - Number of unique IPs per link
Flajolet - Martin for unique hash counting

Use case
Counting unique elements can be very
costly in terms of memory.

Counting unique items - Naïve approach
One bit per IP address, from 0.0.0.0 to 255.255.255.255:
... 1 1 0 0 1 0 0 1 0 0 1 1 ...
500 MB per link (4B bits array)
5 PB with 10M links
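
The memory figures of the naïve approach follow from simple arithmetic, sketched here assuming one bit per IPv4 address and decimal units for the totals.

```python
bits_per_link = 2 ** 32                    # one bit for every IPv4 address
bytes_per_link = bits_per_link // 8        # 512 MiB, rounded to 500 MB on the slide
links = 10_000_000                         # 10M links

total_bytes = bytes_per_link * links

print(f"{bytes_per_link / 1e6:.0f} MB per link")
print(f"{total_bytes / 1e15:.1f} PB with {links:,} links")
```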

Counting unique items - Flajolet-Martin

Flajolet-Martin
...0 1 0 1 0 1 0 1 0 0 1 0 0 0
P(n trailing zeros) = ?

Flajolet-Martin
...0 1 0 1 0 1 0 1 0 0 1 0 0 0
P(n trailing zeros) = (½)^n
# seen hashes ≅ ?

Flajolet-Martin
...0 1 0 1 0 1 0 1 0 0 1 0 0 0
P(n trailing zeros) = (½)^n
# seen hashes ≅ 2^n
… x x x x x x x x 0 0 0
… x x x x x x x x 0 0 1
… x x x x x x x x 0 1 0
… x x x x x x x x 0 1 1
… x x x x x x x x 1 0 0
… x x x x x x x x 1 0 1
… x x x x x x x x 1 1 0
… x x x x x x x x 1 1 1

Flajolet-Martin
Element   Hash function   Hashed value    Max number of trailing zeros
x1        Hash            ...010011011    0
x2        Hash            ...100101010    1
x1        Hash            ...010011011    1
...
xn        Hash            ...010000000    log2(n)

Flajolet-Martin
Element   Hash functions   Hashed value    Max number of trailing zeros
x1        Hash1            ...010011011    0
          Hash..           ...111001000    3
          Hashk            ...110100001    0
...       ...              ...
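
The two tables above can be turned into a short sketch: keep one "max number of trailing zeros" counter per hash function and estimate the number of unique items as 2 to that maximum. Everything below is an illustration under assumptions: the hash functions are simulated by salting SHA-256 with a seed, and the per-hash estimates are combined with a plain median, which is only one of several possible combination rules.

```python
import hashlib
import statistics

def trailing_zeros(x, width=64):
    """Number of trailing zero bits of x (width if x is 0)."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1

def hashed(item, seed):
    """Hypothetical family of hash functions: SHA-256 salted with a seed."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def fm_estimate(stream, num_hashes=64):
    """Rough estimate of the number of unique items in the stream."""
    max_zeros = [0] * num_hashes          # the only state kept per link
    for item in stream:
        for seed in range(num_hashes):
            zeros = trailing_zeros(hashed(item, seed))
            if zeros > max_zeros[seed]:
                max_zeros[seed] = zeros
    # Each hash function suggests about 2^max_zeros unique items.
    return statistics.median(2 ** r for r in max_zeros)


ips = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]
estimate = fm_estimate(ips)
```

Repeated items never change the counters, so the memory used depends only on the number of hash functions, not on the length of the stream. The estimate is coarse (it moves in powers of two), which is why several hash functions are combined.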

Flajolet-Martin - Results
Naïve approach: 500 MB per link, 5 PB with 10M links
VS
Flajolet-Martin: 1.5 KB per link, 15 GB with 10M links, 2% of error

Probabilistic search
Memopal - Search for similar files
Locality-sensitive hashing & min hashing

Use case
The difference between a petabyte and a
gigabyte index is worth an approximation.

Search - Naïve approach
2 B files
1 PB of index
Slow search

Similarity
Document 1:
Midway upon the journey of our life
I found myself within a forest dark,
For the straightforward pathway had been lost.
Ah me! how hard a thing it is to say
What was this forest savage, rough, and stern,
Which in the very thought renews the fear.
Document 2:
Day was departing, and the embrowned air
Released the animals that are on earth
From their fatigues; and I the only one
Made myself ready to sustain the war,
Both of the way and likewise of the woe,
Which memory that errs not shall retrace.
Are they similar?
Jaccard = (Number of substrings in common) / (Total number of unique substrings)

Similarity
Substrings => Shingles of length S
“Midway upon the journey of our life”
Set of shingles = { ..., “Midway upon the”, “upon the journey”, “the journey of”, ... }
Storage ≅ S * Doc_length * #Docs
Complexity ≅ Doc_length * #Docs
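
Shingling and the Jaccard similarity can be sketched in a few lines. The example assumes word-based shingles of length 3, as in the slide's sample; the function names are hypothetical.

```python
def shingles(text, s=3):
    """Set of all runs of s consecutive words in the text."""
    words = text.split()
    return {" ".join(words[i:i + s]) for i in range(len(words) - s + 1)}

def jaccard(a, b):
    """Shingles in common divided by the total number of unique shingles."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc1 = shingles("Midway upon the journey of our life")
doc2 = shingles("Midway upon the journey of my life")
similarity = jaccard(doc1, doc2)   # 3 shingles in common out of 7 unique ones
```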

Similarity
Fingerprint => 32 bit hash of a shingle
Storage ≅ 4 byte * Doc_length * #Docs
Set of shingles =
…
… 100101101 …,
… 011010000…,
… 110010011 …,
…
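
A fingerprint can be any 32 bit hash of the shingle. The sketch below uses CRC-32 from Python's standard library purely as an example; the slides do not say which hash function was used.

```python
import zlib

def fingerprint(shingle):
    """32 bit hash of a shingle (CRC-32 is an assumed choice)."""
    return zlib.crc32(shingle.encode("utf-8")) & 0xFFFFFFFF

def fingerprints(shingle_set):
    """Replace every shingle with its 4 byte fingerprint."""
    return {fingerprint(s) for s in shingle_set}


fps = fingerprints({"Midway upon the", "upon the journey", "the journey of"})
```

Each shingle now costs 4 bytes regardless of its length, at the price of rare collisions between different shingles.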

Similarity
We need to find a signature Sig(D) of length K so that
if Sig(D1) ~ Sig(D2) then D1 ~ D2
Storage ≅ 4 byte * K * #Docs
Complexity ≅ K * #Docs
With K << Doc_length

MinHash - Signature creation
1. Generate the fingerprints of the documents.
   Doc1: …10101, …01100, …10010, …00111
2. Take a random permutation Hn of the fingerprints.
   Doc1 permuted with Hn: …00111, …01100, …10101, …10010
   Minhash of this permutation: …00111
Define minhash(Hn, Doci) = first fingerprint of Doci after permuting with Hn
Sig(Doci) of length K = [minhash1, minhash2, …, minhashK]

MinHash
Sig(Doc) is a set of K min-hashing fingerprints:
Signature(Doc1) = [… 100101101 …, … 011010000 …, … 110010011 …, … 011100011 …, … 100100001 …, …]
…
Signature(Docn) = [… 100001101 …, … 101010110 …, … 110010011 …, … 010100101 …, … 100100001 …, …]

MinHash
If Sig(D1) ~ Sig(D2) then Doc1 ~ Doc2
X = 1 at each position where the two signatures hold the same fingerprint, 0 otherwise
P(X = 1) = Jaccard(Doc1, Doc2)
∑ X / K ≃ Jaccard(Doc1, Doc2)
Signature(Doc1)    Signature(Doc2)    X
… 100101101 …      … 100001101 …      0
… 011010000 …      … 101010110 …      0
… 110010011 …      … 110010011 …      1
… 011100011 …      … 010100101 …      0
… 100100001 …      … 100100001 …      1
…                  …                  …

MinHash - Implementation
1. Generate the fingerprints of the document.
2. Define K hash functions: h1, h2, ..., hK.
3. Define Sig(Doc) = [h1(Doc), h2(Doc), ..., hK(Doc)], where hi(Doc) is the smallest value of hi over the fingerprints of Doc.
4. Define O = { i | hi(Doc1) = hi(Doc2) }
5. Sim(Doc1, Doc2) = |O| / K ≃ Jaccard(Doc1, Doc2)
Complexity ≅ K * #Docs
With K << Doc_length
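
The five steps above can be sketched as follows. The hash family (a * x + b) mod p, the prime p and the CRC-32 fingerprints are assumptions chosen for the example, not details taken from the slides.

```python
import random
import zlib

PRIME = 4_294_967_311   # a prime just above 2**32, so every fingerprint fits

def make_hash_functions(k, seed=0):
    """K hash functions of the (assumed) form h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]

def signature(fingerprints, hash_functions):
    """Sig(Doc): for each hash function, the smallest hashed fingerprint."""
    return [min((a * f + b) % PRIME for f in fingerprints)
            for a, b in hash_functions]

def similarity(sig1, sig2):
    """|O| / K: fraction of positions where the two signatures agree."""
    matches = sum(1 for x, y in zip(sig1, sig2) if x == y)
    return matches / len(sig1)


hash_functions = make_hash_functions(k=256)
doc1 = {zlib.crc32(str(i).encode()) for i in range(0, 1000)}
doc2 = {zlib.crc32(str(i).encode()) for i in range(500, 1500)}

estimated = similarity(signature(doc1, hash_functions),
                       signature(doc2, hash_functions))
exact = len(doc1 & doc2) / len(doc1 | doc2)
```

Comparing two documents now costs K comparisons, however long the documents are; the estimate gets closer to the exact Jaccard as K grows.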

Locality-Sensitive Hashing
Signature(Doc) = [… 100101101 …, … 011010000 …, … 110010011 …, …]
Divide the signature Sig(Doc) into B bands of R rows each, such that B*R = K:
band 1, band 2, ..., band B (R fingerprints each)
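
The banding step can be sketched like this: every band of a signature becomes a bucket key, and two documents become candidates as soon as they share one bucket. The function names and the toy signatures are hypothetical.

```python
from collections import defaultdict

def band_keys(sig, bands, rows):
    """Split a signature of length bands * rows into one key per band."""
    assert len(sig) == bands * rows
    return [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)]

def candidate_pairs(signatures, bands, rows):
    """Pairs of documents that are identical in at least one band."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for key in band_keys(sig, bands, rows):
            buckets[key].add(doc_id)
    pairs = set()
    for docs in buckets.values():
        ordered = sorted(docs)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs


signatures = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 2, 9, 9, 9, 9],   # same first band as "a"
    "c": [7, 7, 7, 7, 7, 7],   # shares no band with the others
}
pairs = candidate_pairs(signatures, bands=3, rows=2)
```

Only the candidate pairs need a full signature comparison, which is what keeps the search fast when the number of documents is large.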

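Locality-Sensitive Hashing - Analysis
Probability of two documents having at least one band in common: 1 - (1 - j^R)^B, where j is the Jaccard of the documents
● Threshold ≅ (1/B)^(1/R)
[Plot: S-curve of the probability of becoming a candidate (y axis) against the Jaccard of the documents (x axis), for given R and B]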
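
The S-curve and its threshold can be evaluated directly. The values R = 5 and B = 20 below are arbitrary illustrative choices, not the parameters used at Memopal.

```python
def candidate_probability(j, rows, bands):
    """Probability that two documents with Jaccard j share at least one band."""
    return 1 - (1 - j ** rows) ** bands

def threshold(rows, bands):
    """Approximate Jaccard value at which the S-curve rises: (1/B)^(1/R)."""
    return (1 / bands) ** (1 / rows)


R, B = 5, 20   # assumed example values
for j in (0.2, 0.4, 0.6, 0.8):
    print(f"Jaccard {j:.1f} -> P(candidate) = {candidate_probability(j, R, B):.3f}")
print(f"Threshold ≅ {threshold(R, B):.2f}")
```

More rows per band push the threshold up and make the filter stricter; more bands pull it down and make the filter more permissive.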

Locality-Sensitive Hashing - Analysis
Probability of two documents having at least one band in common: 1 - (1 - j^R)^B
● Threshold ≅ (1/B)^(1/R)
[Plot: S-curve of the probability of becoming a candidate against the Jaccard of the documents, with the threshold marked and a legend for the regions:]
● True Positive
● True Negative
● False Positive
● False Negative

Probabilistic search - Results
From:
Storage ≅ Shingle_length * Doc_length * #Docs
Complexity ≅ Doc_length * #Docs
To:
Storage ≅ 4 byte * K * #Docs
Complexity ≅ K * #Docs * p(“candidate”)
With K << Doc_length and p(“candidate”) << 1

Probabilistic search - Results
Naïve approach: 2 B files, 1 PB of index, slow search
VS
Min hash + LSH: 2 B files, 1.5 TB of index, fast search & update

P(|questions| > 0) = 1 - [1 - p(question)]^|audience|
Any questions?
