DSH: Data Sensitive Hashing for High-Dimensional k-NN Search
Choi¹, Myung², Lim¹, Song²
¹DataKnow. Lab., ²D&C Lab.
Korea Univ.
Paper: Jinyang Gao, H. V. Jagadish, Wei Lu, Beng Chin Ooi
SIGMOD '14
Motivation
App: Large-Scale Image Search in a Database
• Find similar images in a large database (e.g. Google image search)
image: Kristen Grauman et al.
slide: Yunchao Gong, UNC Chapel Hill, yunchao@cs.unc.edu
Feature Vector? High Dimension?
• Feature Vector: Example
• nudity detection algorithm based on a neural network, by Choi
• image file (png) -> 8 x 8 vector (0, 0, 0, …, 0.3241, 0.00441, …)
• in practice, feature vectors with far more dimensions are used
Image Search and kNN
• The images are represented by a set of d-dimensional feature vectors 𝔻 ⊂ ℝ^d
• For d1, d2 ∈ ℝ^d:
• if Dist(d1, d2) is small, regard d1 and d2 as similar images
• if Dist(d1, d2) is large, regard d1 and d2 as dissimilar images
• Represent the query image Q as a single point q in the space ℝ^d
• query q ∈ ℝ^d
• Finding the k images most similar to Q then becomes a k-NN problem
• Return k-NN(q, 𝔻)
Can the problem be solved with an
R-Tree-based kNN search?
No:
curse of dimensionality
Reality Check
• Curse of dimensionality
• [Qin Lv et al., Image Similarity Search with Compact Data Structures @ CIKM '04]
• poor performance when the number of dimensions is high
Roger Weber et al., A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces @ VLDB '98
Data Sensitive Hashing
• a solution to the approximate k-NN problem in high-dimensional space
• the δ-recall k-NN problem
• Recall = |query result set ∩ kNN objects| / |kNN objects|

Method | Curse of dimensionality | Recall | Affected by data distribution | Underlying technique
Scan | X (none) | 1 | X | N/A
R-Tree-based solution | O (severe) | 1 | △ | index: tree
Locality Sensitive Hashing | △ (mild) | < 1 | O | Hashing + mathematics
Data Sensitive Hashing | △ (mild) | < 1 | △ | Hashing + machine learning

[Diagram: the query's result set overlapping the true kNN objects]
Related Work: LSH
(r1, r2, p1, p2)-sensitive hashing family ℋ
a. if dist(o, q) ≤ r1 then Pr_ℋ[h(q) = h(o)] ≥ p1
b. if dist(o, q) ≥ r2 then Pr_ℋ[h(q) = h(o)] ≤ p2
Generating functions (randomly extract functions from ℋ):
h11, h12, …, h1m → g1
h21, h22, …, h2m → g2
…
hl1, hl2, …, hlm → gl
Locality Sensitive Hashing
• We have to solve the kNN problem in a 100-dimensional real space (ℝ^100).
• What if!?
• similar points collide with each other, and
• dissimilar points do not collide:
• what if such an h_ideal: ℝ^100 → ℤ+ existed?
[Diagram: query point and its nearby points falling into the same bucket]
However, no such ideal
function exists
Formally,
(r1, r2, p1, p2)-sensitive hashing family ℋ
a. if dist(o, q) ≤ r1 then Pr_ℋ[h(q) = h(o)] ≥ p1
b. if dist(o, q) ≥ r2 then Pr_ℋ[h(q) = h(o)] ≤ p2
• informally,
a. if (the two points are similar) then the probability that their hash values are equal should be high.
b. if (the two points are not similar) then the probability that their hash values are equal should be low.
• Intuitively,
• what if we could build a hash function with p1 = 1 and p2 = 0?
• but no such ideal function exists
• Challenging
• deriving ℋ at all is mathematically hard!
• even when it can be derived, p1 is usually low and p2 is usually high
Problem 1:
ℋ can be derived,
but p2 is too high (it should be low!)
Random projection (see the backup slides)
• Formally
a. if dist(o, q) ≤ r1 then Pr_ℋ[h(q) = h(o)] ≥ p1
b. if dist(o, q) ≥ r2 then Pr_ℋ[h(q) = h(o)] ≤ p2
slide: Yunchao Gong, UNC Chapel Hill, yunchao@cs.unc.edu
Problem 1:
ℋ can be derived,
but p2 is too high (it should be low!)
Solution 1:
concatenate several (m) hash functions
and use them together!
[Diagram: a random projection splitting the space into buckets 0 and 1]
m-concatenation
• let g(x) = (h1(x), h2(x), …, hm(x))
• For two distant points q, p:
• Pr_{g∈𝒢}[g(q) = g(p)] ≤ (Pr_{h∈ℋ}[h(q) = h(p)])^m ≤ p2^m ≪ p2
[Diagram: three random projections, each labelling a point 0 or 1, so every point gets a 3-bit bucket code such as 001, 100, 101, 111 (Fergus et al.)]
slide: Yunchao Gong, UNC Chapel Hill, yunchao@cs.unc.edu
Problem 1:
ℋ can be derived,
but p2 is too high (it should be low!)
Solution 1:
concatenate several (m) hash functions
and use them together!
Effect: fewer false positives:
for two dissimilar points,
Pr_{g∈𝒢}[g(q) = g(p)] ≤ p2^m ≪ p2
Random projection
• Even if p1 is as high as 0.8,
• with m = 5,
• for two similar points, Pr_{g∈𝒢}[g(q) = g(p)] ≥ 0.8^5 ≈ 0.33
• That is, if the hash table is built with a single g,
• and there are 100 points very similar to the query point q,
• we are only guaranteed to find about 33 of them
• → low recall!
• With l = 5 tables, 1 − (1 − p1^m)^l ≥ 0.86,
• so on average 86 or more of them can be found (see the sketch below)
slide: Yunchao Gong, UNC Chapel Hill, yunchao@cs.unc.edu
Problem 1:
ℋ can be derived,
but p2 is too high (it should be low!)
Solution 1:
concatenated several (m) hash functions
and used them together!
Effect: fewer false positives:
for two dissimilar points,
Pr_{g∈𝒢}[g(q) = g(p)] ≤ p2^m ≪ p2
Side effect: more false negatives:
for two similar points,
Pr_{g∈𝒢}[g(q) = g(p)] is only guaranteed to be ≥ p1^m, and p1^m ≪ p1
Problem 2:
in pushing p2^m down,
p1^m also went down (it should be high!)
Solution 2:
use several (l) functions g and
find the k-NN within the union of their buckets!
Effect: high recall,
i.e., a recall of 1 − (1 − p1^m)^l
can be achieved
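A minimal sketch that re-derives the two numbers quoted above (0.33 and 0.86) from these formulas, using only the slide's example values p1 = 0.8, m = 5, l = 5:

// Re-derives the collision/recall numbers used on this slide:
// single-table collision probability p1^m and multi-table recall 1 - (1 - p1^m)^l.
class LshRecallSketch {
    public static void main(String[] args) {
        double p1 = 0.8;   // per-function collision probability for similar points
        int m = 5;         // functions concatenated into one g
        int l = 5;         // number of hash tables / compound functions g

        double singleTable = Math.pow(p1, m);              // ≈ 0.33
        double recall = 1 - Math.pow(1 - singleTable, l);  // ≈ 0.86

        System.out.printf("Pr[g(q)=g(p)] for similar points: %.2f%n", singleTable);
        System.out.printf("recall with %d tables: %.2f%n", l, recall);
    }
}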
Structure
• LSH
• a set of hash tables {Hi | 1 ≤ i ≤ l}
• a hash function gi: ℝ^100 → {0,1}^m for each Hi
• for example, m = 6, l = 26 (a layout sketch follows below)
[Diagram: hash tables H1, H2, …, H26, each keyed by the 6-bit codes 000000–111111 and filled via g1, g2, …, g26]
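A minimal Java sketch of this layout; HashFunction stands in for a concrete gi (a random projection in LSH, a learned function in DSH) and is an assumption of this sketch, not something the slides define:

import java.util.*;

// Sketch of the l hash tables keyed by m-bit codes, as on this slide.
interface HashFunction {
    boolean[] code(double[] point);        // g_i: R^d -> {0,1}^m
}

class HashTableIndex {
    private final List<HashFunction> gs;                       // g_1 ... g_l
    private final List<Map<String, List<double[]>>> tables;    // H_1 ... H_l

    HashTableIndex(List<HashFunction> gs) {
        this.gs = gs;
        this.tables = new ArrayList<>();
        for (int i = 0; i < gs.size(); i++) tables.add(new HashMap<>());
    }

    private static String key(boolean[] bits) {
        StringBuilder sb = new StringBuilder();
        for (boolean b : bits) sb.append(b ? '1' : '0');
        return sb.toString();                                  // e.g. "000000" ... "111111"
    }

    // Insert one data point into every table H_i under key g_i(o).
    void put(double[] o) {
        for (int i = 0; i < gs.size(); i++) {
            tables.get(i).computeIfAbsent(key(gs.get(i).code(o)), k -> new ArrayList<>()).add(o);
        }
    }

    // Bucket of table H_i for query q, i.e. Hi.get(q) on the later slides.
    List<double[]> bucket(int i, double[] q) {
        return tables.get(i).getOrDefault(key(gs.get(i).code(q)), Collections.emptyList());
    }
}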
Processing: illustrated
• Query point q = (984.29, 946.23, …, 848.21)
• Processing (a sketch of both steps follows below)
• Step 1. Candidate set C = ⋃_{i=1..26} Hi.get(q)
• Step 2. return k_Nearest_Neighbors(q) in C
• linear search
[Diagram: q is hashed by g1, g2, …, g26 into one bucket of each of H1, H2, …, H26]
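A sketch of the two processing steps above; the per-table buckets Hi.get(q) are passed in as plain lists, and the linear scan uses Euclidean distance, which the slides do not state explicitly and is assumed here:

import java.util.*;

class QueryProcessingSketch {
    // Step 1: C = union of the buckets H1.get(q), ..., H26.get(q).
    // Step 2: linear scan over C, return the k points nearest to q.
    static List<double[]> kNearestNeighbors(List<List<double[]>> bucketsForQ, double[] q, int k) {
        Set<double[]> candidateSet = new LinkedHashSet<>();    // candidate set C
        for (List<double[]> bucket : bucketsForQ) candidateSet.addAll(bucket);

        List<double[]> c = new ArrayList<>(candidateSet);
        c.sort(Comparator.comparingDouble((double[] o) -> dist(o, q)));  // linear search over C
        return c.subList(0, Math.min(k, c.size()));
    }

    static double dist(double[] a, double[] b) {               // Euclidean distance (assumed metric)
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}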
Formally,
(r1, r2, p1, p2)-sensitive hashing family ℋ
a. if dist(o, q) ≤ r1 then Pr_ℋ[h(q) = h(o)] ≥ p1
b. if dist(o, q) ≥ r2 then Pr_ℋ[h(q) = h(o)] ≤ p2
Generating functions (randomly extract functions from ℋ):
h11, h12, …, h1m → g1
h21, h22, …, h2m → g2
…
hl1, hl2, …, hlm → gl
Traditional LSH technique:
① Derive ℋ mathematically
② Prove that a. and b. hold for an arbitrary h ∈ ℋ w.r.t. the parameters r1, r2
③ Randomly extract functions and build the hash tables.
In DSH (Data Sensitive Hashing):
① Learn h by adaptive boosting and set ℋ = ℋ ∪ {h}
② If ℋ is not yet sufficient to guarantee that a. and b. hold for ∀h ∈ ℋ, go to ①
③ Randomly extract functions and build the hash tables.
LSH VS DSH
Traditional LSH technique:
① Derive ℋ mathematically
② Prove that a. and b. hold for an arbitrary h ∈ ℋ w.r.t. the parameters r1, r2
③ Randomly extract functions and build the hash tables.
In DSH (Data Sensitive Hashing):
① Learn h by adaptive boosting and set ℋ = ℋ ∪ {h}
② If ℋ is not yet sufficient to guarantee that a. and b. hold for ∀h ∈ ℋ, go to ①
③ Randomly extract functions and build the hash tables.

Method | Considers the data distribution? | Underlying technique
Locality Sensitive Hashing | X (it assumes a uniform distribution from the outset, for ②) | Hashing + mathematics
Data Sensitive Hashing | O (h is drawn directly from the target data distribution) | Hashing + machine learning
LSH VS DSH 2
Method | Sensitive to what? | Underlying technique
Locality Sensitive Hashing | hashing that is sensitive to r1, r2 | Hashing + mathematics
Data Sensitive Hashing | hashing that is sensitive to the data (k-NN and non-ck-NN pairs) | Hashing + machine learning
DSH: demonstration
Example: Data Set
• 100-dimensional data set D
• |D| = 100
• 10 clusters
Build DSH for D
• DSH dsh = new DSH(10, 1.1, 0.7, 0.6, 0.4, querySet, dataSet);

Parameter | Value
k (k-NN) | 10
α (learning rate) | 1.1
δ (lower bound of recall) | 70%
p1 | 0.6
p2 | 0.4
Query Set | D
Data Set | D
Structure
• DSH
• a set of hash tables {Hi | 1 ≤ i ≤ l}
• a hash function gi: ℝ^100 → {0,1}^m for each Hi
• for example, m = 6, l = 26
[Diagram: hash tables H1, H2, …, H26, each keyed by the 6-bit codes 000000–111111 and filled via g1, g2, …, g26]
Query Example
• res = dsh.k_Nearest_Neighbor(q = new Point(984.29, 946.23, ..., 848.21));
• returns the 10 approximate NN objects for the given point q
• DSH's property:
• the result set must include at least 70% of the exact 10-NN objects
• Result:
Query point q: (984.29, 946.23, ..., 848.21)
10-aNN of q (recall: 100%)
Processing: dsh.k_Nearest_Neighbor(q)
• Query point q = (984.29, 946.23, …, 848.21)
• Processing
• Step 1. Candidate set C = ⋃_{i=1..26} Hi.get(q)
• Step 2. return k_Nearest_Neighbors(q) in C
• linear search
[Diagram: q is hashed by g1, g2, …, g26 into one bucket of each of H1, H2, …, H26]
Hi.get(q)
• Query point q = (984.29, 946.23, …, 848.21)
• g1(q) = (h11(q), h12(q), …, h16(q)) = (1, 1, 1, 0, 1, 0)
• H1.get(q) = H1[g1(q)] = H1[111010₂] = H1[58₁₀] (key computation sketch below)
[Diagram: table H1 with the bucket at key 111010 highlighted]
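A tiny sketch of this key computation; the bit order (most significant bit first) and the decimal conversion follow the slide, while the class and method names are illustrative:

class BucketKey {
    // Turn g1(q) = (1,1,1,0,1,0) into the table key 111010₂ = 58₁₀, as on this slide.
    static int toBucketIndex(int[] bits) {
        int key = 0;
        for (int b : bits) key = (key << 1) | b;   // most significant bit first
        return key;
    }

    public static void main(String[] args) {
        int[] g1q = {1, 1, 1, 0, 1, 0};
        System.out.println(toBucketIndex(g1q));    // prints 58
    }
}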
Processing: dsh.k_Nearest_Neighbor(q)
• for each Hi
• H1.get(q) = {Data(id = 93–98, 100–102)}
• H2.get(q) = ...
• ...
• H26.get(q) = ...
• Candidate set C = ⋃_{i=1..26} Hi.get(q)
• dsh.k_Nearest_Neighbor(q)
• = k_Nearest_Neighbors(q) in C
Processing: dsh.k_Nearest_Neighbor(q)
• Candidate set C = {data93, data94, …}
• |C| = 28
• result <- find k-NN(q) in C
• dsh.k_Nearest_Neighbor(q)
• return result
Query point q = (984.29, 946.23, …, 848.21)ᵀ
How to build DSH for D?
Build DSH for D
• Step 1. Generate ℋ, the Data Sensitive Hashing family (Chapters 3–4)
• Step 2. Generate the compound hash functions by randomly extracting hash functions from ℋ (generating functions):
h11, h12, …, h1m → g1
h21, h22, …, h2m → g2
…
hl1, hl2, …, hlm → gl
• Step 3. for each gl,
• initialize a hash table Tl for gl (<key, value> = <integer array, data>)
• for each o ∈ D, Tl.put(gl(o), o)
(A build sketch for Steps 2–3 follows below.)
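A minimal sketch of Steps 2 and 3, assuming the learned family ℋ from Step 1 is already available as a list of 0/1-valued functions; all names here are illustrative, not the paper's code:

import java.util.*;
import java.util.function.Function;

class DshBuildSketch {
    // Step 2: draw m functions per table from the learned family H, giving g_1 ... g_l.
    static List<List<Function<double[], Integer>>> drawCompoundFunctions(
            List<Function<double[], Integer>> family, int l, int m, Random rnd) {
        List<List<Function<double[], Integer>>> gs = new ArrayList<>();
        for (int i = 0; i < l; i++) {
            List<Function<double[], Integer>> g = new ArrayList<>();
            for (int j = 0; j < m; j++) g.add(family.get(rnd.nextInt(family.size())));
            gs.add(g);
        }
        return gs;
    }

    // Step 3: one table T_l per g_l, keyed by the concatenated bits g_l(o).
    static List<Map<String, List<double[]>>> buildTables(
            List<List<Function<double[], Integer>>> gs, List<double[]> data) {
        List<Map<String, List<double[]>>> tables = new ArrayList<>();
        for (List<Function<double[], Integer>> g : gs) {
            Map<String, List<double[]>> t = new HashMap<>();
            for (double[] o : data) {
                StringBuilder key = new StringBuilder();
                for (Function<double[], Integer> h : g) key.append(h.apply(o));
                t.computeIfAbsent(key.toString(), k -> new ArrayList<>()).add(o);  // T_l.put(g_l(o), o)
            }
            tables.add(t);
        }
        return tables;
    }
}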
Adaptive Boosting: Principle
a Weak Classifier
• a weak classifier φ(<qi, oj>) is a function
• Input: a <query qi, data oj> pair
• Desired output:
• 0, if oj ∈ kNN(qi)
• 1, if oj ∉ ckNN(qi)
Examples (an illustrative sketch follows below):
• weak classifier, kNN pair <qi, oj> → 0 (correct)
• weak classifier, non-ckNN pair <qi, oj> → 0 (incorrect)
• weak classifier, kNN pair <qi, oj> → 1 (incorrect)
• weak classifier, non-ckNN pair <qi, oj> → 1 (correct)
note:
a weak classifier may produce
many incorrect results
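Purely for illustration, a hypothetical weak classifier whose 0/1 output contract matches this slide; the thresholded random projection inside it is just one plausible choice and is not taken from the paper:

import java.util.Random;

// Illustrative weak classifier: projects both points of the pair onto a random
// direction and outputs 0 if they land within a threshold of each other, else 1.
class ProjectionWeakClassifier {
    final double[] direction;   // random projection vector (assumption, not from the slides)
    final double threshold;

    ProjectionWeakClassifier(int dim, double threshold, Random rnd) {
        this.direction = new double[dim];
        for (int i = 0; i < dim; i++) direction[i] = rnd.nextGaussian();
        this.threshold = threshold;
    }

    // 0: predicted to be a kNN pair, 1: predicted to be a non-ckNN pair.
    int classify(double[] q, double[] o) {
        double pq = 0, po = 0;
        for (int i = 0; i < direction.length; i++) {
            pq += direction[i] * q[i];
            po += direction[i] * o[i];
        }
        return Math.abs(pq - po) <= threshold ? 0 : 1;
    }
}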
Adaptive Boosting
• Build a strong classifier by combining several weak classifiers (a boosting-loop sketch follows below)
[Diagram: boosting rounds.
1st round: the query-data pair (<qi, oj>) set goes through the weak-classifier trainer and a test, producing weak classifier 1 and a split into well classified and badly classified pairs; the badly classified pairs are fed back.
2nd round: weak classifier 2 is trained on the fed-back pair set, again splitting it into well classified and badly classified pairs, which are fed back once more.
3rd round: weak classifier 3 is trained, leaving the pairs well classified.]
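A hedged sketch of such a boosting loop using the textbook AdaBoost reweighting rule; DSH's actual training procedure (Chapters 3–4 of the paper) differs in its details, so treat the trainer, the weight update, and the stopping rule here as assumptions:

import java.util.*;

// Illustrative AdaBoost-style loop over query-data pairs, in the spirit of this slide.
// Pair labels: 0 = kNN pair, 1 = non-ckNN pair.
class BoostingSketch {
    interface Weak { int classify(double[] q, double[] o); }
    interface Trainer { Weak train(List<double[][]> pairs, List<Integer> labels, double[] weights); }

    static List<Weak> boost(Trainer trainer, List<double[][]> pairs, List<Integer> labels, int rounds) {
        int n = pairs.size();
        double[] w = new double[n];
        Arrays.fill(w, 1.0 / n);                          // start with uniform pair weights
        List<Weak> ensemble = new ArrayList<>();
        for (int t = 0; t < rounds; t++) {
            Weak h = trainer.train(pairs, labels, w);     // "weak classifier trainer" box
            double err = 0;
            boolean[] wrong = new boolean[n];
            for (int i = 0; i < n; i++) {                 // "test" box: split well / badly classified
                wrong[i] = h.classify(pairs.get(i)[0], pairs.get(i)[1]) != labels.get(i);
                if (wrong[i]) err += w[i];
            }
            ensemble.add(h);
            if (err == 0 || err >= 0.5) break;            // simplistic stopping rule (assumption)
            double alpha = 0.5 * Math.log((1 - err) / err);
            double sum = 0;
            for (int i = 0; i < n; i++) {                 // "feed back": up-weight badly classified pairs
                w[i] *= Math.exp(wrong[i] ? alpha : -alpha);
                sum += w[i];
            }
            for (int i = 0; i < n; i++) w[i] /= sum;      // renormalize
        }
        return ensemble;
    }
}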
a Strong Classifier
• Build a strong classifier by combining several weak classifiers
[Diagram: the query-data pair (<qi, oj>) set is fed to the strong classifier built from weak classifiers 1, 2, and 3; most pairs come out well classified, a few badly classified.]
Single Hash Function Optimization
Notation
• Query set Q = (Q1, Q2, …, Qq)
• Data set X = (X1, X2, …, Xn)
• Weight matrix W
• Wij =
  1, if Xj is a k-NN of Qi
  −1, if Xj is a (sampled) non-ck-NN of Qi
  0, else
Example (two queries Q1, Q2 and four data points X1–X4; k = 2, c = 3/2, sampling rate = 1) (a construction sketch follows below):
W = (  1   1   0  −1
      −1   0   1   1 )
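A small sketch that encodes the 1 / −1 / 0 rule above into W; how the non-ck-NN pairs are sampled is left to the caller, since the slides only state that they are sampled:

import java.util.*;

class WeightMatrixSketch {
    // W[i][j] = 1 if Xj is a kNN of Qi, -1 if Xj is a sampled non-ckNN of Qi, 0 otherwise.
    static int[][] buildW(int numQueries, int numData,
                          List<Set<Integer>> knnOfQuery,       // indices j with Xj in kNN(Qi)
                          List<Set<Integer>> sampledNonCkNN) { // indices j sampled as non-ckNN(Qi)
        int[][] w = new int[numQueries][numData];
        for (int i = 0; i < numQueries; i++) {
            for (int j : knnOfQuery.get(i)) w[i][j] = 1;
            for (int j : sampledNonCkNN.get(i)) w[i][j] = -1;
        }
        return w;
    }

    public static void main(String[] args) {
        // Reproduces the 2x4 example above.
        List<Set<Integer>> knn = List.of(Set.of(0, 1), Set.of(2, 3));
        List<Set<Integer>> non = List.of(Set.of(3), Set.of(0));
        System.out.println(Arrays.deepToString(buildW(2, 4, knn, non)));
        // [[1, 1, 0, -1], [-1, 0, 1, 1]]
    }
}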
Objective
• argmin_h Σij φh(<Qi, Xj>) · Wij
• = argmin_h Σij (h(Qi) − h(Xj))² · Wij
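A sketch that evaluates this objective for one candidate hash function h against the weight matrix W, so candidates can be compared; the functional interface used for h is an assumption of this sketch:

import java.util.List;
import java.util.function.ToDoubleFunction;

class ObjectiveSketch {
    // sum_ij (h(Qi) - h(Xj))^2 * Wij -- smaller is better: kNN pairs (Wij = 1) should
    // hash close together, sampled non-ckNN pairs (Wij = -1) should hash apart.
    static double objective(ToDoubleFunction<double[]> h,
                            List<double[]> queries, List<double[]> data, int[][] w) {
        double total = 0;
        for (int i = 0; i < queries.size(); i++) {
            double hq = h.applyAsDouble(queries.get(i));
            for (int j = 0; j < data.size(); j++) {
                if (w[i][j] == 0) continue;                // skip unconstrained pairs
                double d = hq - h.applyAsDouble(data.get(j));
                total += d * d * w[i][j];
            }
        }
        return total;
    }
}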