This document discusses techniques for recommender systems including multi-armed bandit (MAB), Thompson sampling, user clustering, and using item features. It provides examples of how MAB works using the ε-greedy approach and explores the tradeoff between exploration and exploitation. User clustering is presented as a way to group users based on click-through rate to improve targeting. Finally, it suggests using different item features like images, text, and collaborative filtering data as inputs to recommendation models.
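To make the exploration/exploitation tradeoff concrete, here is a minimal ε-greedy bandit sketch in Python; the arm count, simulated click-through rates, and ε value are illustrative assumptions, not figures from the source.

import random

def epsilon_greedy(n_arms=3, epsilon=0.1, n_rounds=1000, true_ctrs=(0.02, 0.05, 0.03)):
    # true_ctrs simulates the unknown per-item click-through rates (assumed values)
    counts = [0] * n_arms    # pulls per arm
    values = [0.0] * n_arms  # running mean reward per arm
    for _ in range(n_rounds):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)                     # explore: pick a random arm
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit: pick the best arm so far
        reward = 1.0 if random.random() < true_ctrs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]    # incremental mean update
    return values, counts

With more rounds the estimated values converge toward the true rates, while ε keeps a small stream of traffic on the other arms.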
The document contains log data from user activities on a platform. There are three columns: user_id, event, and event_date. It logs the activities of 5 users over several days, covering events such as logins, posts, comments, and views. It also includes some aggregated data on unique events and totals by user.
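As a hedged sketch of how such per-user aggregates could be computed, here is a pandas example; the column names follow the summary, while the sample rows are invented for illustration.

import pandas as pd

log = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 3],
    "event":      ["login", "post", "login", "view", "comment"],
    "event_date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03"],
})

# total events and distinct event types per user, mirroring the aggregates described
agg = log.groupby("user_id")["event"].agg(total="count", unique_events="nunique")
print(agg)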
This document discusses using BigQuery and Dataflow for ETL processes. It explains loading raw data from databases into BigQuery, transforming the data with Dataflow, and writing the results. It also mentions BigQuery's on-demand query price of $5 per terabyte scanned and notes that Dataflow workers provide virtual CPUs and RAM. Finally, it includes a link about performing ETL from relational databases to BigQuery.
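A minimal sketch of the load-then-transform flow with the google-cloud-bigquery client; the project, dataset, table, and bucket names are placeholders, and plain SQL stands in for the Dataflow transformation step.

from google.cloud import bigquery

client = bigquery.Client()

# 1. Load raw CSV data from Cloud Storage into a staging table (placeholder names)
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.CSV, autodetect=True)
load_job = client.load_table_from_uri(
    "gs://my-bucket/raw/events.csv", "my_project.staging.events", job_config=job_config
)
load_job.result()  # block until the load finishes

# 2. Transform and write the results (SQL stand-in for the Dataflow step)
query = """
    CREATE OR REPLACE TABLE my_project.mart.daily_events AS
    SELECT user_id, DATE(event_ts) AS day, COUNT(*) AS n_events
    FROM my_project.staging.events
    GROUP BY user_id, day
"""
client.query(query).result()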
The document discusses deep learning paper reading roadmaps and lists several GitHub repositories that aggregate deep learning papers. It also discusses developing mobile applications that use machine learning and the differences between developing for iOS and Android. Lastly, it recommends continuing to learn through practice and experimentation with deep learning techniques.
Vectorized Processing in a Nutshell (in Korean).
Presented by Hyoungjun Kim, Gruter CTO and Apache Tajo committer, at DeView 2014, Sep. 30, Seoul, Korea.
The document discusses various machine learning clustering algorithms like K-means clustering, DBSCAN, and EM clustering. It also discusses neural network architectures like LSTM, bi-LSTM, and convolutional neural networks. Finally, it presents results from evaluating different chatbot models on various metrics like validation score.
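Of the clustering algorithms named, K-means is the easiest to show end to end; below is a minimal numpy sketch (the sample data and k are illustrative assumptions).

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initialize centers from data points
    for _ in range(n_iters):
        # assign each point to its nearest center
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # move each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])  # two synthetic blobs
labels, centers = kmeans(X, k=2)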
The document discusses challenges with using reinforcement learning for robotics. While simulations allow fast training of agents, there is often a "reality gap" when transferring learning to real robots. Other approaches like imitation learning and self-supervised learning can be safer alternatives that don't require trial-and-error. To better apply reinforcement learning, robots may need model-based approaches that learn forward models of the world, as well as techniques like active localization that allow robots to gather targeted information through interactive perception. Closing the reality gap will require finding ways to better match simulations to reality or allow robots to learn from real-world experiences.
[243] Deep Learning to help students' Deep Learning (NAVER D2)
This document describes research on using deep learning to predict student performance in massive open online courses (MOOCs). It introduces GritNet, a model that takes raw student activity data as input and predicts outcomes like course graduation without feature engineering. GritNet outperforms baselines by more than 5% in predicting graduation. The document also describes how GritNet can be adapted in an unsupervised way to new courses using pseudo-labels, improving predictions in the first few weeks. Overall, GritNet is presented as the state-of-the-art for student prediction and can be transferred across courses without labels.
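The unsupervised adaptation described amounts to pseudo-labeling: a model trained on a labeled course scores a new course, and its confident predictions become training targets. A generic sketch with a scikit-learn stand-in for GritNet (the model choice, confidence threshold, and feature arrays are assumptions, not details from the source):

import numpy as np
from sklearn.linear_model import LogisticRegression

def adapt_with_pseudo_labels(X_src, y_src, X_new, confidence=0.9):
    model = LogisticRegression().fit(X_src, y_src)      # train on the labeled source course
    proba = model.predict_proba(X_new)[:, 1]            # score the unlabeled new course
    mask = (proba > confidence) | (proba < 1 - confidence)  # keep only confident predictions
    pseudo_y = (proba[mask] > 0.5).astype(int)          # confident predictions become labels
    # retrain on source labels plus pseudo-labels from the new course
    return LogisticRegression().fit(
        np.vstack([X_src, X_new[mask]]), np.concatenate([y_src, pseudo_y])
    )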
[234] Fast & Accurate Data Annotation Pipeline for AI Applications (NAVER D2)
This document provides a summary of new datasets and papers related to computer vision tasks including object detection, image matting, person pose estimation, pedestrian detection, and person instance segmentation. A total of 8 papers and their associated datasets are listed with brief descriptions of the core contributions or techniques developed in each.
[226] NAVER deep click prediction (NAVER D2)
This document presents a formula for calculating the loss function J(θ) in machine learning models. The formula averages the negative log likelihood of the predicted probabilities being correct over all samples S, and includes a regularization term λ that penalizes predicted embeddings being dissimilar from actual embeddings. It also defines the cosine similarity term used in the regularization.
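A plausible LaTeX rendering of the described loss (symbol names are assumed: \hat{p}_i is the predicted probability of the correct label, e_i and \hat{e}_i the actual and predicted embeddings):

J(\theta) = -\frac{1}{|S|} \sum_{i \in S} \log \hat{p}_i \;+\; \lambda \sum_{i \in S} \bigl(1 - \cos(e_i, \hat{e}_i)\bigr),
\qquad
\cos(e, \hat{e}) = \frac{e \cdot \hat{e}}{\lVert e \rVert \, \lVert \hat{e} \rVert}

The first term is the averaged negative log likelihood; the second penalizes predicted embeddings that are dissimilar (low cosine similarity) to the actual ones.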
The document discusses running a TensorFlow Serving (TFS) container using Docker. It shows commands to:
1. Pull the TFS Docker image from a repository
2. Define a script to configure and run the TFS container, specifying the model path, name, and port mapping
3. Run the script to start the TFS container exposing port 13377
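A sketch of those three steps as a Python script; the model path and model name are placeholders, and the host port follows the 13377 mentioned above, mapped here to TFS's default REST port 8501.

import subprocess

MODEL_PATH = "/models/my_model"   # placeholder: directory containing versioned SavedModels
MODEL_NAME = "my_model"           # placeholder model name

# 1. Pull the TensorFlow Serving image
subprocess.run(["docker", "pull", "tensorflow/serving"], check=True)

# 2-3. Run the container, mounting the model and exposing host port 13377
subprocess.run([
    "docker", "run", "-d",
    "-p", "13377:8501",                        # host 13377 -> container REST port 8501
    "-v", f"{MODEL_PATH}:/models/{MODEL_NAME}",
    "-e", f"MODEL_NAME={MODEL_NAME}",
    "tensorflow/serving",
], check=True)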
The document discusses linear algebra concepts including:
- Representing a system of linear equations as a matrix equation Ax = b where A is a coefficient matrix, x is a vector of unknowns, and b is a vector of constants.
- Solving for the vector x that satisfies the matrix equation using linear algebra techniques such as row reduction.
- Examples of matrix equations and their component vectors are shown.
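A small worked instance: for an assumed 2x2 system, numpy solves Ax = b directly (the numbers are illustrative).

import numpy as np

# 2x + y = 5
#  x - y = 1   written as Ax = b
A = np.array([[2.0,  1.0],
              [1.0, -1.0]])
b = np.array([5.0, 1.0])

x = np.linalg.solve(A, b)   # exact solve via LU factorization (row reduction)
print(x)                    # [2. 1.]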
This document describes the steps to convert a TensorFlow model to a TensorRT engine for inference. It includes steps to parse the model, optimize it, generate a runtime engine, serialize and deserialize the engine, as well as perform inference using the engine. It also provides code snippets for a PReLU plugin implementation in C++.
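Independent of the TensorRT C++ plugin API, the function a PReLU plugin computes is just max(0, x) + a*min(0, x) with a learned slope a; a minimal numpy sketch of that forward pass (the slope value is illustrative):

import numpy as np

def prelu(x, alpha=0.25):
    # PReLU: identity for x > 0, slope alpha for x <= 0
    return np.maximum(0.0, x) + alpha * np.minimum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(prelu(x))  # [-0.5 -0.125 0. 1.5]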
The document discusses machine reading comprehension (MRC) techniques for question answering (QA) systems, comparing search-based and natural language processing (NLP)-based approaches. It covers key milestones in the development of extractive QA models using NLP, from early sentence-level models to current state-of-the-art techniques like cross-attention, self-attention, and transfer learning. It notes the speed and scalability benefits of combining search and reading methods for QA.
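For reference, the cross- and self-attention those models build on is scaled dot-product attention; a minimal numpy sketch (shapes and inputs are illustrative):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # similarity of queries to keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the keys
    return weights @ V                                    # weighted sum of values

Q = np.random.randn(4, 8)   # 4 query positions, dim 8
K = np.random.randn(6, 8)   # 6 key positions
V = np.random.randn(6, 8)
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 8)

Self-attention uses Q, K, and V from the same sequence; cross-attention takes Q from the question and K, V from the passage.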
69. Coping with a high write TPS
save minhash(110) [1 DB write] = load minhash(100) [1 DB read] x compute minhash(101~110) [input buffer]
(diagram: old minhash + new items -> local minhash in memory storage -> updated minhash)
Rather than writing to the DB on every incoming item, new items are buffered and folded into the stored signature as a micro batch, so a burst of updates costs one read and one write.
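A hedged sketch of that incremental update: the minhash signature of items 1..110 is the elementwise minimum of the stored signature for 1..100 and a signature computed over the buffered items 101..110 (the hash scheme and signature length are illustrative).

import hashlib

NUM_HASHES = 16  # illustrative signature length

def h(i, item):
    # i-th hash function: salt the item with the hash index (illustrative scheme)
    return int(hashlib.md5(f"{i}:{item}".encode()).hexdigest(), 16)

def compute_minhash(items):
    return [min(h(i, x) for x in items) for i in range(NUM_HASHES)]

def merge(sig_a, sig_b):
    # minhash of a union = elementwise min of the two signatures
    return [min(a, b) for a, b in zip(sig_a, sig_b)]

stored = compute_minhash(range(1, 101))       # load minhash(100): 1 DB read
buffered = compute_minhash(range(101, 111))   # compute minhash(101~110): input buffer
updated = merge(stored, buffered)             # save minhash(110): 1 DB write
assert updated == compute_minhash(range(1, 111))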
78. Redis data structure candidates
Strings - plain K/V
Sets - support add, remove, union, intersection
Strings vs Sets: which should hold the signatures?
79. Even set-shaped data can be serialized and stored as a plain string.
sig45 -> Tom, Jerry, Robert, Jack
stored as one string value: "[Tom, Jerry, Robert, Jack]"
json.dumps(data) [write]
json.loads(data_str) [read]
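A sketch of that write/read round trip with redis-py (the host and key are placeholders):

import json
import redis

r = redis.Redis(host="localhost", port=6379)

data = ["Tom", "Jerry", "Robert", "Jack"]
r.set("sig45", json.dumps(data))        # [write] the whole set as one string value

members = json.loads(r.get("sig45"))    # [read] back into a Python list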
80. Verdict: store the signatures as Strings. Why?
Load complexity: O(1) for a string GET vs O(N) for reading a set's members
In [24]: %time gs.load_benchmark('user','key')
CPU times: user 0.32 s, sys: 0.03 s, total: 0.34 s
Wall time: 0.42 s
In [25]: %time gs.load_benchmark('user','set')
CPU times: user 32.34 s, sys: 0.13 s, total: 32.47 s
Wall time: 33.88 s
1) Loading a String is roughly 80x faster than loading a Set (0.42 s vs 33.88 s wall time above).
81. 'redis string' also supports the mget command
(multiple get at once)
N round-trip calls -> 1 call
In [9]: %timeit [redis.get(s) for s in sigs]
100 loops, best of 3: 9.99 ms per loop
In [10]: %timeit redis.mget(sigs)
1000 loops, best of 3: 759 us per loop
2) mget collapses N network round trips into one call, cutting latency from ~10 ms to ~0.76 ms in the benchmark above.