This document discusses MapReduce design patterns. It describes the core MapReduce components including the mapper, reducer, and shuffle and sort. It then outlines several common MapReduce patterns such as filtering, summarization, joins, data organization, and input/output. Specific filtering patterns like bloom filtering and top-N are explained in more detail.
8. Bloom filtering
Removing most of the non-watched values
Prefiltering a data set for an expensive set membership check
Probabilistic data structure
Membership is tested by comparing hash function outputs
Answer: "probably yes" or "no"
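To make the "probably yes or no" answer concrete, here is a minimal Python sketch of a Bloom filter. The class name, the sizing (1024 bits, 3 hashes), and the salted SHA-256 hashing are assumptions chosen for illustration; the slides do not specify an implementation.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: a bit array probed by k hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Assumption: derive k positions by salting one hash with the index i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely not added"; True only means "probably added".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


watched = BloomFilter(num_bits=1024, num_hashes=3)
for user in ["alice", "bob"]:
    watched.add(user)
```

Every added value always tests positive, so the filter never produces false negatives. A value that was never added can still test positive if all of its bit positions happen to be set by other values, which is why a "yes" is only probable.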
9. Step 1 - Filter Training
[Diagram: Input split -> Bloom Filter Training -> Output file]
Step 2 - Bloom Filtering via MapReduce
[Diagram: Input split -> Bloom Filter Mapper (load filter from distributed cache) -> Bloom Filter Test; "Maybe" -> Output file, "No" -> Discarded. The same mapper runs on every input split.]
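The two steps can be simulated in plain Python as a rough sketch. The function names, the record layout (user ID, payload), and the filter sizing are hypothetical; in a real job the trained filter would be written to a file and shipped to each mapper through the distributed cache, which is only imitated here by passing bytes around.

```python
import hashlib

NUM_BITS = 4096    # hypothetical filter size in bits
NUM_HASHES = 3     # hypothetical number of hash functions


def bit_positions(value):
    # Assumption: salted SHA-256 stands in for k independent hash functions.
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:4], "big") % NUM_BITS


def train_filter(watched_values):
    """Step 1: build the filter; the returned bytes stand in for the output file."""
    bits = bytearray(NUM_BITS // 8)
    for value in watched_values:
        for pos in bit_positions(value):
            bits[pos // 8] |= 1 << (pos % 8)
    return bytes(bits)


def filter_mapper(records, filter_bytes):
    """Step 2: map-only pass that keeps 'maybe' records and discards 'no' records."""
    for user_id, payload in records:
        if all(filter_bytes[p // 8] & (1 << (p % 8))
               for p in bit_positions(user_id)):
            yield user_id, payload


filter_bytes = train_filter(["u1", "u7"])
split = [("u1", "comment a"), ("u2", "comment b"), ("u7", "comment c")]
kept = list(filter_mapper(split, filter_bytes))
```

No reducer is needed because each record is judged independently. Records that pass may still include a few false positives, so an exact membership check can follow on the much smaller output.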
11. Top Ten
[Diagram: each Input split -> Top Ten Mapper -> local top 10; all local top 10 lists -> one Top Ten Reducer -> final top 10 -> Top 10 Output]
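The flow in this diagram can be sketched in Python as follows. The function names and the (score, label) record layout are assumptions for illustration; the point is that each mapper forwards at most ten records, so the single reducer only has to merge a handful of short lists.

```python
import heapq

N = 10


def top_n_mapper(split):
    """Each mapper keeps only its local top N records, ranked by score."""
    return heapq.nlargest(N, split, key=lambda rec: rec[0])


def top_n_reducer(local_tops):
    """A single reducer merges every local list into the final top N."""
    merged = [rec for local in local_tops for rec in local]
    return heapq.nlargest(N, merged, key=lambda rec: rec[0])


# Four hypothetical input splits that together hold the scores 0..99.
splits = [[(score, f"user{score}") for score in range(start, 100, 4)]
          for start in range(4)]
final = top_n_reducer(top_n_mapper(s) for s in splits)
```

Routing everything through one reducer is acceptable here because its input is bounded by N times the number of mappers, regardless of how large the data set is.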
17. [Diagram: each Mapper emits (keyword, unique ID) pairs -> Partitioner -> Reducers; each Reducer emits (keyword, list of IDs) records, e.g. (keyword A, list of IDs) and (keyword D, list of IDs)]
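The mapper, partitioner, and reducer roles in this diagram resemble an inverted index job, and can be sketched in Python along those lines. The function names, the whitespace tokenization, and the CRC32-based partitioner are assumptions made for this example rather than details taken from the slides.

```python
import zlib
from collections import defaultdict


def mapper(doc_id, text):
    """Emit one (keyword, unique ID) pair per distinct keyword in a record."""
    for keyword in set(text.lower().split()):
        yield keyword, doc_id


def partitioner(keyword, num_reducers):
    # crc32 is stable across runs, unlike Python's salted built-in hash().
    return zlib.crc32(keyword.encode("utf-8")) % num_reducers


def reducer(keyword, doc_ids):
    """Collapse all IDs seen for a keyword into (keyword, list of IDs)."""
    return keyword, sorted(doc_ids)


def run_job(documents, num_reducers=2):
    # Shuffle: route each pair to the partition that owns its keyword.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for doc_id, text in documents.items():
        for keyword, value in mapper(doc_id, text):
            partitions[partitioner(keyword, num_reducers)][keyword].append(value)
    index = {}
    for partition in partitions:
        for keyword in sorted(partition):
            key, ids = reducer(keyword, partition[keyword])
            index[key] = ids
    return index


docs = {1: "hadoop map reduce", 2: "hadoop pig", 3: "pig latin"}
index = run_job(docs)
```

Because the partitioner depends only on the keyword, every pair for a given keyword reaches the same reducer, which is what allows that reducer to emit the complete list of IDs.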
22. Node table: id, title, tagnames, authorized, body, node type, parent id, abs parent id, added at, score, state string, last edited id, last activity id, last activity at, activity revision, extra, extra def, extra count
User table: user id, reputation, gold, silver, bronze
25. Pig examples
-- Inner Join:
A = JOIN comments BY userID, users BY userID;
-- Outer Join (choose one of LEFT, RIGHT, or FULL):
A = JOIN comments BY userID [LEFT | RIGHT | FULL] OUTER, users BY userID;
-- Binning:
SPLIT data INTO
    eights IF col1 == 8,
    bigs IF col1 > 8,
    smalls IF (col1 < 8 AND col1 > 0);
-- Top Ten:
B = ORDER A BY col4 DESC;
C = LIMIT B 10;
-- Filtering:
b = FILTER a BY value < 3;