MapReduce
Design Patterns

Anastasiia Kornilova,
SoftServe Data Science Group

MapReduce Components
❖ record reader
❖ map
❖ combiner
❖ partitioner
❖ shuffle and sort
❖ reduce
❖ output format

[Diagram: Reader → Mapper → Combiner → Partitioner → Shuffle and sort → Reducer → Output]
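
The component chain above can be sketched as a tiny in-memory word count. This is an illustration only, not Hadoop code: every function name here is hypothetical, and the partitioner is a simple deterministic stand-in for a hash partitioner.

```python
from collections import defaultdict

def record_reader(text):
    """Turn raw input into (key, value) records: (line number, line)."""
    for offset, line in enumerate(text.splitlines()):
        yield offset, line

def mapper(_key, line):
    """Emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield word, 1

def combiner(pairs):
    """Pre-aggregate one split's map output locally to cut shuffle traffic."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return local.items()

def partitioner(key, num_reducers):
    """Pick the reducer for a key (assumed stand-in for a hash partitioner)."""
    return sum(key.encode("utf-8")) % num_reducers

def reducer(key, values):
    """Sum all partial counts for one word."""
    yield key, sum(values)

def run_job(splits, num_reducers=2):
    # One {key: [values]} bucket per reducer.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        mapped = []
        for key, line in record_reader(split):
            mapped.extend(mapper(key, line))
        for word, count in combiner(mapped):
            buckets[partitioner(word, num_reducers)][word].append(count)
    output = []
    for bucket in buckets:
        # Shuffle and sort: each reducer sees its keys in sorted order.
        for key in sorted(bucket):
            output.extend(reducer(key, bucket[key]))
    return output  # stands in for the output format step

result = run_job(["a b a", "b c"])
```

Here `dict(result)` gives the total count per word across both splits.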

MapReduce Patterns
❖ Filtering Patterns
❖ Summarization Patterns
❖ Join Patterns
❖ Data Organization Patterns
❖ Metapatterns
❖ Input and Output Patterns

Filtering patterns
❖ Filtering
❖ Bloom filtering
❖ Top-N
❖ Distinct

Filtering
❖ Closer view of data
❖ Tracking a thread of events
❖ Distributed grep
❖ Data cleansing
❖ Simple random sampling
❖ Removing low scoring data

[Diagram: Input split → Filter Mapper → Output file, repeated for each of three input splits]
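
Two of the uses listed above, distributed grep and simple random sampling, can be sketched as mapper functions that run independently on each input split. The function names and sample records are hypothetical; a real job would read its splits from the cluster file system.

```python
import random
import re

def grep_mapper(line, pattern):
    """Distributed grep: emit the line only if it matches the pattern."""
    if re.search(pattern, line):
        yield line

def sampling_mapper(line, keep_fraction, rng):
    """Simple random sampling: keep each record with probability keep_fraction."""
    if rng.random() < keep_fraction:
        yield line

def run_map_only(split, mapper, *args):
    """Apply a filter mapper to every record of one split; no reducer is involved."""
    out = []
    for line in split:
        out.extend(mapper(line, *args))
    return out

split = ["ERROR disk full", "INFO started", "ERROR timeout"]
errors = run_map_only(split, grep_mapper, r"^ERROR")
sample = run_map_only(split, sampling_mapper, 0.5, random.Random(42))
```

Each mapper decides on a record by looking at that record alone, which is why the diagram shows one independent mapper per split.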

Bloom filtering
❖ Removing most of the non-watched values
❖ Prefiltering a data set for an expensive set membership check

• Probabilistic data structure
• Works by comparing hash function values
• Answer: probably yes, or no

Step 1 - Filter Training
[Diagram: Input split → Bloom Filter Training → Output file]

Step 2 - Bloom Filtering via MapReduce
[Diagram: each Input split → Bloom Filter Mapper (loads the filter from the distributed cache) → Bloom Filter Test; "Maybe" → Output file, "No" → Discarded]
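
The two steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the `BloomFilter` class, its bit-array size and its salted SHA-256 hashing are hypothetical choices, not the Hadoop implementation, and the filter is passed in directly instead of being loaded from the distributed cache.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: answers 'maybe present' or 'definitely absent'."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive several positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "no" for certain; True only means "maybe".
        return all(self.bits[pos] for pos in self._positions(item))

def train_filter(watched_values):
    """Step 1: build the filter from the set of watched values."""
    bloom = BloomFilter()
    for value in watched_values:
        bloom.add(value)
    return bloom

def bloom_filter_mapper(record, bloom):
    """Step 2: keep a record only if its key passes the Bloom filter test."""
    user, _text = record
    if bloom.might_contain(user):
        yield record

bloom = train_filter(["alice", "bob"])
records = [("alice", "hi"), ("carol", "hello"), ("bob", "hey")]
kept = [out for rec in records for out in bloom_filter_mapper(rec, bloom)]
```

Records for `alice` and `bob` are always kept. A record for an untrained key such as `carol` is almost always dropped, but a false positive is possible, which is why the output is only a prefilter for the expensive exact check.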

Top N
❖ Outlier analysis
❖ Select interesting data
❖ Catchy dashboards

[Diagram: four Input splits → Top Ten Mappers, each emitting a local top 10 → a single Top Ten Reducer computing the final top 10 → Top 10 Output]
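
The local-then-final structure in the diagram can be sketched with `heapq.nlargest`. The record layout `(name, score)` and the function names are assumptions made for the example.

```python
import heapq

N = 10

def top_n_mapper(split, n=N):
    """Return the local top n records of one input split, ranked by score."""
    return heapq.nlargest(n, split, key=lambda rec: rec[1])

def top_n_reducer(local_tops, n=N):
    """Merge all local top-n lists and keep the global top n."""
    merged = [rec for top in local_tops for rec in top]
    return heapq.nlargest(n, merged, key=lambda rec: rec[1])

# Two hypothetical splits of (user, score) records.
splits = [
    [(f"user{i}", i) for i in range(0, 50)],
    [(f"user{i}", i) for i in range(50, 100)],
]
local_tops = [top_n_mapper(s) for s in splits]
final_top = top_n_reducer(local_tops)
```

Each mapper sends at most N records to the reducer, so a single reducer can handle the merge regardless of input size.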

Distinct
❖ Deduplicate data
❖ Getting distinct values
❖ Protecting from inner join explosions
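
A common way to express this pattern is to let the grouping step do the deduplication: the mapper emits each record as a key with an empty value, and the reducer writes each key once. The sketch below simulates that in memory; the function names are hypothetical.

```python
def distinct_mapper(record):
    """Emit the record itself as the key; the value carries no information."""
    yield record, None

def distinct_reducer(key, _values):
    """All duplicates arrive grouped under one key, so emit the key once."""
    yield key

def run_distinct(records):
    groups = {}
    for record in records:
        for key, value in distinct_mapper(record):
            groups.setdefault(key, []).append(value)
    out = []
    for key in sorted(groups):  # stands in for shuffle and sort
        out.extend(distinct_reducer(key, groups[key]))
    return out

unique_users = run_distinct(["bob", "alice", "bob", "carol", "alice"])
```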

Summarization patterns
❖ Numerical summarization
❖ Inverted index
❖ Counting with counters

Numerical summarization
❖ Word count
❖ Record count
❖ Min/Max/Count
❖ Average/Median/Standard deviation

[Diagram: Mappers emit (key, summary field) → Partitioners → Reducers emit (group B, summary), (group D, summary)]
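
The (key, summary field) flow can be sketched for min, max, count and average. The summary is kept as a `(min, max, count, sum)` tuple because such tuples can be merged in any order, which would also make the merge usable as a combiner. All names and the sample groups are hypothetical.

```python
from collections import defaultdict

def summary_mapper(record):
    """Emit (group, partial summary) with summary = (min, max, count, sum)."""
    group, value = record
    yield group, (value, value, 1, value)

def merge(a, b):
    """Combine two partial summaries; associative and commutative."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2], a[3] + b[3])

def summary_reducer(group, summaries):
    """Fold all partial summaries of one group into the final statistics."""
    total = summaries[0]
    for partial in summaries[1:]:
        total = merge(total, partial)
    lo, hi, count, value_sum = total
    yield group, {"min": lo, "max": hi, "count": count, "avg": value_sum / count}

def run_summary(records):
    grouped = defaultdict(list)
    for record in records:
        for group, summary in summary_mapper(record):
            grouped[group].append(summary)
    return dict(kv for g in sorted(grouped) for kv in summary_reducer(g, grouped[g]))

stats = run_summary([("B", 4), ("D", 10), ("B", 8), ("D", 2), ("B", 6)])
```

The average is derived from sum and count at the end because averages of averages are not mergeable. Median and standard deviation need more state than this tuple carries.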

[Diagram (inverted index): Mappers emit (keyword, unique ID) → Partitioners → Reducers emit (keyword A, list of IDs), (keyword D, list of IDs)]
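
The (keyword, unique ID) → (keyword, list of IDs) flow is the inverted index pattern. A minimal in-memory sketch, with hypothetical function names and whitespace tokenization as a simplifying assumption:

```python
from collections import defaultdict

def index_mapper(doc_id, text):
    """Emit (keyword, document ID) once per distinct keyword in the document."""
    for keyword in set(text.lower().split()):
        yield keyword, doc_id

def index_reducer(keyword, doc_ids):
    """Collect every ID seen for a keyword into one sorted list."""
    yield keyword, sorted(set(doc_ids))

def build_index(docs):
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for keyword, emitted_id in index_mapper(doc_id, text):
            grouped[keyword].append(emitted_id)
    return dict(kv for k in sorted(grouped) for kv in index_reducer(k, grouped[k]))

index = build_index({1: "hadoop pig", 2: "pig latin", 3: "hadoop hadoop"})
```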

Data Organization Patterns
❖ Structured to Hierarchical
❖ Partitioning
❖ Binning
❖ Total Order Sorting
❖ Shuffling

Join patterns
❖ Reduce Side Join
❖ Replicated Join
❖ Composite Join
❖ Cartesian Product

[Diagram (reduce side join): Data Set A input splits → Join Mappers emit (key, values A); Data Set B input splits → Join Mappers emit (key, values B); both streams → Shuffle and sort → Join Reducers → Output parts]
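
The diagram can be sketched as an inner reduce side join: mappers tag each record with its source data set, shuffle and sort groups records by join key, and the reducer pairs the A rows with the B rows of each key. The function names, field names and sample rows are hypothetical.

```python
from collections import defaultdict

def join_mapper(record, source, key_field):
    """Emit (join key, (source tag, record)) so the reducer can tell A from B."""
    yield record[key_field], (source, record)

def join_reducer(_key, tagged):
    """Inner join: pair every A record with every B record sharing the key."""
    a_rows = [rec for tag, rec in tagged if tag == "A"]
    b_rows = [rec for tag, rec in tagged if tag == "B"]
    for a in a_rows:
        for b in b_rows:
            yield {**a, **b}

def reduce_side_join(data_a, data_b, key_field):
    grouped = defaultdict(list)
    for source, data in (("A", data_a), ("B", data_b)):
        for record in data:
            for key, tagged in join_mapper(record, source, key_field):
                grouped[key].append(tagged)
    out = []
    for key in sorted(grouped):  # stands in for shuffle and sort
        out.extend(join_reducer(key, grouped[key]))
    return out

users = [{"userID": 1, "reputation": 50}, {"userID": 2, "reputation": 7}]
comments = [
    {"userID": 1, "body": "nice"},
    {"userID": 1, "body": "thanks"},
    {"userID": 3, "body": "orphan"},
]
joined = reduce_side_join(users, comments, "userID")
```

Keys present in only one data set (user 2, comment author 3) produce no output here. An outer join would instead emit the unmatched rows with the missing side left empty.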

Node table
id, title, tagnames, authorized, body, node type, parent id, abs parent id, added at, score, state string, last edited id, last activity id, last activity at, activity revision, extra, extra def, extra count

User table
user id, reputation, gold, silver, bronze

Pig examples
-- Inner Join:
A = JOIN comments BY userID, users BY userID;

-- Outer Join (choose one of LEFT, RIGHT, FULL):
A = JOIN comments BY userID [LEFT | RIGHT | FULL] OUTER, users BY userID;

-- Binning:
SPLIT data INTO
    eights IF col1 == 8,
    bigs IF col1 > 8,
    smalls IF (col1 < 8 AND col1 > 0);

-- Top Ten:
B = ORDER A BY col4 DESC;
C = LIMIT B 10;

-- Filtering:
b = FILTER a BY value < 3;
