MapReduce
Design Patterns

Anastasiia Kornilova,
SoftServe Data Science Group
MapReduce Components

record reader

map

combiner

partitioner

shuffle and sort

reduce

output format
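
The component chain above can be imitated in a few lines of plain Python. This is only an in-memory sketch, not Hadoop code: the function names (`record_reader`, `map_fn`, `combine`, `partition`, `reduce_fn`, `run_job`) and the word count task are assumptions chosen for illustration, and the partitioner uses a toy hash.

```python
from collections import defaultdict
from itertools import groupby


def record_reader(text):
    """Turn raw input into (key, value) records: here (line number, line)."""
    for offset, line in enumerate(text.splitlines()):
        yield offset, line


def map_fn(_key, line):
    """Map: emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def combine(pairs):
    """Combiner: pre-aggregate one mapper's output before it leaves the node."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def partition(key, num_reducers):
    """Partitioner: pick the reducer for a key (toy deterministic hash)."""
    return sum(key.encode("utf-8")) % num_reducers


def reduce_fn(key, values):
    """Reduce: sum the partial counts for one key."""
    return key, sum(values)


def run_job(splits, num_reducers=2):
    buckets = [[] for _ in range(num_reducers)]
    for split in splits:  # one mapper per input split
        mapped = [kv for k, v in record_reader(split) for kv in map_fn(k, v)]
        for key, value in combine(mapped):
            buckets[partition(key, num_reducers)].append((key, value))
    output = []
    for bucket in buckets:  # one reducer per partition
        bucket.sort(key=lambda kv: kv[0])  # stands in for shuffle and sort
        for key, group in groupby(bucket, key=lambda kv: kv[0]):
            output.append(reduce_fn(key, [v for _, v in group]))
    return output  # an output format would write these pairs to files


result = dict(run_job(["a b a", "b c"]))
```

In a real cluster the shuffle moves data between machines; here a list per reducer plays that role.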
MapReduce Patterns


Filtering Patterns



Summarization Patterns



Join Patterns



Data Organization Patterns



Metapatterns



Input and Output Patterns
Filtering patterns



Filtering



Bloom filtering



Top-N



Distinct
Filtering


Closer view of data



Tracking a thread of events



Distributed grep



Data cleansing



Simple random sampling



Removing low scoring data
Diagram: input split -> Filter Mapper -> output file (three parallel instances, one per input split)
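
A filter mapper needs no reducer: each record is either emitted unchanged or dropped. The sketch below illustrates two of the listed uses, distributed grep and simple random sampling. The function names and sample log lines are hypothetical, and each Python list stands in for one input split.

```python
import random
import re


def grep_mapper(records, pattern):
    """Distributed grep: keep only records that match the regular expression."""
    regex = re.compile(pattern)
    for record in records:
        if regex.search(record):
            yield record


def sampling_mapper(records, fraction, seed=None):
    """Simple random sampling: keep each record with probability `fraction`."""
    rng = random.Random(seed)
    for record in records:
        if rng.random() < fraction:
            yield record


split = ["INFO start", "ERROR disk full", "INFO done", "ERROR timeout"]
errors = list(grep_mapper(split, r"^ERROR"))
```

Because every mapper decides independently, the job scales with the number of input splits and each mapper writes its own output file.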
Bloom filtering


Removing most of the non-watched
values



Prefiltering a data set for an
expensive set membership
check





Probabilistic data structure
Membership is tested by comparing hash function outputs
Answer: probably yes, or definitely no
Step 1 - Bloom Filter Training

Diagram: input split -> Bloom Filter Training -> output file

Step 2 - Bloom Filtering via MapReduce

Diagram: input split -> Bloom Filter Mapper (loads the filter from the distributed cache) -> Bloom Filter Test -> "maybe": written to the output file; "no": discarded (two parallel instances shown)
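
The two steps can be sketched in plain Python as follows. The `BloomFilter` class, its sizes, and the sample user names are assumptions made for illustration. A Hadoop job would typically serialize the trained filter to a file and ship it to the mappers through the distributed cache, whereas here it is simply passed as an argument.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: k hash positions per item in a fixed bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive several hash functions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means definitely absent; True means only "maybe present".
        return all(self.bits[pos] for pos in self._positions(item))


def train(watched_values):
    """Step 1: build the filter from the set of watched values."""
    bloom = BloomFilter()
    for value in watched_values:
        bloom.add(value)
    return bloom


def bloom_mapper(records, bloom, key_of):
    """Step 2: emit records whose key passes the test, discard the rest."""
    for record in records:
        if bloom.might_contain(key_of(record)):
            yield record


hot_users = train(["alice", "bob"])
comments = [("alice", "hi"), ("carol", "hello"), ("bob", "hey")]
kept = list(bloom_mapper(comments, hot_users, key_of=lambda r: r[0]))
```

Records for watched values are never lost, but a few unwatched ones may slip through as false positives, which is why the pattern suits prefiltering ahead of an exact, more expensive membership check.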
Top N


Outlier analysis



Select interesting data



Catchy dashboards
Diagram: input split -> Top Ten Mapper -> local top 10 (four parallel instances); all local top 10 lists -> a single Top Ten Reducer -> final top 10 -> Top 10 output
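
The flow above works because the global top N must be contained in the union of the local top N lists. A minimal sketch, assuming records are hypothetical `(user, score)` tuples and using N = 2 to keep the example small:

```python
import heapq


def top_n_mapper(records, n, score_of):
    """Each mapper keeps only the n best records of its own input split."""
    return heapq.nlargest(n, records, key=score_of)


def top_n_reducer(local_tops, n, score_of):
    """A single reducer merges every local list into the final top n."""
    merged = [record for local in local_tops for record in local]
    return heapq.nlargest(n, merged, key=score_of)


def score(record):
    return record[1]


splits = [
    [("u1", 5), ("u2", 40), ("u3", 12)],
    [("u4", 33), ("u5", 7)],
    [("u6", 21), ("u7", 50), ("u8", 1)],
]
local = [top_n_mapper(split, 2, score) for split in splits]
final = top_n_reducer(local, 2, score)
```

Only N records per mapper reach the reducer, so the single reducer stays cheap as long as N is small.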
Distinct


Deduplicate data



Getting distinct values



Protecting against inner join
explosions
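
One common way to express Distinct in MapReduce is to emit each record as a key with an empty value and let the grouping step collapse the duplicates. The sketch below assumes that approach. The function names are illustrative, and a dictionary stands in for shuffle and sort.

```python
def distinct_mapper(records):
    """Emit each record as the key, with an empty value."""
    for record in records:
        yield record, None


def distinct_reducer(key, _values):
    """All duplicates of a key arrive together; emit the key once."""
    return key


def run_distinct(splits):
    groups = {}
    for split in splits:
        for key, value in distinct_mapper(split):
            groups.setdefault(key, []).append(value)  # stands in for shuffle
    return [distinct_reducer(key, values) for key, values in sorted(groups.items())]


unique_ids = run_distinct([[3, 1, 3], [2, 1]])
```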
Summarization patterns


Numerical summarization



Inverted index



Counting with counters
Numerical summarization



Word count



Record count



Min/Max/Count



Average/Median/Standard
deviation
Diagram: Mapper -> (key, summary field) pairs -> Partitioner -> Reducer -> (group, summary) pairs, e.g. (group B, summary), (group D, summary); three mappers and two reducers shown
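
A minimal sketch of the Min/Max/Count case, with the average carried as a running total and count, since averaging partial averages is not correct in general. The tuple layout and function names are assumptions for illustration. The `merge` function is associative and commutative, so the same logic could serve as combiner and reducer.

```python
def summarize_mapper(records):
    """Emit (key, (min, max, count, total)) for each (key, value) record."""
    for key, value in records:
        yield key, (value, value, 1, value)


def merge(a, b):
    """Merge two partial summaries; order and grouping do not matter."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2], a[3] + b[3])


def summarize(splits):
    partial = {}
    for split in splits:
        for key, summary in summarize_mapper(split):
            if key in partial:
                partial[key] = merge(partial[key], summary)
            else:
                partial[key] = summary
    return {
        key: {"min": lo, "max": hi, "count": n, "mean": total / n}
        for key, (lo, hi, n, total) in partial.items()
    }


stats = summarize([[("B", 4), ("D", 10)], [("B", 8), ("D", 2), ("B", 6)]])
```

The median cannot be merged from fixed-size partial summaries like this, so it needs the values themselves, or their counts, at the reducer.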
Inverted index
Diagram: Mapper -> (keyword, unique ID) pairs -> Partitioner -> Reducer -> (keyword, list of IDs) pairs, e.g. (keyword A, list of IDs), (keyword D, list of IDs); three mappers and two reducers shown
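
The (keyword, unique ID) flow above can be sketched as follows. The document IDs and texts are made up for the example, keywords are split on whitespace only, and a dictionary stands in for the partitioner and shuffle.

```python
from collections import defaultdict


def index_mapper(doc_id, text):
    """Emit (keyword, unique ID) once per distinct keyword in a document."""
    for keyword in set(text.lower().split()):
        yield keyword, doc_id


def index_reducer(keyword, doc_ids):
    """Collect all IDs for one keyword into a sorted list."""
    return keyword, sorted(set(doc_ids))


def build_index(documents):
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for keyword, emitted_id in index_mapper(doc_id, text):
            grouped[keyword].append(emitted_id)
    return dict(index_reducer(keyword, ids) for keyword, ids in grouped.items())


index = build_index({1: "hadoop pig", 2: "pig hive", 3: "Hadoop hive pig"})
```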
Data Organization Patterns


Structured to Hierarchical



Partitioning



Binning



Total Order Sorting



Shuffling
Join patterns



Reduce Side Join



Replicated Join



Composite Join



Cartesian Product
Diagram (reduce side join): Data Set A input splits -> Join Mappers -> (key, values A); Data Set B input splits -> Join Mappers -> (key, values B); both streams -> shuffle and sort -> Join Reducers -> output parts
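
A reduce side join along these lines tags each record with its source data set, groups by the join key, and pairs the two sides in the reducer. The sketch below shows an inner join only. The sample `users` and `comments` tuples are hypothetical, and a dictionary stands in for shuffle and sort.

```python
from collections import defaultdict


def join_mapper(records, tag, key_of):
    """Tag each record with its data set and emit it under the join key."""
    for record in records:
        yield key_of(record), (tag, record)


def join_reducer(_key, tagged):
    """Inner join: pair every A record with every B record for this key."""
    a_side = [record for tag, record in tagged if tag == "A"]
    b_side = [record for tag, record in tagged if tag == "B"]
    return [(a, b) for a in a_side for b in b_side]


def reduce_side_join(data_a, data_b, key_a, key_b):
    grouped = defaultdict(list)  # stands in for shuffle and sort
    for key, value in join_mapper(data_a, "A", key_a):
        grouped[key].append(value)
    for key, value in join_mapper(data_b, "B", key_b):
        grouped[key].append(value)
    joined = []
    for key in sorted(grouped):
        joined.extend(join_reducer(key, grouped[key]))
    return joined


users = [(1, "ann"), (2, "bo")]
comments = [(10, 1, "nice"), (11, 1, "thanks"), (12, 3, "orphan")]
rows = reduce_side_join(users, comments, lambda u: u[0], lambda c: c[1])
```

Keeping unmatched records from one or both sides in the reducer would turn this into a left, right, or full outer join.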
Node table

id, title, tagnames, authorized, body, node type, parent id, abs parent id, added at, score, state string, last edited id, last activity id, last activity at, activity revision, extra, extra def, extra count

User table

user id, reputation, gold, silver, bronze
Pig examples
-- Inner Join:
A = JOIN comments BY userID, users BY userID;

-- Outer Join (LEFT, RIGHT, or FULL):
A = JOIN comments BY userID LEFT OUTER, users BY userID;

-- Binning:
SPLIT data INTO
    eights IF col1 == 8,
    bigs IF col1 > 8,
    smalls IF (col1 < 8 AND col1 > 0);

-- Top Ten:
B = ORDER A BY col4 DESC;
C = LIMIT B 10;

-- Filtering:
b = FILTER a BY value < 3;
