
MapReduce
Presented by Zilong Tan

The Data Model
● A (logical) file is a string $a_1 a_2 \dots a_n$, where each $a_j$ is a substring.

E.g.: “Hello\nworld!” → $a_1$ = “Hello”, $a_2$ = “world!”, where “\n” is the separator.

● Q1: How to split a file equally?
○ E.g.: $a_1 a_2 \dots a_{2n}$ → $a_1 a_2 \dots a_n$ and $a_{n+1} a_{n+2} \dots a_{2n}$.

● Q2: What about splitting the file into more segments?
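One possible answer to Q1 and Q2, as a minimal sketch: split the string at the separator to get the substrings, then cut that list into contiguous segments whose lengths differ by at most one. The function names `split_records` and `split_segments` are hypothetical, and "equal" here means equal in number of substrings, not in bytes.

```python
def split_records(data: str, sep: str = "\n") -> list[str]:
    """Split the logical file into its substrings a_1 ... a_n (empty ones dropped)."""
    return [record for record in data.split(sep) if record]


def split_segments(records: list[str], parts: int) -> list[list[str]]:
    """Cut a_1 ... a_n into `parts` contiguous segments of near-equal length."""
    base, extra = divmod(len(records), parts)
    segments, start = [], 0
    for i in range(parts):
        # The first `extra` segments take one additional record.
        size = base + (1 if i < extra else 0)
        segments.append(records[start:start + size])
        start += size
    return segments
```

With 2n substrings and `parts=2` this yields the split from Q1; a larger `parts` covers Q2.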

The Map($a_j$) Function
● Map: $a_j$ → {(key($a_j$), val($a_j$))}
● key($a_j$) and val($a_j$) are strings.
E.g.: Map(“Hello”) = {(“Hello”, “1”)},
Map(“Hello world”) = {(“Hello”, “1”), (“world”, “1”)},
or, with a different Map, Map(“Hello world”) = {(“world”, “Hello”)}.
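The examples above can be sketched as two different Map functions. Both names (`map_word_count`, `map_last_first`) are hypothetical, and whitespace is assumed to separate words.

```python
def map_word_count(record: str) -> list[tuple[str, str]]:
    """Emit (word, "1") for every whitespace-separated word in the record."""
    return [(word, "1") for word in record.split()]


def map_last_first(record: str) -> list[tuple[str, str]]:
    """A different Map: the key is the last word, the value is the first word."""
    words = record.split()
    return [(words[-1], words[0])] if words else []
```

Keys and values stay strings in both cases, as the slide requires.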

Contd.
● The input file $a_1 a_2 \dots a_m$ is organized as
              Value 1       Value 2       Value 3       ...
key($a_1$)    val($a_1$)    val($a_7$)    val($a_2$)    #
key($a_5$)    val($a_n$)    val($a_5$)    #
key($a_3$)    val($a_m$)    val($a_2$)    val($a_3$)    ...
...

Contd.
Each row of the table shares the same key.

Contd.
Mistake! $a_2$ cannot appear in two rows (in the table above, val($a_2$) is listed in both the key($a_1$) row and the key($a_3$) row).
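A minimal sketch of how such a table could be built from Map output: group the (key, value) pairs so that each key owns exactly one row. The name `build_table` is hypothetical, and a dictionary of lists stands in for the table.

```python
from collections import defaultdict


def build_table(pairs: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (key, value) pairs into rows: one row per distinct key."""
    table = defaultdict(list)
    for key, value in pairs:
        # Each emitted value lands in the single row that belongs to its key.
        table[key].append(value)
    return dict(table)
```

Because a pair is appended only to the row of its own key, a given emitted value cannot end up in two rows.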

The Reduce($k, v_1, v_2, \dots, v_d$) Function
● Reduce: $(k, v_1, v_2, \dots, v_d)$ → $v$.
E.g.: Reduce(“Hello”, “2”, “1”, “5”) = “Hello 8”. (WordCount)
● The arguments are a row of the table: key($s$), val($s_1$), val($s_2$), ..., val($s_d$).
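The WordCount example above can be sketched as follows. The name `reduce_word_count` is hypothetical; values are assumed to be decimal strings.

```python
def reduce_word_count(key: str, *values: str) -> str:
    """Reduce one row (k, v_1, ..., v_d) to a single string "k total"."""
    total = sum(int(v) for v in values)
    return f"{key} {total}"
```

Calling `reduce_word_count("Hello", "2", "1", "5")` reproduces the slide's result, “Hello 8”.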

Contd.
           Value 1    Value 2    Value 3
“world”    “2”        “11”       “3”
“hello”    “10”       “0”        “5”

Contd.
Sort (rows by key, values within each row):
“hello”    “0”    “5”    “10”
“world”    “2”    “3”    “11”

Contd.
“hello”    “0”    “5”    “10”    → Reduce()
“world”    “2”    “3”    “11”    → Reduce()
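The sort-then-reduce steps on this small table can be sketched as below. The names `sort_table` and `reduce_sum` are hypothetical; values are assumed to be decimal strings and are compared numerically, which matches the order shown on the slide.

```python
def sort_table(table: dict[str, list[str]]) -> list[tuple[str, list[str]]]:
    """Sort rows by key, and the values of each row numerically."""
    return [(key, sorted(values, key=int)) for key, values in sorted(table.items())]


def reduce_sum(key: str, values: list[str]) -> str:
    """Apply a WordCount-style Reduce() to one sorted row."""
    return f"{key} {sum(int(v) for v in values)}"


rows = sort_table({"world": ["2", "11", "3"], "hello": ["10", "0", "5"]})
results = [reduce_sum(key, values) for key, values in rows]
```

Here `rows` holds the sorted table from the slide, and `results` holds one reduced value per row.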

Parallel Computation
● The table we have seen is global.
● A Map node is assigned a file segment $s_j s_{j+1} \dots s_{j+k}$, and executes Map() on each substring $s_i$ in it.
● A Reduce node is associated with one or more rows of
the table, and executes Reduce() on each associated
row.
● Map() and Reduce() execute concurrently on multiple
machines.
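As a rough single-machine illustration of the points above, threads can stand in for Map and Reduce nodes. Everything here is an assumption for the sketch: the names `map_segment`, `reduce_row`, and `run`, the WordCount-style Map and Reduce, and the use of a thread pool instead of separate machines. Unlike a real deployment, the two phases run one after the other, with parallelism only inside each phase.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_segment(segment: list[str]) -> list[tuple[str, str]]:
    """One Map node: run Map() on every substring of its file segment."""
    return [(word, "1") for record in segment for word in record.split()]


def reduce_row(row: tuple[str, list[str]]) -> tuple[str, str]:
    """One Reduce task: collapse a single row of the table."""
    key, values = row
    return key, str(sum(int(v) for v in values))


def run(segments: list[list[str]], workers: int = 4) -> dict[str, str]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map phase: one task per file segment.
        mapped = list(pool.map(map_segment, segments))
        # Build the global table: one row per key.
        table = defaultdict(list)
        for pairs in mapped:
            for key, value in pairs:
                table[key].append(value)
        # Reduce phase: one task per row.
        reduced = list(pool.map(reduce_row, sorted(table.items())))
    return dict(reduced)
```

For example, `run([["cat dog"], ["dog bird", "cat"]])` processes two segments and returns one count per word.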

WordCount Example
● Input: $w = w_1 w_2 \dots w_k$.
● Map($w$) = {($w_j$, “1”)}, $j = 1, 2, \dots, k$.
● Reduce($w, v_1, v_2, \dots, v_d$) = ($w$, $\sum_j v_j$).

Contd.
● $w$ = “cat … dog … bird …”.
“cat”     “1”    “1”    “1”    ...
“dog”     “1”    “1”    “1”    ...
“bird”    “1”    “1”    “1”    ...

Contd.
Rows sorted by key:
“bird”    “1”    “1”    “1”    ...
“cat”     “1”    “1”    “1”    ...
“dog”     “1”    “1”    “1”    ...

Contd.
After Reduce():
          Value 1
“bird”    “39”
“cat”     “20”
“dog”     “11”
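The whole WordCount walk-through (Map, sort by key, Reduce each row) fits in a few lines when run sequentially. This is only a sketch: `word_count` is a hypothetical name, words are assumed to be whitespace-separated, and `itertools.groupby` plays the role of the sorted table by yielding one group per key.

```python
from itertools import groupby


def word_count(text: str) -> list[tuple[str, str]]:
    # Map: emit (w_j, "1") for every word.
    pairs = [(word, "1") for word in text.split()]
    # Sort by key so that equal keys become adjacent (one row per key).
    pairs.sort(key=lambda kv: kv[0])
    # Reduce: sum the values of each row.
    return [
        (key, str(sum(int(value) for _, value in group)))
        for key, group in groupby(pairs, key=lambda kv: kv[0])
    ]
```

The result is a list of (word, count) pairs ordered by key, the same shape as the final table.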

The Bursting I/O Problem
● Let N be the file size.
● What would be the table size?

● At least Ω(N).
○ Each word in the input file corresponds to a value in the table.
● Too much I/O traffic!

The $\mathrm{Combine}_k(v_1, v_2, \dots, v_d)$ Function
● Goal: to reduce the table size.
● Assumptions:
$\mathrm{Combine}_k(v) = v$,
$\mathrm{Combine}_k(v_1, \dots, v_d) = \mathrm{Combine}_k(\mathrm{Combine}_k(v_1, \dots, v_{d-1}), v_d)$,
$\mathrm{Reduce}(k, v_1, v_2, \dots, v_d) = \mathrm{Reduce}(k, \mathrm{Combine}_k(v_1, \dots, v_d))$.

● Table size reduction ($m$ Map nodes):
$\mathrm{Reduce}(k, v_1, v_2, \dots, v_d) = \mathrm{Reduce}(k, \mathrm{Combine}_k(v_1, \dots, v_{d/m}), \mathrm{Combine}_k(v_{d/m+1}, \dots, v_{2d/m}), \dots)$.
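For WordCount, a combiner that adds partial counts satisfies these assumptions. The sketch below is illustrative only: `combine`, `reduce_sum`, and `combined_row` are hypothetical names, values are assumed to be decimal strings, and a row is assumed to be non-empty and split into contiguous chunks, one per Map node.

```python
from functools import reduce as fold


def combine(*values: str) -> str:
    """Combine_k for WordCount: fold the values left to right into one partial sum."""
    # With a single value, fold returns it unchanged: Combine_k(v) = v.
    return fold(lambda acc, v: str(int(acc) + int(v)), values)


def reduce_sum(key: str, *values: str) -> tuple[str, str]:
    """Reduce for WordCount: (k, sum of the values)."""
    return key, str(sum(int(v) for v in values))


def combined_row(values: list[str], m: int) -> list[str]:
    """Replace a non-empty row of d values by at most m combined values."""
    size = -(-len(values) // m)  # ceil(d / m) values per Map node
    return [combine(*values[i:i + size]) for i in range(0, len(values), size)]
```

Reducing the combined row gives the same result as reducing the original row, while the row shrinks from d values to at most m.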

Contd.
● Assume m map nodes:
○ Best case: each map node has a combiner.
○ Minimum possible space: ?(m).
“bird” “300” “351” “310” ...
“cat” “109” “1112” “207” ...
“dog” “4” “2” “3” ...

The Partition(k,M) Function
● How to assign rows to reduce nodes?
● Partition: key → node.
● Typically
Partition($k$, $M$) = HashFunction($k$) mod $M$, where $M$ is the number of Reduce nodes.
E.g., if HashFunction(“cat”) = 1:
Partition(“cat”, 5) = 1 mod 5 = 1.
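A minimal sketch of such a partition function. The choice of CRC-32 as HashFunction is an assumption made for this example, not something the slides specify; it is used because Python's built-in `hash()` for strings varies between processes, whereas every node must map a given key to the same Reduce node.

```python
import zlib


def partition(key: str, num_nodes: int) -> int:
    """Map a key to a Reduce node index in [0, num_nodes)."""
    # CRC-32 is deterministic across processes and machines.
    return zlib.crc32(key.encode("utf-8")) % num_nodes
```

All rows with the same key therefore go to the same node, regardless of which Map node produced them.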

Discussion
● Data Skew Problem
○ A particular Reduce node is assigned many more rows than others.
● Binary File Support
○ What would happen if the file is a binary string?
○ Propose a solution.
● Straggler Detection
○ A Reduce node runs longer than usual.
○ Identify if it is due to a machine-related issue.


A Computational View of MapReduce
