This document summarizes MapReduce, a programming model created by Google to simplify large-scale data processing across clusters of computers. Users express a computation as a map function that turns input key/value pairs into intermediate key/value pairs and a reduce function that merges the values associated with each intermediate key, while the runtime handles parallelization, data distribution, and load balancing. Examples of problems that fit this model include distributed grep, counting URL access frequencies, and building inverted indexes.
3. Background
Transformation operations are conceptually straightforward
Until the data is large and the computation must be
distributed over hundreds or thousands of machines
So, Google created MapReduce
MapReduce is a programming abstraction
Expresses simple computations
Hides complexity details
4. Model
Uses the higher-order functions Map and Reduce to
take a set of input key/value pairs and produce a set of
output key/value pairs
Map
Takes an input key/value pair and produces a set of
intermediate key/value pairs
Reduce
Accepts an intermediate key I and a set of values for
that key, and merges those values to form a possibly
smaller set of values
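The canonical illustration of this model is counting word occurrences in a collection of documents: map emits (word, 1) for every word it sees, and reduce sums the counts for each word. Below is a minimal, single-machine Python sketch of that example; the map_fn/reduce_fn names and the tiny in-memory run_mapreduce driver are illustrative stand-ins for a real distributed runtime.

    from collections import defaultdict

    def map_fn(key, value):
        # key: document name, value: document contents
        for word in value.split():
            yield (word, 1)

    def reduce_fn(key, values):
        # key: a word, values: all counts emitted for that word
        yield (key, sum(values))

    def run_mapreduce(inputs, map_fn, reduce_fn):
        # Toy in-memory driver: groups intermediate pairs by key,
        # then applies reduce_fn to each group. A real MapReduce
        # runtime does this work across thousands of machines.
        intermediate = defaultdict(list)
        for k, v in inputs:
            for ik, iv in map_fn(k, v):
                intermediate[ik].append(iv)
        output = []
        for ik, ivs in intermediate.items():
            output.extend(reduce_fn(ik, ivs))
        return output

    docs = [("doc1", "the quick brown fox"),
            ("doc2", "the lazy dog jumps over the fox")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # e.g. [('the', 3), ('quick', 1), ('brown', 1), ('fox', 2), ('lazy', 1), ...]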
5. Examples
Distributed Grep
Count of URL Access Frequency (sketched below)
Reverse Web-Link Graph (sketched below)
Term-Vector per Host
Inverted Index
Distributed Sort
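Two of these examples, Count of URL Access Frequency and Reverse Web-Link Graph, are expressed below as map/reduce pairs in the same sketch style (runnable with the run_mapreduce driver above). The input formats assumed here, a request log whose first field is the URL and a per-page list of outgoing links, are illustrative assumptions; the paper only specifies the key/value pairs involved.

    # Count of URL Access Frequency:
    # map emits (URL, 1) for each logged request; reduce sums per URL.
    def url_count_map(key, log_line):
        url = log_line.split()[0]   # assumed: URL is the first field of the log line
        yield (url, 1)

    def url_count_reduce(url, counts):
        yield (url, sum(counts))

    # Reverse Web-Link Graph:
    # map emits (target, source) for every link found in a page;
    # reduce collects all sources pointing at a given target.
    def reverse_links_map(source_url, target_links):
        for target in target_links:   # assumed: list of URLs linked from the page
            yield (target, source_url)

    def reverse_links_reduce(target, sources):
        yield (target, list(sources))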
7. Conclusions
The MapReduce programming model proved to be a useful
abstraction for many different purposes
Easy to use
even for programmers without experience with
parallel and distributed systems
A large variety of problems are easily expressible as
MapReduce computations
The implementation scales to large clusters of machines
Greatly simplifies large-scale computations at Google