This document summarizes MapReduce, a programming model created by Google to simplify large-scale data processing across clusters of computers. Users express a computation as a map function that turns input key/value pairs into intermediate key/value pairs and a reduce function that merges the values associated with each intermediate key, while the runtime handles parallelization, data distribution, and load balancing. Examples of problems that fit this model include distributed grep, counting URL access frequencies, and building inverted indexes.
3. Background
Transformation operations are conceptually straightforward
Until the data is large and the computation must be
distributed over hundreds or thousands of machines
So, Google created MapReduce
MapReduce is a programming abstraction
Expresses simple computations
Hides complexity details
4. Model
Uses the higher-order functions Map and Reduce to
take a set of input key/value pairs and produce a set of
output key/value pairs
Map
Takes an input key/value pair and produces a set of
intermediate key/value pairs
Reduce
Accepts an intermediate key I and a set of values for
that key, and merges those values to form a possibly
smaller set of values
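The canonical illustration of this model is counting word occurrences in a collection of documents: map emits (word, 1) for every word it sees, and reduce sums the counts for each word. Below is a minimal, single-machine Python sketch of that example; the map_fn/reduce_fn names and the tiny in-memory run_mapreduce driver are illustrative stand-ins for a real distributed runtime.

    from collections import defaultdict

    def map_fn(key, value):
        # key: document name, value: document contents
        for word in value.split():
            yield (word, 1)

    def reduce_fn(key, values):
        # key: a word, values: all counts emitted for that word
        yield (key, sum(values))

    def run_mapreduce(inputs, map_fn, reduce_fn):
        # Toy in-memory driver: groups intermediate pairs by key,
        # then applies reduce_fn to each group. A real MapReduce
        # runtime does this work across thousands of machines.
        intermediate = defaultdict(list)
        for k, v in inputs:
            for ik, iv in map_fn(k, v):
                intermediate[ik].append(iv)
        output = []
        for ik, ivs in intermediate.items():
            output.extend(reduce_fn(ik, ivs))
        return output

    docs = [("doc1", "the quick brown fox"),
            ("doc2", "the lazy dog jumps over the fox")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # e.g. [('the', 3), ('quick', 1), ('brown', 1), ('fox', 2), ('lazy', 1), ...]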
5. Examples
Distributed Grep
Count of URL Access Frequency (sketched below)
Reverse Web-Link Graph (sketched below)
Term-Vector per Host
Inverted Index
Distributed Sort
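Two of these examples, Count of URL Access Frequency and Reverse Web-Link Graph, are expressed below as map/reduce pairs in the same sketch style (runnable with the run_mapreduce driver above). The input formats assumed here, a request log whose first field is the URL and a per-page list of outgoing links, are illustrative assumptions; the paper only specifies the key/value pairs involved.

    # Count of URL Access Frequency:
    # map emits (URL, 1) for each logged request; reduce sums per URL.
    def url_count_map(key, log_line):
        url = log_line.split()[0]   # assumed: URL is the first field of the log line
        yield (url, 1)

    def url_count_reduce(url, counts):
        yield (url, sum(counts))

    # Reverse Web-Link Graph:
    # map emits (target, source) for every link found in a page;
    # reduce collects all sources pointing at a given target.
    def reverse_links_map(source_url, target_links):
        for target in target_links:   # assumed: list of URLs linked from the page
            yield (target, source_url)

    def reverse_links_reduce(target, sources):
        yield (target, list(sources))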
7. Conclusions
The MapReduce programming model proved to be a useful
abstraction for many different purposes
Easy to use
even for programmers without experience with
parallel and distributed systems
A large variety of problems are easily expressible as
MapReduce computations
The implementation scales to large clusters of machines
Greatly simplifies large-scale computations at Google