MAP REDUCE
PROGRAMMING
Dr G Sudha Sadasivam
Map-reduce
• Sort/merge-based distributed processing
• Best for batch-oriented processing
• Sort/merge is the primitive
  – Operates at transfer rate (process + data on the same clusters)
• Simple programming metaphor:
  – input | map | shuffle | reduce > output
  – cat * | grep | sort | uniq -c > file
• Pluggable user code runs in a generic, reusable framework
  – log processing
  – web search indexing
  – SQL-like queries in PIG
• Distribution & reliability
  – Handled by the framework - transparency
MR model
• Map()
  – Processes a key/value pair to generate intermediate key/value pairs
• Reduce()
  – Merges all intermediate values associated with the same key
• Users implement an interface with two primary methods:
  1. Map: (key1, val1) → (key2, val2)
  2. Reduce: (key2, [val2]) → [val3]
• Map - corresponds to the group-by clause (on the key) of an aggregate SQL function
• Reduce - corresponds to the aggregate function (e.g., average) computed over all rows with the same group-by attribute (key)
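As a concrete illustration of the two signatures above, here is a minimal Java sketch (the interface and type-parameter names are illustrative and not from the original slides; Hadoop's real API expresses these as Mapper/Reducer base classes with a Context object):

interface Emitter<K, V> {
    void emit(K key, V value);          // stands in for Hadoop's Context.write(key, value)
}

interface MapFn<K1, V1, K2, V2> {
    // Map: (key1, val1) -> zero or more (key2, val2) pairs
    void map(K1 key, V1 value, Emitter<K2, V2> out);
}

interface ReduceFn<K2, V2, V3> {
    // Reduce: (key2, [val2]) -> zero or more val3 results for that key
    void reduce(K2 key, Iterable<V2> values, Emitter<K2, V3> out);
}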
• The application writer specifies
  – A pair of functions called Map and Reduce and a set of input files, and submits the job
• Workflow
  – The input phase generates a number of FileSplits from the input files (one per Map task)
  – The Map phase executes a user function to transform input key/value pairs into a new set of key/value pairs
  – The framework sorts & shuffles the key/value pairs to the output nodes
  – The Reduce phase combines all key/value pairs with the same key into new key/value pairs
  – The output phase writes the resulting pairs to files
• All phases are distributed, with many tasks doing the work
  – The framework handles scheduling of tasks on the cluster
  – The framework handles recovery when a node fails
Data distribution
• Input files are split into M pieces on the distributed file system - 128 MB blocks
• Intermediate files created by map tasks are written to local disk
• Output files are written to the distributed file system
Assigning tasks
• Many copies of the user program are started
• The framework tries to exploit data locality by running map tasks on machines that hold the data
• One instance becomes the Master
• The Master finds idle machines and assigns them tasks
Execution
• Map workers read the contents of the corresponding input partition
• They perform the user-defined map computation to create intermediate <key, value> pairs
• Periodically, buffered output pairs are written to local disk
Reduce
• Reduce workers iterate over the ordered intermediate data
  – For each unique key encountered, the values are passed to the user's reduce function
  – e.g. <key, [value1, value2, ..., valueN]>
• The output of the user's reduce function is written to an output file on the global file system
• When all tasks have completed, the master wakes up the user program
• Map
• Reduce
• Combiner - combines the output of a single map task. It has the same form as the reducer, but its output is stored in a local intermediate file rather than in the final output file (see the sketch after this list).
• Debugging
  – Tasks can be tested locally using special MapReduce libraries
  – The framework offers human-readable status information on an HTTP server
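A minimal sketch of how a combiner and local debugging are typically wired up in Hadoop's Java API (an illustration only; IntSumReducer is Hadoop's stock summing reducer, and the configuration keys assume Hadoop 2.x or later):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class LocalDebugJob {
    // Builds a Job that runs map and reduce tasks in a single local JVM
    // (LocalJobRunner), which makes stepping through task code in an IDE easy.
    static Job newLocalJob() throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local"); // run tasks in-process
        conf.set("fs.defaultFS", "file:///");          // read/write the local file system
        Job job = Job.getInstance(conf, "local debug job");
        // A combiner is registered like a reducer; the framework may apply it
        // zero or more times to each map task's locally buffered output.
        job.setCombinerClass(IntSumReducer.class);
        return job;
    }
}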
WORD COUNT EXAMPLE
map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(AsString(result));
• Map()
  – Input: <filename, file text>
  – Parses the file and emits <word, count> pairs
    e.g. <"hello", 1>
• Reduce()
  – Sums all values for the same key and emits <word, TotalCount>
    e.g. <"hello", (3 5 2 7)> => <"hello", 17>
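For comparison, here is a sketch of the same logic in Hadoop's Java API (class names such as WordCountMapper are illustrative, not from the original slides; note that with TextInputFormat the map input key is the byte offset of the line rather than the file name):

// WordCountMapper.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);       // EmitIntermediate(w, "1")
        }
    }
}

// WordCountReducer.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    public void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();                 // result += ParseInt(v)
        }
        total.set(sum);
        context.write(word, total);         // emit <word, TotalCount>
    }
}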
• File
  File 1: Hello World Bye World
  File 2: Hello Hadoop Goodbye Hadoop
• Map
  For the given sample input, the first map emits:
    <Hello, 1>
    <World, 1>
    <Bye, 1>
    <World, 1>
  The second map emits:
    <Hello, 1>
    <Hadoop, 1>
    <Goodbye, 1>
    <Hadoop, 1>
• Combine
  The output of the first combine is:
    <Bye, 1>
    <Hello, 1>
    <World, 2>
  The output of the second combine is:
    <Goodbye, 1>
    <Hadoop, 2>
    <Hello, 1>
• Reduce
  Thus the output of the job (reduce) is:
    <Bye, 1>
    <Goodbye, 1>
    <Hadoop, 2>
    <Hello, 2>
    <World, 2>
Configuration
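The original configuration slide is not reproduced here; the following is a minimal driver sketch, assuming the classic org.apache.hadoop.mapreduce API and the illustrative WordCountMapper / WordCountReducer classes sketched above:

// WordCountDriver.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);  // optional local pre-aggregation
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}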
CONCLUSION
Hadoop Map-Reduce is a software framework for easily writing applications which
process vast amounts of data (multi-terabyte data-sets) in parallel on large
clusters (thousands of nodes) of commodity hardware in a reliable,
fault-tolerant manner.