狠狠撸

狠狠撸Share a Scribd company logo
MapReduce
Presented by Zilong Tan
The Data Model
● A (logical) file is a string
a1
a2
...an
,
where aj
is a substring.
The Data Model
● A (logical) file is a string
a1
a2
...an
,
where aj
is a substring.
Eg: “Hellonworld!” ? a1
= “Hello”, a2
= “world!”
“n” is a separator.
The Data Model
● A (logical) file is a string
a1
a2
...an
,
where aj
is a substring.
Eg: “Hellonworld!” ? a1
= “Hello”, a2
= “world!”
“n” is a separator.
● Q1: How to equally split a file?
○ Eg: a1
a2
...a2n
? a1
a2
...an
and an+1
an+2
...a2n
.
The Data Model
● A (logical) file is a string
a1
a2
...an
,
where aj
is a substring.
Eg: “Hellonworld!” ? a1
= “Hello”, a2
= “world!”
“n” is a separator.
● Q1: How to equally split a file?
○ Eg: a1
a2
...a2n
? a1
a2
...an
and an+1
an+2
...a2n
.
● Q2: What about splitting the file into more segments?
The Map(aj
) Function
● Map: aj
→ {(key(aj
), val(aj
))}
● key(aj
) and val(aj
) are strings.
Eg: Map(“Hello”) = (“Hello”, “1”),
Map(“Hello world”) = {(“Hello”,“1”), (“world”,“1”)},
Map(“Hello world”) = (“world”, “Hello”).
Contd.
● The input file a1
a2
...am
is organized as
Value 1 Value 2 Value 3 ...
key(a1
) val(a1
) val(a7
) val(a2
) #
key(a5
) val(an
) val(a5
) #
key(a3
) val(am
) val(a2
) val(a3
) ...
...
Contd.
● The input file a1
a2
...am
is organized as
Value 1 Value 2 Value 3 ...
key(a1
) val(a1
) val(a7
) val(a2
) #
key(a5
) val(an
) val(a5
) #
key(a3
) val(am
) val(a2
) val(a3
) ...
...
Each row
shares the
same key.
Contd.
● The input file a1
a2
...am
is organized as
Value 1 Value 2 Value 3 ...
key(a1
) val(a1
) val(a7
) val(a2
) #
key(a5
) val(an
) val(a5
) #
key(a3
) val(am
) val(a2
) val(a3
) ...
...
Mistake! a2
cannot
appear in two rows.
The Reduce(k,v1
,v2
,...,vd
) Function
● Reduce: (k,v1
,v2
,...,vd
) → v.
Eg: Reduce(“Hello”,“2”,“1”,“5”) = “Hello 8”. (WordCount)
key(s),val(s1
),val(s2
),...,val(sd
)
(a row)
Contd.
Value 1 Value 2 Value 3
“world” “2” “11” “3”
“hello” “10” “0” “5”
Contd.
Sort
Value 1 Value 2 Value 3
“hello” “0” “5” “10”
“world” “2” “3” “11”
Contd.
Value 1 Value 2 Value 3
“hello” “0” “5” “10”
“world” “2” “3” “11”
Reduce()
Reduce()
Parallel Computation
● The table we have seen is global.
● A Map node is assigned a file segment sj
sj+1
...sj+k
, and
executes Map() on each s.
● A Reduce node is associated with one or more rows of
the table, and executes Reduce() on each associated
row.
● Map() and Reduce() execute concurrently on multiple
machines.
WordCount Example
● Input: w = w1
w2
,...wk
.
● Map(w) = {(wj
,“1”)}, j = 1,2,...,k.
● Reduce(w,v1
,v2
,...,vd
) = (w, j
vj
).
Contd.
● w = “cat … dog … bird …”.
● Map(w) = {(wj
,“1”)}, j = 1,2,...,k.
● Reduce(w,v1
,v2
,...,vd
) = (w, j
vj
).
Value 1 Value 2 Value 3 ...
“cat” “1” “1” “1” ...
“dog” “1” “1” “1” ...
“bird” “1” “1” “1” ...
Contd.
● w = “cat … dog … bird …”.
● Map(w) = {(wj
,“1”)}, j = 1,2,...,k.
● Reduce(w,v1
,v2
,...,vd
) = (w, j
vj
).
Value 1 Value 2 Value 3 ...
“bird” “1” “1” “1” ...
“cat” “1” “1” “1” ...
“dog” “1” “1” “1” ...
Contd.
● w = “cat … dog … bird …”.
● Map(w) = {(wj
,“1”)}, j = 1,2,...,k.
● Reduce(w,v1
,v2
,...,vd
) = (w, j
vj
).
Value 1
“bird” “39”
“cat” “20”
“dog” “11”
The Bursting I/O Problem
● Let N be the file size.
● What would be the table size?
The Bursting I/O Problem
● Let N be the file size.
● What would be the table size?
● At least Ω(N).
○ Each word in the input file corresponds to a value in
the table.
● Too much I/O traffic!
The Combinek
(v1
,v2
,...,vd
) Function
● Goal: to reduce the table size.
● Assumptions:
Combinek
(v) = v,
Combinek
(v1
,...,vd
) = Combinek
(Combinek
(v1
,...,vd-1
),vd
),
Reduce(k,v1
,v2
,...,vd
) = Reduce(k,Combinek
(v1
,...,vd
)).
The Combinek
(v1
,v2
,...,vd
) Function
● Goal: to reduce the table size.
● Assumptions:
Combinek
(v) = v,
Combinek
(v1
,...,vd
) = Combinek
(Combinek
(v1
,...,vd-1
),vd
),
Reduce(k,v1
,v2
,...,vd
) = Reduce(k,Combinek
(v1
,...,vd
)).
● Table size reduction (m Map nodes):
Reduce(k,v1
,v2
,...,vd
) =
Reduce(k,Combinek
(v1
,...,vd/m
),Combinek
(vd/m+1
,...,v2d/m
),...).
Contd.
● Assume m map nodes:
○ Best case: each map node has a combiner.
○ Minimum possible space: ?(m).
Value 1 Value 2 Value 3 ...
“bird” “300” “351” “310” ...
“cat” “109” “1112” “207” ...
“dog” “4” “2” “3” ...
The Partition(k,M) Function
● How to assign rows to reduce nodes?
● Partition: key → node.
● Typically
Partition(k,M) = HashFunction(k) mod M.
Eg.:
Partition(“cat”, 5) = 1 % 5 = 1.
Discussion
● Data Skew Problem
○ A particular Reduce node assigned much more rows
than others.
● Binary File Support
○ What would happen if the file is a binary string?
○ Propose a solution.
● Straggler Detection
○ A Reduce node runs longer than usual.
○ Identify if it is due to a machine-related issue.

More Related Content

Similar to A Computational View of MapReduce (20)

Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Matthew Lease
?
Kotlin: maybe it's the right time
Kotlin: maybe it's the right timeKotlin: maybe it's the right time
Kotlin: maybe it's the right time
Davide Cerbo
?
Exercises+Lab.ppt - minimum node cover - exam
Exercises+Lab.ppt - minimum node cover  - examExercises+Lab.ppt - minimum node cover  - exam
Exercises+Lab.ppt - minimum node cover - exam
kaliaragorn
?
Dynamic Programming.pptx
Dynamic Programming.pptxDynamic Programming.pptx
Dynamic Programming.pptx
Thanga Ramya S
?
Sets, maps and hash tables (Java Collections)
Sets, maps and hash tables (Java Collections)Sets, maps and hash tables (Java Collections)
Sets, maps and hash tables (Java Collections)
Fulvio Corno
?
Class 31: Deanonymizing
Class 31: DeanonymizingClass 31: Deanonymizing
Class 31: Deanonymizing
David Evans
?
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Matthew Lease
?
Big Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindBig Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilind
EMC
?
Introduction to Neural Networks and Deep Learning from Scratch
Introduction to Neural Networks and Deep Learning from ScratchIntroduction to Neural Networks and Deep Learning from Scratch
Introduction to Neural Networks and Deep Learning from Scratch
Ahmed BESBES
?
Regularized Estimation of Spatial Patterns
Regularized Estimation of Spatial PatternsRegularized Estimation of Spatial Patterns
Regularized Estimation of Spatial Patterns
Wen-Ting Wang
?
jq: JSON - Like a Boss
jq: JSON - Like a Bossjq: JSON - Like a Boss
jq: JSON - Like a Boss
Bob Tiernay
?
Os Peytonjones
Os PeytonjonesOs Peytonjones
Os Peytonjones
oscon2007
?
0157a649fc5aaa7d2ff318b9b9b80d2e_MIT6_02F12_lec09.pdf
0157a649fc5aaa7d2ff318b9b9b80d2e_MIT6_02F12_lec09.pdf0157a649fc5aaa7d2ff318b9b9b80d2e_MIT6_02F12_lec09.pdf
0157a649fc5aaa7d2ff318b9b9b80d2e_MIT6_02F12_lec09.pdf
sorinmskrammer
?
Exploring ZIO Prelude: The game changer for typeclasses in Scala
Exploring ZIO Prelude: The game changer for typeclasses in ScalaExploring ZIO Prelude: The game changer for typeclasses in Scala
Exploring ZIO Prelude: The game changer for typeclasses in Scala
Jorge Vásquez
?
Scope Graphs: A fresh look at name binding in programming languages
Scope Graphs: A fresh look at name binding in programming languagesScope Graphs: A fresh look at name binding in programming languages
Scope Graphs: A fresh look at name binding in programming languages
Eelco Visser
?
ch02-mapreduce.pptx
ch02-mapreduce.pptxch02-mapreduce.pptx
ch02-mapreduce.pptx
GiannisPagges
?
Threequals - Case Equality in Ruby
Threequals - Case Equality in RubyThreequals - Case Equality in Ruby
Threequals - Case Equality in Ruby
Louis Scoras
?
Algo-Exercises-2-hash-AVL-Tree.ppt
Algo-Exercises-2-hash-AVL-Tree.pptAlgo-Exercises-2-hash-AVL-Tree.ppt
Algo-Exercises-2-hash-AVL-Tree.ppt
HebaSamy22
?
Declarative Language Definition
Declarative Language DefinitionDeclarative Language Definition
Declarative Language Definition
Eelco Visser
?
Topology Matters in Communication
Topology Matters in CommunicationTopology Matters in Communication
Topology Matters in Communication
cseiitgn
?
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Matthew Lease
?
Kotlin: maybe it's the right time
Kotlin: maybe it's the right timeKotlin: maybe it's the right time
Kotlin: maybe it's the right time
Davide Cerbo
?
Exercises+Lab.ppt - minimum node cover - exam
Exercises+Lab.ppt - minimum node cover  - examExercises+Lab.ppt - minimum node cover  - exam
Exercises+Lab.ppt - minimum node cover - exam
kaliaragorn
?
Sets, maps and hash tables (Java Collections)
Sets, maps and hash tables (Java Collections)Sets, maps and hash tables (Java Collections)
Sets, maps and hash tables (Java Collections)
Fulvio Corno
?
Class 31: Deanonymizing
Class 31: DeanonymizingClass 31: Deanonymizing
Class 31: Deanonymizing
David Evans
?
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Matthew Lease
?
Big Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindBig Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilind
EMC
?
Introduction to Neural Networks and Deep Learning from Scratch
Introduction to Neural Networks and Deep Learning from ScratchIntroduction to Neural Networks and Deep Learning from Scratch
Introduction to Neural Networks and Deep Learning from Scratch
Ahmed BESBES
?
Regularized Estimation of Spatial Patterns
Regularized Estimation of Spatial PatternsRegularized Estimation of Spatial Patterns
Regularized Estimation of Spatial Patterns
Wen-Ting Wang
?
jq: JSON - Like a Boss
jq: JSON - Like a Bossjq: JSON - Like a Boss
jq: JSON - Like a Boss
Bob Tiernay
?
0157a649fc5aaa7d2ff318b9b9b80d2e_MIT6_02F12_lec09.pdf
0157a649fc5aaa7d2ff318b9b9b80d2e_MIT6_02F12_lec09.pdf0157a649fc5aaa7d2ff318b9b9b80d2e_MIT6_02F12_lec09.pdf
0157a649fc5aaa7d2ff318b9b9b80d2e_MIT6_02F12_lec09.pdf
sorinmskrammer
?
Exploring ZIO Prelude: The game changer for typeclasses in Scala
Exploring ZIO Prelude: The game changer for typeclasses in ScalaExploring ZIO Prelude: The game changer for typeclasses in Scala
Exploring ZIO Prelude: The game changer for typeclasses in Scala
Jorge Vásquez
?
Scope Graphs: A fresh look at name binding in programming languages
Scope Graphs: A fresh look at name binding in programming languagesScope Graphs: A fresh look at name binding in programming languages
Scope Graphs: A fresh look at name binding in programming languages
Eelco Visser
?
Threequals - Case Equality in Ruby
Threequals - Case Equality in RubyThreequals - Case Equality in Ruby
Threequals - Case Equality in Ruby
Louis Scoras
?
Algo-Exercises-2-hash-AVL-Tree.ppt
Algo-Exercises-2-hash-AVL-Tree.pptAlgo-Exercises-2-hash-AVL-Tree.ppt
Algo-Exercises-2-hash-AVL-Tree.ppt
HebaSamy22
?
Declarative Language Definition
Declarative Language DefinitionDeclarative Language Definition
Declarative Language Definition
Eelco Visser
?
Topology Matters in Communication
Topology Matters in CommunicationTopology Matters in Communication
Topology Matters in Communication
cseiitgn
?

Recently uploaded (20)

Classification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptxClassification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptx
wencyjorda88
?
Customer Segmentation using K-Means clustering
Customer Segmentation using K-Means clusteringCustomer Segmentation using K-Means clustering
Customer Segmentation using K-Means clustering
Ingrid Nyakerario
?
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptxPerencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
PareaRusan
?
Modern_Distribution_Presentation.pptx Aa
Modern_Distribution_Presentation.pptx AaModern_Distribution_Presentation.pptx Aa
Modern_Distribution_Presentation.pptx Aa
MuhammadAwaisKamboh
?
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
?
Developing Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response ApplicationsDeveloping Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response Applications
VICTOR MAESTRE RAMIREZ
?
Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..
yuvarajreddy2002
?
Simple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptxSimple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptx
ssuser2aa19f
?
ISO 9001_2015 FINALaaaaaaaaaaaaaaaa - MDX - Copy.pptx
ISO 9001_2015 FINALaaaaaaaaaaaaaaaa - MDX - Copy.pptxISO 9001_2015 FINALaaaaaaaaaaaaaaaa - MDX - Copy.pptx
ISO 9001_2015 FINALaaaaaaaaaaaaaaaa - MDX - Copy.pptx
pankaj6188303
?
Deloitte - A Framework for Process Mining Projects
Deloitte - A Framework for Process Mining ProjectsDeloitte - A Framework for Process Mining Projects
Deloitte - A Framework for Process Mining Projects
Process mining Evangelist
?
C++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptxC++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptx
aquibnoor22079
?
Defense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptxDefense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptx
Greg Makowski
?
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
Simran112433
?
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docxMASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
santosh162
?
183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag
fardin123rahman07
?
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
?
4. Multivariable statistics_Using Stata_2025.pdf
4. Multivariable statistics_Using Stata_2025.pdf4. Multivariable statistics_Using Stata_2025.pdf
4. Multivariable statistics_Using Stata_2025.pdf
axonneurologycenter1
?
DPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdfDPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdf
inmishra17121973
?
GenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.aiGenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.ai
Inspirient
?
定制学历(美国笔耻谤诲耻别毕业证)普渡大学电子版毕业证
定制学历(美国笔耻谤诲耻别毕业证)普渡大学电子版毕业证定制学历(美国笔耻谤诲耻别毕业证)普渡大学电子版毕业证
定制学历(美国笔耻谤诲耻别毕业证)普渡大学电子版毕业证
Taqyea
?
Classification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptxClassification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptx
wencyjorda88
?
Customer Segmentation using K-Means clustering
Customer Segmentation using K-Means clusteringCustomer Segmentation using K-Means clustering
Customer Segmentation using K-Means clustering
Ingrid Nyakerario
?
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptxPerencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
PareaRusan
?
Modern_Distribution_Presentation.pptx Aa
Modern_Distribution_Presentation.pptx AaModern_Distribution_Presentation.pptx Aa
Modern_Distribution_Presentation.pptx Aa
MuhammadAwaisKamboh
?
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
?
Developing Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response ApplicationsDeveloping Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response Applications
VICTOR MAESTRE RAMIREZ
?
Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..
yuvarajreddy2002
?
Simple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptxSimple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptx
ssuser2aa19f
?
ISO 9001_2015 FINALaaaaaaaaaaaaaaaa - MDX - Copy.pptx
ISO 9001_2015 FINALaaaaaaaaaaaaaaaa - MDX - Copy.pptxISO 9001_2015 FINALaaaaaaaaaaaaaaaa - MDX - Copy.pptx
ISO 9001_2015 FINALaaaaaaaaaaaaaaaa - MDX - Copy.pptx
pankaj6188303
?
Deloitte - A Framework for Process Mining Projects
Deloitte - A Framework for Process Mining ProjectsDeloitte - A Framework for Process Mining Projects
Deloitte - A Framework for Process Mining Projects
Process mining Evangelist
?
C++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptxC++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptx
aquibnoor22079
?
Defense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptxDefense Against LLM Scheming 2025_04_28.pptx
Defense Against LLM Scheming 2025_04_28.pptx
Greg Makowski
?
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
1. Briefing Session_SEED with Hon. Governor Assam - 27.10.pdf
Simran112433
?
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docxMASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
santosh162
?
183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag
fardin123rahman07
?
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
?
4. Multivariable statistics_Using Stata_2025.pdf
4. Multivariable statistics_Using Stata_2025.pdf4. Multivariable statistics_Using Stata_2025.pdf
4. Multivariable statistics_Using Stata_2025.pdf
axonneurologycenter1
?
DPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdfDPR_Expert_Recruitment_notice_Revised.pdf
DPR_Expert_Recruitment_notice_Revised.pdf
inmishra17121973
?
GenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.aiGenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.ai
Inspirient
?
定制学历(美国笔耻谤诲耻别毕业证)普渡大学电子版毕业证
定制学历(美国笔耻谤诲耻别毕业证)普渡大学电子版毕业证定制学历(美国笔耻谤诲耻别毕业证)普渡大学电子版毕业证
定制学历(美国笔耻谤诲耻别毕业证)普渡大学电子版毕业证
Taqyea
?

A Computational View of MapReduce

  • 2. The Data Model ● A (logical) file is a string a1 a2 ...an , where aj is a substring.
  • 3. The Data Model ● A (logical) file is a string a1 a2 ...an , where aj is a substring. Eg: “Hellonworld!” ? a1 = “Hello”, a2 = “world!” “n” is a separator.
  • 4. The Data Model ● A (logical) file is a string a1 a2 ...an , where aj is a substring. Eg: “Hellonworld!” ? a1 = “Hello”, a2 = “world!” “n” is a separator. ● Q1: How to equally split a file? ○ Eg: a1 a2 ...a2n ? a1 a2 ...an and an+1 an+2 ...a2n .
  • 5. The Data Model ● A (logical) file is a string a1 a2 ...an , where aj is a substring. Eg: “Hellonworld!” ? a1 = “Hello”, a2 = “world!” “n” is a separator. ● Q1: How to equally split a file? ○ Eg: a1 a2 ...a2n ? a1 a2 ...an and an+1 an+2 ...a2n . ● Q2: What about splitting the file into more segments?
  • 6. The Map(aj ) Function ● Map: aj → {(key(aj ), val(aj ))} ● key(aj ) and val(aj ) are strings. Eg: Map(“Hello”) = (“Hello”, “1”), Map(“Hello world”) = {(“Hello”,“1”), (“world”,“1”)}, Map(“Hello world”) = (“world”, “Hello”).
  • 7. Contd. ● The input file a1 a2 ...am is organized as Value 1 Value 2 Value 3 ... key(a1 ) val(a1 ) val(a7 ) val(a2 ) # key(a5 ) val(an ) val(a5 ) # key(a3 ) val(am ) val(a2 ) val(a3 ) ... ...
  • 8. Contd. ● The input file a1 a2 ...am is organized as Value 1 Value 2 Value 3 ... key(a1 ) val(a1 ) val(a7 ) val(a2 ) # key(a5 ) val(an ) val(a5 ) # key(a3 ) val(am ) val(a2 ) val(a3 ) ... ... Each row shares the same key.
  • 9. Contd. ● The input file a1 a2 ...am is organized as Value 1 Value 2 Value 3 ... key(a1 ) val(a1 ) val(a7 ) val(a2 ) # key(a5 ) val(an ) val(a5 ) # key(a3 ) val(am ) val(a2 ) val(a3 ) ... ... Mistake! a2 cannot appear in two rows.
  • 10. The Reduce(k,v1 ,v2 ,...,vd ) Function ● Reduce: (k,v1 ,v2 ,...,vd ) → v. Eg: Reduce(“Hello”,“2”,“1”,“5”) = “Hello 8”. (WordCount) key(s),val(s1 ),val(s2 ),...,val(sd ) (a row)
  • 11. Contd. Value 1 Value 2 Value 3 “world” “2” “11” “3” “hello” “10” “0” “5”
  • 12. Contd. Sort Value 1 Value 2 Value 3 “hello” “0” “5” “10” “world” “2” “3” “11”
  • 13. Contd. Value 1 Value 2 Value 3 “hello” “0” “5” “10” “world” “2” “3” “11” Reduce() Reduce()
  • 14. Parallel Computation ● The table we have seen is global. ● A Map node is assigned a file segment sj sj+1 ...sj+k , and executes Map() on each s. ● A Reduce node is associated with one or more rows of the table, and executes Reduce() on each associated row. ● Map() and Reduce() execute concurrently on multiple machines.
  • 15. WordCount Example ● Input: w = w1 w2 ,...wk . ● Map(w) = {(wj ,“1”)}, j = 1,2,...,k. ● Reduce(w,v1 ,v2 ,...,vd ) = (w, j vj ).
  • 16. Contd. ● w = “cat … dog … bird …”. ● Map(w) = {(wj ,“1”)}, j = 1,2,...,k. ● Reduce(w,v1 ,v2 ,...,vd ) = (w, j vj ). Value 1 Value 2 Value 3 ... “cat” “1” “1” “1” ... “dog” “1” “1” “1” ... “bird” “1” “1” “1” ...
  • 17. Contd. ● w = “cat … dog … bird …”. ● Map(w) = {(wj ,“1”)}, j = 1,2,...,k. ● Reduce(w,v1 ,v2 ,...,vd ) = (w, j vj ). Value 1 Value 2 Value 3 ... “bird” “1” “1” “1” ... “cat” “1” “1” “1” ... “dog” “1” “1” “1” ...
  • 18. Contd. ● w = “cat … dog … bird …”. ● Map(w) = {(wj ,“1”)}, j = 1,2,...,k. ● Reduce(w,v1 ,v2 ,...,vd ) = (w, j vj ). Value 1 “bird” “39” “cat” “20” “dog” “11”
  • 19. The Bursting I/O Problem ● Let N be the file size. ● What would be the table size?
  • 20. The Bursting I/O Problem ● Let N be the file size. ● What would be the table size? ● At least Ω(N). ○ Each word in the input file corresponds to a value in the table. ● Too much I/O traffic!
  • 21. The Combinek (v1 ,v2 ,...,vd ) Function ● Goal: to reduce the table size. ● Assumptions: Combinek (v) = v, Combinek (v1 ,...,vd ) = Combinek (Combinek (v1 ,...,vd-1 ),vd ), Reduce(k,v1 ,v2 ,...,vd ) = Reduce(k,Combinek (v1 ,...,vd )).
  • 22. The Combinek (v1 ,v2 ,...,vd ) Function ● Goal: to reduce the table size. ● Assumptions: Combinek (v) = v, Combinek (v1 ,...,vd ) = Combinek (Combinek (v1 ,...,vd-1 ),vd ), Reduce(k,v1 ,v2 ,...,vd ) = Reduce(k,Combinek (v1 ,...,vd )). ● Table size reduction (m Map nodes): Reduce(k,v1 ,v2 ,...,vd ) = Reduce(k,Combinek (v1 ,...,vd/m ),Combinek (vd/m+1 ,...,v2d/m ),...).
  • 23. Contd. ● Assume m map nodes: ○ Best case: each map node has a combiner. ○ Minimum possible space: ?(m). Value 1 Value 2 Value 3 ... “bird” “300” “351” “310” ... “cat” “109” “1112” “207” ... “dog” “4” “2” “3” ...
  • 24. The Partition(k,M) Function ● How to assign rows to reduce nodes? ● Partition: key → node. ● Typically Partition(k,M) = HashFunction(k) mod M. Eg.: Partition(“cat”, 5) = 1 % 5 = 1.
  • 25. Discussion ● Data Skew Problem ○ A particular Reduce node assigned much more rows than others. ● Binary File Support ○ What would happen if the file is a binary string? ○ Propose a solution. ● Straggler Detection ○ A Reduce node runs longer than usual. ○ Identify if it is due to a machine-related issue.