HDFS, Map Reduce & Hadoop 1.0 Vs 2.0 Overview
HDFS Architecture
• HDFS stands for Hadoop Distributed File System.
• HDFS was originally built as infrastructure for the Apache Nutch web search engine project.
• HDFS is now an Apache Hadoop subproject.
• A typical file in HDFS is gigabytes to terabytes in size.
• HDFS applications need a write-once-read-many access model for files. This assumption simplifies data coherency issues and enables high-throughput data access.
• HDFS has a master/slave architecture: NameNode/DataNode.
• An HDFS cluster consists of a single NameNode and a number of DataNodes.
• The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS).
• DataNodes, usually one per node in the cluster, manage the storage attached to the nodes that they run on. They are responsible for serving read and write requests from the file system's clients. They also perform block creation, deletion, and replication upon instruction from the NameNode.
• HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes.
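As a rough illustration of how a file is split into blocks and spread over DataNodes, here is a minimal Python sketch. The function names, the 128 MB block size, the replication factor of 3, and the round-robin placement are all assumptions made for this example; a real NameNode uses its own configurable settings and placement policy.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (configurable in a real cluster)
REPLICATION = 3                 # assumed replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # the last block may be shorter
        blocks.append((offset, length))
        offset += length
    return blocks


def place_blocks(blocks, datanodes, replication=REPLICATION):
    """Map each block index to `replication` distinct DataNodes.

    Round-robin placement is a simplification chosen for this sketch.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    placement = {}
    for index in range(len(blocks)):
        placement[index] = [
            datanodes[(index + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file gives 3 blocks
    for index, nodes in place_blocks(blocks, ["dn1", "dn2", "dn3", "dn4"]).items():
        print(index, blocks[index], nodes)
```

Only the last block of a file is shorter than the block size, so a file slightly larger than one block still occupies two blocks in this model.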
NameNode
The NameNode holds the metadata for HDFS, such as namespace information and block information. While the NameNode is running, all of this information is kept in main memory. It is also stored on disk for persistence.
The above image shows how the NameNode stores information on disk.
The two different files are:
• fsimage: a snapshot of the file system taken when the NameNode started.
• Edit logs: the sequence of changes made to the file system after the NameNode started.
Edit logs are applied to the fsimage only when the NameNode restarts, to produce the latest snapshot of the file system. But NameNode restarts are rare in production clusters, which means the edit logs can grow very large on clusters where the NameNode runs for a long time. In this situation we encounter the following issues:
• The edit log becomes very large, which makes it challenging to manage.
• A NameNode restart takes a long time because a lot of changes have to be merged.
• In the case of a crash, we will lose a huge amount of metadata since the fsimage is very old.
To overcome these issues, we need a mechanism that keeps the edit log at a manageable size and the fsimage up to date, so that the load on the NameNode is reduced. It is very similar to a Windows restore point, which lets us take a snapshot of the OS so that if something goes wrong, we can fall back to the last restore point.
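To make the relationship between the fsimage and the edit logs concrete, here is a minimal Python sketch of what happens on a NameNode restart: the saved snapshot is loaded and every logged change is replayed on top of it. The dictionary-based namespace, the operation names (create, delete, rename), and the function names are hypothetical and chosen only for illustration; they are not the on-disk formats HDFS uses.

```python
def apply_edit(namespace, edit):
    """Apply one logged change to an in-memory namespace (path -> block ids)."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = edit[2]            # edit[2] is the list of block ids
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "rename":
        namespace[edit[2]] = namespace.pop(path)  # edit[2] is the new path
    else:
        raise ValueError(f"unknown operation: {op}")


def load_namespace(fsimage, edit_log):
    """Rebuild the latest namespace: start from the snapshot, replay the log."""
    namespace = dict(fsimage)  # copy so the saved snapshot is left untouched
    for edit in edit_log:
        apply_edit(namespace, edit)
    return namespace


if __name__ == "__main__":
    fsimage = {"/logs/a.txt": [1, 2]}
    edit_log = [
        ("create", "/logs/b.txt", [3]),
        ("rename", "/logs/a.txt", "/archive/a.txt"),
        ("delete", "/logs/b.txt"),
    ]
    print(load_namespace(fsimage, edit_log))
```

The replay cost grows with the length of the edit log, which is why a long-running NameNode with an old fsimage takes a long time to restart.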
So now we understand the NameNode's functionality and the challenge of keeping the metadata up to date. So what does all this have to do with the Secondary NameNode?
Secondary NameNode
The Secondary NameNode helps to overcome the above issues by taking over from the NameNode the responsibility of merging the edit logs with the fsimage.
The above figure shows the working of the Secondary NameNode:
• It gets the edit logs from the NameNode at regular intervals and applies them to the fsimage.
• Once it has the new fsimage, it copies it back to the NameNode.
• The NameNode will use this fsimage for the next restart, which will reduce the start-up time.
The Secondary NameNode's whole purpose is to create checkpoints in HDFS. It is just a helper node for the NameNode. That is why it is also known as the checkpoint node within the community.
So we now understand that all the Secondary NameNode does is put a checkpoint in the file system, which helps the NameNode function better. It is not a replacement or backup for the NameNode. So from now on, make a habit of calling it the checkpoint node.
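The checkpoint cycle described in this section can be sketched in a few lines of Python. This is a toy model under stated assumptions: the class and method names (roll_edits, install_fsimage, checkpoint) are hypothetical, the namespace is reduced to a set of paths, and the copying of files between machines is replaced by plain method calls.

```python
def merge(fsimage, edits):
    """Return a new image with the logged edits applied to the old one."""
    image = set(fsimage)
    for op, path in edits:
        if op == "add":
            image.add(path)
        elif op == "remove":
            image.discard(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return image


class NameNode:
    def __init__(self):
        self.fsimage = set()   # last saved snapshot of the namespace
        self.edit_log = []     # changes made since that snapshot

    def log_edit(self, op, path):
        self.edit_log.append((op, path))

    def roll_edits(self):
        """Hand over the accumulated edits and start a fresh, empty log."""
        edits, self.edit_log = self.edit_log, []
        return edits

    def install_fsimage(self, image):
        self.fsimage = image


class SecondaryNameNode:
    def checkpoint(self, namenode):
        """Fetch the edits, merge them into the image, copy the image back."""
        edits = namenode.roll_edits()
        namenode.install_fsimage(merge(namenode.fsimage, edits))
        return len(edits)  # number of edits folded into the new image


if __name__ == "__main__":
    nn = NameNode()
    nn.log_edit("add", "/a")
    nn.log_edit("add", "/b")
    nn.log_edit("remove", "/a")
    merged = SecondaryNameNode().checkpoint(nn)
    print(merged, sorted(nn.fsimage), nn.edit_log)
```

After a checkpoint the edit log is short again, so a later restart only has to replay the few edits made since the last checkpoint.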
MapReduce
MapReduce is a framework with which we can write applications that process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.
MapReduce is a processing technique and a programming model for distributed computing based on Java.
The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The Reduce task then takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.
A MapReduce program executes in three stages, namely the map stage, shuffle stage, and reduce stage.
• Map stage: The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
• Reduce stage: This stage is the combination of the shuffle stage and the reduce stage. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
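The three stages can be illustrated with a small, single-process word count in Python. This is only a sketch of the programming model, not Hadoop's Java API: the function names (map_fn, shuffle, reduce_fn, run_job) are made up for the example, and everything runs in memory on one machine instead of being distributed across a cluster.

```python
from collections import defaultdict


def map_fn(line):
    """Map stage: turn one input line into (word, 1) key/value pairs."""
    for word in line.split():
        yield (word.lower(), 1)


def shuffle(pairs):
    """Shuffle stage: group all values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_fn(key, values):
    """Reduce stage: combine the values of one key into a single tuple."""
    return (key, sum(values))


def run_job(lines):
    """Feed the input line by line to the mapper, then shuffle, then reduce."""
    mapped = [pair for line in lines for pair in map_fn(line)]
    grouped = shuffle(mapped)
    return [reduce_fn(key, values) for key, values in grouped.items()]


if __name__ == "__main__":
    for word, count in run_job(["the quick fox", "The lazy dog", "the end"]):
        print(word, count)
```

The map output here has seven tuples and the reduce output has six, which matches the description of reduce combining tuples into a smaller set.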
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster. The framework manages all the details of data passing, such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes. Most of the computing takes place on nodes that hold the data on local disks, which reduces network traffic. After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.
Timelines
Year | Month | Event
2003 | October | Google File System paper released
2006 | January | Hadoop is born from Nutch 197
2006 | February | Hadoop is named after Cutting's son's yellow plush toy
2006 | April | Hadoop 0.1.0 released
2006 | April | Hadoop sorts 1.8 TB on 188 nodes in 47.9 hours
2008 | March | First Hadoop Summit
2008 | April | Hadoop world record: fastest system to sort a terabyte of data. Running on a 910-node cluster, Hadoop sorted one terabyte in 209 seconds
2008 | May | Hadoop wins TeraByte Sort (World Record sortbenchmark.org)
2008 | July | Hadoop wins Terabyte Sort Benchmark
2008 | November | Google MapReduce implementation sorted one terabyte in 68 seconds
2009 | May | Yahoo! used Hadoop to sort one terabyte in 62 seconds
2012 | November | Apache Hadoop 1.0 Available
Hadoop 1 vs Hadoop 2
S. No. | Hadoop 1 | Hadoop 2
2 | MR does both processing and cluster resource management. | YARN (Yet Another Resource Negotiator) does cluster resource management, and processing is done using different processing models.
3 | Has limited scaling of nodes: limited to 4000 nodes per cluster. | Has better scalability: scalable up to 10000 nodes per cluster.
4 | Works on the concept of slots; a slot can run either a Map task or a Reduce task only. | Works on the concept of containers; containers can run generic tasks.
5 | A single NameNode manages the entire namespace. | Multiple NameNode servers manage multiple namespaces.
6 | Has a single point of failure (SPOF) because of the single NameNode; in the case of NameNode failure, manual intervention is needed to recover. | Has a feature to overcome the SPOF with a standby NameNode; in the case of NameNode failure, it is configured for automatic recovery.
7 | MR API is compatible with Hadoop 1.x. A program written in Hadoop 1 executes in Hadoop 1.x without any additional files. | MR API requires additional files for a program written in Hadoop 1.x to execute in Hadoop 2.x.
9 | A NameNode failure affects the stack. | The Hadoop stack (Hive, Pig, HBase, etc.) is equipped to handle NameNode failure.
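The difference between slots and containers in row 4 can be shown with a small back-of-the-envelope model in Python. This is a deliberately simplified sketch: the function names are hypothetical, every task is assumed to need exactly one slot or one container of identical size, and real schedulers take memory, CPU, and queue policies into account.

```python
def runnable_with_slots(pending_map, pending_reduce, map_slots, reduce_slots):
    """Hadoop 1 style: a map slot runs only map tasks, a reduce slot only reduce tasks."""
    return min(pending_map, map_slots) + min(pending_reduce, reduce_slots)


def runnable_with_containers(pending_map, pending_reduce, containers):
    """Hadoop 2 style: a container is generic and can run either kind of task."""
    return min(pending_map + pending_reduce, containers)


if __name__ == "__main__":
    # A node with 4 map slots and 4 reduce slots, versus 8 generic containers,
    # while 10 map tasks and no reduce tasks are waiting.
    print(runnable_with_slots(10, 0, map_slots=4, reduce_slots=4))
    print(runnable_with_containers(10, 0, containers=8))
```

In this model the reduce slots sit idle during a map-heavy phase, while generic containers can all be used for whatever work is waiting.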
