Hadoop Technologies
Kannappan S

Hadoop Technologies
  Google                Open Source
  Google File System    HDFS
  MapReduce             Hadoop MapReduce
  Sawzall               Pig / Hive
  BigTable              HBase
  Open source communities, Yahoo, Facebook, Cloudera, Twitter and LinkedIn

Other Players
  Amazon
  • File system: Amazon S3
  • Instances: Amazon EC2
  • Cluster platform: Hadoop
  Microsoft
  • Dryad (distributed runtime)
  • DryadLINQ (high-level language)

HDFS
  • Storage: large files are stored across multiple machines
  • Reliability: data is replicated across multiple hosts
  • Replication: the default replication value is 3. Data is stored on three nodes: two on the same rack, and one on a different rack

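The placement rule on this slide can be illustrated with a short Python sketch. It is not HDFS code: `place_replicas`, `rack_of`, and the topology dictionary are hypothetical names invented here. It assumes the commonly described default policy, where the first replica goes on the writer's node and the other two go on two different nodes of one other rack.

```python
import random


def rack_of(topology, node):
    """Return the name of the rack that contains the given node."""
    for rack, nodes in topology.items():
        if node in nodes:
            return rack
    raise KeyError(f"unknown node: {node}")


def place_replicas(topology, writer, rng=None):
    """Choose three nodes for one block (hypothetical helper, not an HDFS API).

    topology maps a rack name to the list of node names in that rack.
    The first replica stays on the writer's node; the second and third
    go to two distinct nodes on one other rack.
    """
    rng = rng or random.Random()
    local_rack = rack_of(topology, writer)
    # Only racks other than the writer's, with room for two replicas.
    candidates = [
        rack for rack, nodes in topology.items()
        if rack != local_rack and len(nodes) >= 2
    ]
    if not candidates:
        raise ValueError("need another rack with at least two nodes")
    remote_rack = rng.choice(candidates)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer, second, third]


topology = {
    "rack1": ["n1", "n2"],
    "rack2": ["n3", "n4"],
    "rack3": ["n5", "n6"],
}
replicas = place_replicas(topology, "n1", random.Random(0))
```

Whichever remote rack is chosen, the result spans exactly two racks, so losing a single rack still leaves at least one copy of the block.
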
MapReduce
  • Framework inspired by the map and reduce functional constructs in LISP
  • Classic paper: http://labs.google.com/papers/mapreduce.html
  • Hadoop MapReduce
  • Pig / Hive

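To make the map and reduce idea concrete, here is a minimal single-process word count in Python. It only imitates the programming model (map, group by key, reduce) and does not use the Hadoop API. The function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are made up for this sketch.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: fold the values for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(documents):
    """Run map, shuffle, and reduce over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["a b a", "b c"])
```

In a real cluster the map calls run in parallel on different machines and the grouping step moves data across the network. The sketch keeps everything in one process so the data flow is easy to follow.
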
Apache Pig
  • Pig Latin: high-level language
  • Pig Engine: compiles Pig code to MapReduce
  • Nested data model (Atom, Tuple, Bag)
  • UDFs are first-class citizens

Apache Pig
  good_urls = FILTER urls BY pagerank > 0.2;
  groups = GROUP good_urls BY category;
  big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
  output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

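For readers new to Pig Latin, the same dataflow can be sketched in plain Python: filter rows, group them by category, keep only the large groups, then average. This only illustrates what the script computes, not how Pig runs it. The function `top_categories` and its parameters are hypothetical, and the group-size threshold is passed in explicitly rather than fixed.

```python
from collections import defaultdict


def top_categories(urls, min_pagerank, min_group_size):
    """Average pagerank per category, mirroring the four Pig statements.

    urls is a list of dicts with "category" and "pagerank" keys.
    """
    # FILTER urls BY pagerank > min_pagerank
    good_urls = [u for u in urls if u["pagerank"] > min_pagerank]

    # GROUP good_urls BY category
    groups = defaultdict(list)
    for u in good_urls:
        groups[u["category"]].append(u["pagerank"])

    # FILTER groups BY COUNT(good_urls) > min_group_size, then
    # FOREACH big group GENERATE category, AVG(pagerank)
    return {
        category: sum(ranks) / len(ranks)
        for category, ranks in groups.items()
        if len(ranks) > min_group_size
    }


sample = [
    {"category": "news", "pagerank": 0.5},
    {"category": "news", "pagerank": 0.3},
    {"category": "news", "pagerank": 0.1},
    {"category": "blog", "pagerank": 0.9},
]
result = top_categories(sample, min_pagerank=0.2, min_group_size=1)
```

With the small sample above, only "news" has more than one URL above the pagerank cutoff, so it is the only category returned.
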
Flume Log Collection
  • Flume is a distributed, reliable, and available service for efficiently moving large amounts of data as the data is produced.
  • MapQuest log processing example: hundreds of production servers, 5 log-processing machines, 1 Netezza data warehouse

Flume Architecture
  [Figure: Flume architecture diagram]

Other Technologies
  • Oozie: Yahoo!'s workflow engine for Hadoop. An open-source workflow solution to manage and coordinate jobs running on Hadoop, including HDFS, Pig and MapReduce.
  • ZooKeeper: allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers (znodes), much like a file system.
  • HBase: an open-source, non-relational, distributed database modeled after Google's BigTable; it runs on top of HDFS. It provides a fault-tolerant way of storing large quantities of sparse data. Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop. (Powerset)
  • Hive: SQL-like interface (Jeff Hammerbacher)

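The "hierarchical namespace of data registers" idea can be pictured with a toy in-memory tree in Python. This is not the ZooKeeper client API: the class `ZNodeTree` and its methods are invented for this sketch, and it leaves out everything that makes ZooKeeper useful in practice (replication, ordering, watches, ephemeral nodes).

```python
class ZNodeTree:
    """Toy stand-in for a file-system-like namespace of small data registers."""

    def __init__(self):
        # Map each absolute path to the bytes stored at that node.
        self._data = {"/": b""}

    def create(self, path, data=b""):
        """Create a node; its parent must already exist."""
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self._data:
            raise KeyError(f"parent does not exist: {parent}")
        if path in self._data:
            raise KeyError(f"node already exists: {path}")
        self._data[path] = data

    def get(self, path):
        """Return the bytes stored at a node."""
        return self._data[path]

    def set(self, path, data):
        """Overwrite the bytes stored at an existing node."""
        if path not in self._data:
            raise KeyError(f"no such node: {path}")
        self._data[path] = data

    def children(self, path):
        """List the names of the direct children of a node."""
        prefix = path.rstrip("/") + "/"
        names = []
        for candidate in self._data:
            rest = candidate[len(prefix):]
            if candidate.startswith(prefix) and rest and "/" not in rest:
                names.append(rest)
        return sorted(names)


tree = ZNodeTree()
tree.create("/app")
tree.create("/app/leader", b"host1")
tree.create("/app/workers")
```

Processes that agree on a path such as `/app/leader` can read and update the same small piece of state, which is the coordination pattern the slide describes.
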

