Big Data
Presented by : SHIVAM SHUKLA
Contents
 What is Big Data?
 History
 Three Vs
 Why is Big Data important?
 Technologies related to Big Data
Hadoop
Why Hadoop?
Hbase
Why Hbase?
Some features of Hbase
Hive
About
Points to remember
Sqoop
Working
Difference
What is Big Data?
 Big data is a term that describes the large volume of data :
a) Structured
b) Unstructured
c) Semi-structured
 that inundates a business on a day-to-day basis.
 But it's not the amount of data that's important; it's what
organizations do with the data that matters.
History
 While the term "big data" is relatively new, the act of gathering and
storing large amounts of information for eventual analysis is ages
old.
 The concept gained momentum in the early 2000s, when industry
analyst Doug Laney articulated the now-mainstream definition of
big data as the three Vs:
Volume
Velocity
Variety
Three Vs :
 Volume
Refers to the huge amount of data that is produced each day by
organizations around the world.
 Velocity
Refers to the speed at which data is generated, analyzed, and
reprocessed.
 Variety
Refers to the diversity of data and data sources.
Additional Vs
Over time, new Vs of big data have been introduced:
 Validity
Refers to the guarantee of data quality; a closely related V,
Veracity, is the authenticity and credibility of the data.
 Value
Denotes the added value for companies. Many companies have
recently established their own data platforms, filled their data pools,
and invested a lot of money in infrastructure. It is now a question of
generating business value from those investments.
Why is Big Data important ?
 The importance of big data doesn't revolve around how much data
you have, but what you do with it.
 You can take data from any source and analyze it to find answers
that enable
Cost reduction
Time reduction
Smart decision making
Some Technologies related to Big Data
 Hadoop framework
 HBase
 Hive
 Sqoop
Hadoop
 Hadoop was developed by Doug Cutting and Mike Cafarella.
 Hadoop is an Apache open-source framework designed for
Managing the data
Processing the data
Analyzing the data
Storing the data
 Hadoop is written in Java. It is not an OLAP (online analytical
processing) system; it is used for offline (batch) processing.
 The logo for Hadoop is a YELLOW ELEPHANT.
Why Hadoop ?
 Fast :
 In HDFS, data is distributed over the cluster and mapped,
which helps in faster retrieval.
 Scalable :
 A Hadoop cluster can be extended by simply adding nodes to the
cluster.
 Cost Effective :
 Hadoop is open source and uses commodity hardware to store
data, so it is really cost-effective compared to a traditional
relational database management system.
 Resilient to failure :
 HDFS can replicate data over the
network, so if one node is down or some other network failure
happens, Hadoop takes another copy of the data and uses it.
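The MapReduce model that Hadoop uses for processing can be sketched in plain Python. This is an illustrative toy, not Hadoop's actual API: a mapper emits (key, value) pairs, the pairs are grouped by key (the "shuffle"), and a reducer aggregates each group. Word count is the classic example.

```python
# Toy sketch of the MapReduce model (not Hadoop's real API).
from collections import defaultdict

def mapper(line):
    # Map phase: emit (word, 1) for every word in a line of input.
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce phase: sum all counts emitted for one key.
    return word, sum(counts)

def map_reduce(lines):
    groups = defaultdict(list)          # shuffle: group values by key
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

result = map_reduce(["Big data needs Hadoop", "Hadoop stores big data"])
```

In real Hadoop the map and reduce phases run in parallel on different nodes of the cluster, with HDFS holding the input splits and the intermediate data.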
HBase
 HBase is an open-source framework provided by Apache. It is a
sorted-map data store built on top of Hadoop.
 It is column-oriented and horizontally scalable.
 It has a set of tables which keep data in key-value format.
 It is a type of database designed mainly for managing
unstructured data.
 The logo for Apache HBase is an ORCA (killer whale).
Why Hbase?
 An RDBMS gets exponentially slower as the data becomes large.
 It expects data to be highly structured, i.e. able to fit in a well-
defined schema.
 Any change in schema might require downtime.
 For sparse datasets, there is too much overhead in maintaining NULL
values.
Some features of
Hbase
 Horizontally scalable: you can add any number of columns anytime.
 Often referred to as a key-value store or column-family-oriented
database, or as storing versioned maps of maps.
 Fundamentally, it's a platform for storing and retrieving data with
random access.
 It doesn't care about data types (you can store an integer in one row
and a string in another for the same column).
 There is only one data type: the byte array.
 It doesn't enforce relationships within your data.
 It is designed to run on a cluster of computers.
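The "versioned maps of maps" idea can be sketched with nested Python dicts. This is a toy model, not the HBase client API; the row keys, the `info` column family, and the helper names are made up for illustration: row key → column (family:qualifier) → {timestamp: value}, with every value a plain byte string, mirroring HBase's single byte-array type.

```python
# Toy sketch of HBase's data model (not the real HBase API):
#   table -> row key -> "family:qualifier" -> {timestamp: value}
table = {}

def put(row, column, value, ts):
    # Writing never overwrites: each write adds a new timestamped version.
    table.setdefault(row, {}).setdefault(column, {})[ts] = value

def get(row, column):
    # Reads return the latest version of the cell by default.
    versions = table[row][column]
    return versions[max(versions)]

put(b"row1", b"info:name", b"shivam", ts=1)
put(b"row1", b"info:name", b"shukla", ts=2)   # newer version of the same cell
# Sparse by design: row2 simply omits info:name instead of storing a NULL.
put(b"row2", b"info:city", b"delhi", ts=1)
```

Note how the sparse-NULL problem from the previous slide disappears: a missing column costs nothing, because absent cells are simply not stored.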
Hive
 Hive is a data warehouse infrastructure tool to process structured
data in Hadoop.
 It runs SQL-like queries, called HQL (Hive Query Language), which
get internally converted to MapReduce jobs.
 Hive was initially developed by Facebook; later the Apache
Software Foundation took it up and developed it further as an open-
source project under the name Apache Hive.
 Hive supports Data Definition Language (DDL), Data Manipulation
Language (DML), and user-defined functions.
 The logo for Hive is a yellow-and-black bee with an elephant's head.
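The key idea — that an HQL aggregation gets compiled down to a map step and a reduce step — can be sketched in Python. This is purely illustrative (Hive's real query planner is far more elaborate), and the `employees` table and `dept` column are made-up names. The query being mimicked is `SELECT dept, COUNT(*) FROM employees GROUP BY dept`.

```python
# Illustrative sketch of how a HiveQL GROUP BY compiles to MapReduce:
#   map phase emits (dept, 1); shuffle sorts by key; reduce sums per key.
from itertools import groupby

rows = [("alice", "eng"), ("bob", "eng"), ("carol", "sales")]  # toy table

mapped = sorted((dept, 1) for _name, dept in rows)        # map + sort (shuffle)
counts = {dept: sum(n for _, n in group)                  # reduce per key
          for dept, group in groupby(mapped, key=lambda kv: kv[0])}
```

This also explains the "Hive is not" slide that follows: because every query becomes a batch MapReduce job, even tiny queries pay a job-startup cost that an RDBMS does not.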
Hive is not :
 A relational database
 Designed for Online Transaction Processing (OLTP)
 A language for real-time queries and row-level updates
 Even with a small amount of data, its response time can't be
compared to an RDBMS.
Points to remember about
hive
 Hive Query Language is similar to SQL and gets compiled to
MapReduce jobs in the backend.
 Hive's default metastore database is Derby.
 It is not a NoSQL database; it is a SQL-on-Hadoop warehousing layer.
 It provides a SQL-type language for querying called HiveQL or HQL.
 It is designed for OLAP (online analytical processing).
Sqoop
 Sqoop is a tool designed to transfer data between Hadoop and
relational database servers.
 It is used to import data from relational databases such as MySQL and
Oracle into Hadoop HDFS, and to export data from the Hadoop file
system back to relational databases.
 It is provided by the Apache Software Foundation.
 Sqoop = SQL to Hadoop and Hadoop to SQL.
Working of Sqoop
Difference
Sqoop Import
 The import tool imports
individual tables from an
RDBMS into HDFS.
 Each row in a table is treated
as a record in HDFS.
 All records are stored as text
data in text files or as binary
data in Avro and Sequence
files.
Sqoop Export
 The export tool exports a set of
files from HDFS back to an
RDBMS.
 The files given as input to
Sqoop contain records, which
are called rows in the table.
 Those are read and parsed into
a set of records, delimited
with a user-specified delimiter.
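The import/export flow above can be mimicked in a few lines of Python, using the standard-library sqlite3 module as a stand-in RDBMS. This is a toy sketch, not the real sqoop CLI; the `users` table, its columns, and the comma delimiter are all illustrative choices.

```python
# Toy sketch of Sqoop-style import/export (sqlite3 stands in for the RDBMS).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

# "Import": each table row becomes one delimited text record,
# like the text files Sqoop writes into HDFS.
records = [",".join(str(v) for v in row)
           for row in conn.execute("SELECT id, name FROM users")]

# "Export": parse the delimited records back into rows and
# insert them into a (new) relational table.
conn.execute("CREATE TABLE users_copy (id INTEGER, name TEXT)")
for rec in records:
    id_, name = rec.split(",")        # split on the user-specified delimiter
    conn.execute("INSERT INTO users_copy VALUES (?, ?)", (int(id_), name))
exported = list(conn.execute("SELECT id, name FROM users_copy ORDER BY id"))
```

The real tool does the same row-to-record and record-to-row translation, but in parallel map tasks across the cluster.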
Thank you
Any queries?