This document provides an introduction to Hadoop and big data. It defines big data as large amounts of data from a variety of structured, semi-structured, and unstructured sources that is difficult to store, analyze, and visualize due to its volume, velocity, and variety. Hadoop is introduced as an open source framework for distributed processing and storage of large datasets across clusters of commodity hardware. Key Hadoop components like HDFS, MapReduce, YARN and daemons like NameNode, DataNode, ResourceManager and NodeManager are described. Modes of operation for Hadoop including standalone, pseudo-distributed and fully distributed are also outlined.
3. What is Big Data?
A large amount of data.
It is a popular term used to describe the exponential growth of data.
Big data is difficult to collect, store, maintain, analyze, and visualize.
4/11/2017Footer Text 3
4. Big Data characteristics
Volume:
A large amount of data.
Velocity:
The rate at which data is generated.
Variety:
Different types of data
- Structured data, e.g. MySQL tables
- Semi-structured data, e.g. XML, JSON
- Unstructured data, e.g. text, audio, video
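As a small illustration of the three varieties, the sketch below handles one made-up sample of each kind using only the Python standard library. The records, field names, and values are invented for illustration and do not come from the slides.

```python
import json
import xml.etree.ElementTree as ET

# Structured: every record has the same fixed columns, like a MySQL table row.
structured_row = {"id": 1, "name": "Asha", "balance": 2500.0}

# Semi-structured: self-describing, but fields may vary from record to record.
json_record = json.loads('{"id": 2, "name": "Ravi", "tags": ["new", "mobile"]}')
xml_record = ET.fromstring('<user id="3"><name>Meera</name></user>')

# Unstructured: free text with no schema; even a word count needs custom logic.
unstructured_text = "Loved the new phone, battery could be better."
word_count = len(unstructured_text.split())

print(structured_row["balance"])
print(json_record["tags"], xml_record.find("name").text)
print(word_count)
```

The point is only that each variety needs a different access pattern: column lookup, parsing, and ad hoc text processing respectively.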
5. Big Data sources
Social Media
Banks
Instruments
Websites
Stock Market
6. Use cases of Big Data
Recommendation engines
Analyzing Call Detail Records (CDRs)
Fraud Detection
Market Basket Analysis
Sentiment Analysis
7. Hadoop Introduction
Open source framework that allows distributed processing of large datasets on a cluster of commodity hardware.
Hadoop is a data management tool that uses scale-out storage.
8. Defining Hadoop Cluster
The size of the data is the most important factor when defining a Hadoop cluster.
5 servers with 10 TB storage capacity each
Total storage capacity: 50 TB
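The sizing arithmetic above can be sketched as a short calculation. The 50 TB figure is raw capacity. The usable figure below assumes HDFS's default replication factor of 3, which the slides do not mention, and the function names are hypothetical.

```python
def raw_capacity_tb(servers: int, tb_per_server: float) -> float:
    """Total disk capacity across all servers, before replication."""
    return servers * tb_per_server


def usable_capacity_tb(raw_tb: float, replication_factor: int = 3) -> float:
    """Approximate space left for unique data when every block is stored
    replication_factor times (assumption: HDFS default of 3)."""
    return raw_tb / replication_factor


raw = raw_capacity_tb(servers=5, tb_per_server=10)
print(f"Raw capacity: {raw} TB")
print(f"Usable at 3x replication: {usable_capacity_tb(raw):.1f} TB")
```

This ignores space reserved for the operating system, temporary files, and intermediate job output, so real sizing would leave further headroom.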
14. Hadoop Cluster
Assume that we have a Hadoop cluster with 4 nodes
Master node: NameNode, ResourceManager
Slave nodes: DataNode, NodeManager
15. Secondary Name Node
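The Secondary NameNode is not a hot backup for the NameNode.
It periodically checkpoints the NameNode metadata (hourly by default) by merging the edit log into the fsimage.
Its latest checkpoint can be used to restart a crashed Hadoop cluster.
The Secondary NameNode is an important daemon in Hadoop 1; in Hadoop 2 it is less important, because a standby NameNode can be configured for high availability.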
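The checkpoint idea can be illustrated with a toy model: the last metadata snapshot plus a log of later changes are merged into a new snapshot. This is a conceptual sketch only. The `checkpoint` function, the tuple-based log, and the sample paths are hypothetical and do not reflect Hadoop's actual file formats or APIs.

```python
def checkpoint(fsimage: frozenset, edit_log: list) -> frozenset:
    """Replay logged operations on top of the last snapshot (toy model)."""
    paths = set(fsimage)
    for op, path in edit_log:
        if op == "create":
            paths.add(path)
        elif op == "delete":
            paths.discard(path)
    return frozenset(paths)


# Last snapshot of the namespace, and the changes recorded since then.
fsimage = frozenset({"/data/a.txt"})
edit_log = [("create", "/data/b.txt"), ("delete", "/data/a.txt")]

new_fsimage = checkpoint(fsimage, edit_log)
print(sorted(new_fsimage))  # ['/data/b.txt']
```

Any change made after the most recent checkpoint exists only in the log, which is why a checkpoint copy is not the same as a hot backup.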
16. Modes of Operation
Standalone
Pseudo-distributed
Fully distributed