Big data refers to large volumes of data that are diverse in type and are produced rapidly. It is characterized by the V's: volume, velocity, variety, veracity, and value. Hadoop is an open-source software framework for distributed storage and processing of big data across clusters of commodity servers. It has two main components: HDFS for storage and MapReduce for processing. Hadoop allows for the distributed processing of large data sets across clusters in a reliable, fault-tolerant manner. The Hadoop ecosystem includes additional tools like HBase, Hive, Pig and Zookeeper that help access and manage data. Understanding Hadoop is a valuable skill as many companies now rely on big data and Hadoop technologies.
4. Big Data-What does it mean?
Velocity:
Often time sensitive , big data must be used
as it is streaming in to the enterprise it order
to maximize its value to the business.
Batch ,Near time , Real-time ,streams
Volume:
Big data comes in one size : large .
Enterprises are awash with data ,easy
amassing terabytes and even petabytes of
information.
TB , Records , Transactions ,Tables , Files.
Variety:
Big data extends beyond structured data
, including semi-structured and unstructured
data to all varieties :text , audio , video ,click
streams ,log files and more
Structured , Unstructured , Semi-structured
Veracity:
Quality and provenance of received data.
Good , Undefined , bad , Inconsistency
, Incompleteness , Ambiguity
Value
6. What is Hadoop?
Software project that enables the distributed processing of large data sets across clusters of
commodity servers
Works with structured and unstructured data
Open source software + Hardware commodity = IT cost Reduction
It is designed to scale up from a single server to thousands of machines
Very high degree of fault tolerance softwares ability to detect and handle failures at the application
layer
7. The origin of the name Hadoop.
The name Hadoop is not an acronym; its a
made-up name. The projects creator, Doug
Cutting, explains how the name came about:
The name my kid gave a stuffed yellow
elephant. Short, relatively easy to spell and
pronounce, meaningless, and not used
elsewhere: those are my naming criteria.
Kids are good at generating such. Googol is
a kids term.
9. HDFS-Hadoop Distributed File System
Distributed, scalable, and portable file system
Each node in a Hadoop instance typically has
a single Namenode : a cluster of Datanodes
form the HDFS cluster
Asynchronous replication.
Data divided into 64mb (default) or 128mb
blocks , each block replicated 3 times (default)
Namenode holds file system metadata.
Files are broken up and spread over Datanode
.
15. HBase
HBase is an open source , non-relational, distributed database
A Key-value store
A value is identified by the key
Both key and value are a byte array
The values are stored in key-order
Thus access data by key is very fast
Users create table in HBase
There is no schema of HBase table
Very good for sparse data
Takes lots of disk space
16. HBase Architecture
Master: Responsible for coordinating with region server.
Region server: Serves data for read and write
Zookeeper: Manages the HBase cluster
Low latency and random access to data
17. Hive
A system for managing and querying structured data built on Hadoop
SQL-Like query language called HQL
Main purpose is analysis and ad hoc querying
Database/table/partition DDL operation
Not for :small data sets ,Low latency queries ,OLTP
21. Hadoop Limitations
Not a high-speed SQL database.
Is not a particularly simple technology.
Hadoop is not easy to connect to legacy systems.
Hadoop is not a replacement for traditional data warehouses. It is an
adjunctive product to data warehouses.
Normal DBAs will need to learn new skills before they can adopt
Hadoop tools.
The architecture around the data - the way you store data, the way
you de-normalize data, the way you ingest data, the way you extract
data - is different in Hadoop.
Linux and Java skills are critical for making a Hadoop environment a
reality.
22. Hadoops Capability
Hadoop is a super-powerful environment that can transform your
understanding of data.
Hadoop can store vast amounts of data.
Hadoop can run queries on huge data sets.
You can archive data on Hadoop and still query it.
Hadoop allows you to ingest data at incredible speeds and analyze it and
report on it in near real-time.
Hadoop massively reduces the latency of data.
23. Hadoop: Hot skill to acquire on IT job
circuit
The market for data technologies, such as databases, is a multi-billion dollar industry.
Many start-ups are working on technology extensions to Hadoop to make it both analytical
and transactional. That would be big.
Major companies have a big data strategy and want to build their businesses on top of this
Google, the originator of Hadoop, has already moved on suggesting that within a decade
either the Hadoop framework will have to be developed beyond all recognition or that
something newer could be on the way to supplant it.
Every major internet company - be it Google, Twitter, Linkedin or Facebook - uses some form
of Hadoop .