The document motivates Hadoop: while CPU speed, RAM, and disk capacity have grown by orders of magnitude since 1990, disk read/write speeds have improved far less, so large datasets must be read and written in parallel across many disks. It introduces the Hadoop Distributed File System (HDFS) and MapReduce as the solution for parallel processing of large datasets across clusters of machines: HDFS provides one big virtual file system, while MapReduce abstracts away disk reads and writes, letting the programmer express computation over sets of keys and values.
2. Hadoop Motivation
• Hardware improvements through the years …
– CPU speed: 40 MIPS (1990) -> 50 GIPS (2010) => 1,250x
– RAM: 640 kB (1990) -> 8 GB (2010) => 12,500x
– Disk capacity & cost: 40 MB for $400 (1990) -> 1 TB for $100 (2010) => 25,000x
• What about disk read speed / disk latency?
– 4.4 MB/s in 1990
– 100 MB/s in 2010 => only ~23x faster
– at 100 MB/s, scanning a full 1 TB disk takes almost 3 hours; spread over 100 disks read in parallel, it takes under 2 minutes
– => read from multiple disks in parallel
– and it's not just about parallel reads, but parallel writes as well
3. Hadoop Motivation – Issues
• Parallel reads and writes bring challenges …
– Hardware failure
  • disk failures => replication? => RAID?
– Data combination
  • combining data read from multiple disks
• Solution … HADOOP
– Hadoop Distributed File System (HDFS)
– MapReduce programming model – the analysis system
  • abstracts from disk R/W to computation over sets of keys and values (see the word-count sketch below)
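To make the key/value model concrete, here is a minimal sketch of the classic word-count job using the standard Hadoop Java API (org.apache.hadoop.mapreduce); the class names TokenizerMapper and IntSumReducer are illustrative, not prescribed by the slides:

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

    // Mapper: turns each input line into (word, 1) pairs.
    public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);
        }
      }
    }

    // Reducer: sums the counts emitted for each word.
    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      private IntWritable result = new IntWritable();

      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = Job.getInstance(conf, "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class);
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

The programmer never touches disk I/O directly: the framework reads input splits from HDFS, shuffles the (word, 1) pairs to reducers, and writes the results back to HDFS.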
6. Hadoop vs. Relational DB
            | Relational DB       | MapReduce
Data size   | GBs                 | TBs / PBs
Access      | Interactive / Batch | Batch
Updates     | Read & Write        | Write once / multiple reads
Structure   | Static schema       | Dynamic schema (analyst chooses it)
Integrity   | High                | Low
Scaling     | Non-linear          | Linear
7. Hadoop 1 vs Hadoop 2
• Hadoop 1 issues
– the NameNode is a single point of failure (SPOF)
– security
• Hadoop 2
– promotes the cluster to a "universal computational cluster" (via YARN)
– removes bottlenecks in MapReduce
11. HBase features
• NoSQL database
• Column-oriented DB
• Google's BigTable implementation
• Linear and modular scalability
• Strictly consistent reads and writes
• Automatic and configurable sharding of tables
• Automatic failover support between RegionServers
• Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables
• Easy-to-use Java API for client access (see the sketch below)
• Block cache and Bloom filters for real-time queries
• Query predicate push-down via server-side Filters
• Thrift gateway and a RESTful Web service that supports XML, Protobuf, and binary data encoding options
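To make the Java client API concrete, here is a minimal sketch of a write followed by a strictly consistent read, using the standard org.apache.hadoop.hbase.client classes; the table name "test", column family "cf", and cell coordinates are hypothetical, chosen only for illustration:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
      // Picks up hbase-site.xml from the classpath.
      Configuration conf = HBaseConfiguration.create();
      try (Connection connection = ConnectionFactory.createConnection(conf);
           Table table = connection.getTable(TableName.valueOf("test"))) {

        // Write one cell: row "row1", column family "cf", qualifier "a".
        // (Table and column family names are assumed to exist already.)
        Put put = new Put(Bytes.toBytes("row1"));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("a"),
                      Bytes.toBytes("value1"));
        table.put(put);

        // Read it back; a read after a completed write sees that write.
        Get get = new Get(Bytes.toBytes("row1"));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("a"));
        System.out.println("cf:a = " + Bytes.toString(value));
      }
    }
  }

The same table could be scanned from a MapReduce job via the base classes mentioned above, or accessed from non-Java clients through the Thrift and REST gateways.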