This deck discusses large-scale data storage and processing options for scientists in the Netherlands, focusing on Hadoop and its components HDFS and MapReduce. HDFS provides a distributed file system that stores very large datasets across clusters of machines, while MapReduce allows datasets to be processed in parallel across a cluster. A case study uses HDFS to store a 2.7 TB text file and MapReduce to analyze how categories in Wikipedia articles evolve over time.
Large-Scale Data Storage and Processing for Scientists with Hadoop
1. Large-Scale Data Storage and Processing
for Scientists in The Netherlands
Evert.Lammerts@SARA.nl
January 21, 2011
NBIC BioAssist Programmers' Day
4. Case Study:
Virtual Knowledge Studio
http://simshelf2.virtualknowledgestudio.nl/activities/biggrid-wikipedia-experiment
How do categories in Wikipedia evolve over time? (And how do they relate to internal links?)
2.7 TB raw text, single file
Java application that searches for categories in wiki markup, like [[Category:NAME]]
Executed on the Grid
1.1) Copy the file from the local machine to Grid storage
2.1) Stream the file from Grid storage to a single machine
2.2) Cut it into pieces of 10 GB
2.3) Stream the pieces back to Grid storage
3.1) Process all files in parallel: N machines run the Java application, each fetching a 10 GB file as input, processing it, and putting the result back
6. Status Quo:
Arrange your own (data-)parallelism
Cut the dataset up into processable chunks. The chunk size depends:
on the local disk space available on a processing node...
on the total processing capacity available
on the smallest unit of work (the largest degree of parallelism)...
on the substance (sometimes you don't want many output files, e.g. when building a search index)
Submit the number of jobs you consider necessary:
to a cluster close to your data (270 x 10 GB over the WAN is a bad idea)
the number may depend on cluster capacity, the number of chunks, the smallest unit of work, the substance...
When dealing with large data, say 100 GB+, this is
ERROR PRONE, TIME CONSUMING AND NOT FOR NEWBIES! (A sketch of the manual approach follows below.)
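To make concrete how much of this plumbing ends up in user code, here is a minimal, illustrative Java sketch of just the chunking step. It is not the actual grid tooling: the chunk size, file names and class name are assumptions, and it still ignores record boundaries, retries, and copying the chunks to Grid storage.

```java
import java.io.*;

/** Illustrative manual chunking: split a large local file into fixed-size pieces. */
public class ChunkSplitter {
    public static void main(String[] args) throws IOException {
        final long chunkSize = 10L * 1024 * 1024 * 1024;  // 10 GB per chunk (assumed)
        File input = new File(args[0]);                   // e.g. the 2.7 TB dump

        byte[] buffer = new byte[64 * 1024];
        InputStream in = new BufferedInputStream(new FileInputStream(input));
        OutputStream out = null;
        long written = chunkSize;   // forces opening the first chunk file
        int chunkNo = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            if (written >= chunkSize) {                   // start the next chunk file
                if (out != null) out.close();
                out = new BufferedOutputStream(new FileOutputStream(
                        input.getName() + ".part" + (chunkNo++)));
                written = 0;
            }
            out.write(buffer, 0, read);  // NOTE: cuts records mid-line; handling that is also up to you
            written += read;
        }
        if (out != null) out.close();
        in.close();
    }
}
```

Everything else on this slide (choosing the number of jobs, keeping the chunks close to the cluster, collecting the output) comes on top of this.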
10. Hadoop Distributed File System (HDFS)
Very large DFS. Order of magnitude:
10k nodes
millions of files
PetaBytes of storage
Assumes commodity hardware:
redundancy through replication
failure handling and recovery
Optimized for batch processing:
locations of data exposed to computation
high aggregate bandwidth
http://www.slideshare.net/jhammerb/hdfs-architecture
11. HDFS Continued...
Single Namespace for the entire cluster
Data coherency
Write-once-read-many model
Only appending is supported for existing files
Files are broken up into chunks (blocks)
Block sizes range from 64 to 256 MB, depending on configuration
Blocks are distributed over nodes (a single file, consisting of N blocks, is stored on M nodes)
Blocks are replicated, and the replicas are distributed
Clients access the blocks of a file at the nodes directly (see the sketch below)
This creates high aggregate bandwidth!
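As an illustration of how block locations are exposed to clients, the hedged Java sketch below asks the NameNode for the block locations of a file through the public FileSystem API (the path is an assumption):

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative sketch: list which DataNodes hold each block of a file. */
public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]);              // e.g. /user/you/wikipedia.txt (assumed)

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                block.getOffset(), block.getLength(), Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}
```

MapReduce uses exactly this information to schedule map tasks on nodes that hold the blocks locally.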
http://www.slideshare.net/jhammerb/hdfs-architecture
12. HDFS NameNode & DataNodes
NameNode:
Manages the file system namespace
Maps filenames to blocks
Maps blocks to DataNodes
Cluster configuration
Replication management
DataNode:
A block server
Stores data in the local FS
Stores metadata of a block (e.g. a hash)
Serves (meta)data to clients
Facilitates a pipeline to other DNs
13. http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html
Metadata operations
Communicate with the NN only
ls, lsr, df, du, chmod, chown... etc.
R/W (block) operations
Communicate with the NN and DNs
put, copyFromLocal, copyToLocal, tail... etc.
14. HDFS
Application Programming Interface (API)
Enables programmers to access any HDFS from their code
Described at
http://hadoop.apache.org/common/docs/r0.20.0/api/index.html
Written in (and thus available for) Java, but...
is also exposed through Apache Thrift, so it can be accessed from C++, Python, PHP, Ruby, and others
See http://wiki.apache.org/hadoop/HDFS-APIs
Has a separate C API (libhdfs)
So: you can enable your services to work with HDFS (a minimal Java sketch follows below)
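A minimal sketch of the Java API in use, assuming a reachable HDFS; the path and class name are illustrative. It writes a small file and reads it back:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative round trip through the HDFS FileSystem API. */
public class HdfsHelloWorld {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads the default FS from the config
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/hello.txt");     // assumed path

        // Write: HDFS is write-once; create a new file (only appends are possible afterwards)
        FSDataOutputStream out = fs.create(path, true);
        out.write("Hello HDFS\n".getBytes("UTF-8"));
        out.close();

        // Read it back
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)));
        System.out.println(in.readLine());
        in.close();

        fs.close();
    }
}
```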
15. MapReduce
Is a framework for distributed (parallel) processing of large datasets
Provides a programming model
Lets users plug in their own code
Uses a common pattern:
cat | grep | sort | uniq > file
input | map | shuffle | reduce > output
Is useful for large-scale data analytics and processing (a minimal sketch follows below)
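As a concrete instance of the input | map | shuffle | reduce pattern, here is a hedged Java sketch (0.20-style API) that counts [[Category:NAME]] tags per category. It is an illustration, not the actual case-study code; the class names and the regular expression are assumptions.

```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Map: emit (category, 1) for every [[Category:NAME]] tag found in a line of wiki markup. */
public class CategoryMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final Pattern CATEGORY = Pattern.compile("\\[\\[Category:([^\\]|]+)");
    private static final IntWritable ONE = new IntWritable(1);
    private final Text category = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        Matcher m = CATEGORY.matcher(line.toString());
        while (m.find()) {
            category.set(m.group(1).trim());
            context.write(category, ONE);       // shuffled and grouped by category
        }
    }
}

/** Reduce: sum the counts per category. */
class CategoryReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text category, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        context.write(category, new IntWritable(sum));
    }
}
```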
16. MapReduce Continued...
Is great for processing large datasets!
Sends computation to the data, so little data goes over the lines
Uses blocks stored in the DFS, so no manual splitting required
(this is a bit more sophisticated depending on your input)
Handles parallelism for you
One map task per block, if possible
Scales basically linearly (worked example below):
time_on_cluster = time_on_single_core / total_cores
Java, but streaming is possible (plus others, see later)
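To put the one-map-per-block rule and the scaling formula into numbers (illustrative only, assuming a 128 MB block size within the 64-256 MB range mentioned earlier): the 2.7 TB case-study file splits into roughly 2.7 TB / 128 MB ≈ 22,000 blocks, and therefore about 22,000 map tasks; a job that would hypothetically take 100 hours on a single core would, under ideal linear scaling, take about 100 / 20 = 5 hours on the 20 MR cores of the prototype cluster described later.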
17. MapReduce
JobTracker & TaskTrackers
JobTracker:
Holds job metadata
Status of the job
Status of tasks running on TTs
Decides on scheduling
Delegates creation of 'InputSplits'
TaskTracker:
Requests work from the JT
Fetches the code to execute from the DFS
Applies job-specific configuration
Communicates with the JT on tasks: sending output, killing tasks, task updates, etc.
19. MapReduce
Application Programming Interface (API)
Enables programmers to write MapReduce jobs
More info on MR jobs:
http://www.slideshare.net/evertlammerts/infodoc-6107350
Enables programmers to communicate with a JobTracker
Submitting jobs, getting statuses, cancelling jobs, etc. (a minimal driver sketch follows below)
Described at
http://hadoop.apache.org/common/docs/r0.20.0/api/index.html
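A minimal driver sketch against the 0.20 API, assuming the mapper and reducer classes from the earlier sketch and illustrative input/output paths; it configures a job, submits it to the JobTracker, and polls its status:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Illustrative driver: configure, submit, and monitor a MapReduce job. */
public class CategoryCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wikipedia-category-count");

        job.setJarByClass(CategoryCountDriver.class);
        job.setMapperClass(CategoryMapper.class);       // from the earlier sketch
        job.setReducerClass(CategoryReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /data/wikipedia.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet

        job.submit();                                   // hand the job to the JobTracker
        while (!job.isComplete()) {                     // poll job status
            System.out.printf("map %.0f%% reduce %.0f%%%n",
                job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000);
        }
        System.exit(job.isSuccessful() ? 0 : 1);
    }
}
```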
20. Case Study:
Virtual Knowledge Studio
1) Load the file into HDFS
2) Submit the code to MR (both steps are sketched below)
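A hedged sketch of these two steps in Java, reusing the illustrative driver from above; all paths are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative end-to-end run: load the dump into HDFS, then submit the job. */
public class LoadAndSubmit {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // 1) Load the file into HDFS; it is split into blocks and replicated automatically
        fs.copyFromLocalFile(new Path("/local/wikipedia-dump.txt"),
                             new Path("/user/you/wikipedia-dump.txt"));

        // 2) Submit the code to MapReduce, reusing the driver sketch shown earlier
        CategoryCountDriver.main(new String[] {
            "/user/you/wikipedia-dump.txt", "/user/you/category-counts" });
    }
}
```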
21. What's more on Hadoop?
Lots!
Apache Pig http://pig.apache.org
Analyze datasets in a high-level language, Pig Latin
Simple, SQL-like. Extremely fast experiments.
N-stage jobs (MR chaining!)
Apache Hive http://hive.apache.org
Data Warehousing
Hive QL
Apache HBase http://hbase.apache.org
BigTable implementation (Google)
In-memory operation
Performance good enough for websites (Facebook built its Messaging Platform on top of it)
Yahoo! Oozie http://yahoo.github.com/oozie/
Hadoop workflow engine
Apache [AVRO | Chukwa | Hama | Mahout] and so on
3rd Party:
ElephantBird
Cloudera's Distribution for Hadoop
Hue
Yahoo's Distribution for Hadoop
22. Hadoop @ SARA
A prototype cluster
Since December 2010
20 cores for MR (TTs)
110 TB gross for HDFS (DNs), 55 TB net
Hue web-interface for job submission & management
SFTP interface to HDFS
Pig 0.8
Hive
Available for scientists / scientific programmers until May / June 2011
Towards a production infrastructure?
Depending on results
It's open for you all as well: ask me for an account!
24. Hadoop for:
Large-scale data storage and processing
Fundamental difference: data locality!
Small files? Don't, but... Hadoop Archives (HAR)
Archival? Don't. Use tape storage. (We have lots!)
Very fast analytics (Pig!)
For data-parallel applications (not good at cross-products; use Huygens or Lisa!)
Legacy applications are possible through piping / streaming (Weird dependencies? Use the Cloud!)
We'll do another Hackathon on Hadoop. Interested? Send me a mail!