Large-scale data processing [at SARA] [with Apache Hadoop]
Evert Lammerts
February 9, 2012, Netherlands Hadoop User Group
Who's who?
Who has worked at scale? E.g. database sharding, round-robin HTTP, Hadoop, key-value databases, anything else over multiple nodes?
>= 5 nodes, >= 10 nodes, >= 50 nodes, >= 100 nodes?
In this talk
Why large-scale data processing?
An introduction to scale @ SARA
An introduction to Hadoop & MapReduce
Hadoop @ SARA
(Jimmy Lin, University of Maryland / Twitter, 2011)
(IEEE Intelligent Systems, 03/04-2009, vol 24, issue 2, p8-12)
s/knowledge/data/g*
HTTP logs, click data, query logs, CRM data, financial data, social networks, archives, crawls, and many more
You already have your data
(*Jimmy Lin, University of Maryland / Twitter, 2011)
Data processing as a commodity
Cheap clusters
Simple programming models
Easy-to-learn scripting
Anybody with the know-how can generate insights!
Note: "the know-how" = Data Science: DevOps, programming, algorithms, domain knowledge
SARA: the national center for scientific computing
Facilitating science in The Netherlands with equipment for and expertise on large-scale computing, large-scale data storage, high-performance networking, eScience, and visualization
Large-scale data != new
Different types of computing
Parallelism
Data parallelism
Task parallelism
Architectures
SIMD: Single Instruction Multiple Data
MIMD: Multiple Instruction Multiple Data
MISD: Multiple Instruction Single Data
SISD: Single Instruction Single Data (Von Neumann)
Parallelism: Amdahl's law
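The slide names Amdahl's law without stating it. As a reminder, the standard formulation is speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of workers. A minimal sketch follows; the function names `amdahl_speedup` and `speedup_limit` are illustrative and not taken from the talk.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Theoretical speedup under Amdahl's law: 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def speedup_limit(parallel_fraction: float) -> float:
    """Upper bound on speedup as the number of workers grows without limit."""
    serial_fraction = 1.0 - parallel_fraction
    # With no serial part at all, the bound is unlimited.
    return float("inf") if serial_fraction == 0.0 else 1.0 / serial_fraction


if __name__ == "__main__":
    # Hypothetical workload: 95% of the job parallelizes.
    for n in (5, 10, 50, 100):
        print(n, round(amdahl_speedup(0.95, n), 2))
```

The serial fraction dominates quickly: in this hypothetical workload, no number of nodes can push the speedup past roughly 20x.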
Data parallelism
Compute @ SARA
What's different about Hadoop?
No more do-it-yourself parallelism: it's hard!
But rather linearly scalable data parallelism
Separating the "what" from the "how"
(NYT, 14/06/2006)
A bit of history
2002: Nutch*
2004: MR/GFS**
2006: Hadoop
* http://nutch.apache.org/
** http://labs.google.com/papers/mapreduce.html http://labs.google.com/papers/gfs.html
2010 - 2012: A Hype in Production
http://wiki.apache.org/hadoop/PoweredBy
Core principles
Scale out, not up
Move processing to the data
Process data sequentially, avoid random reads
Seamless scalability (Jimmy Lin, University of Maryland / Twitter, 2011)
A typical data-parallel problem in abstraction
Iterate over a large number of records
Extract something of interest
Create an ordering in intermediate results
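The three steps listed above can be sketched on a single machine in plain Python. This illustrates the pattern only; it is not Hadoop's API. The word-splitting rule and the names `extract` and `group_sorted` are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def extract(record):
    """Extract something of interest: here, one (word, 1) pair per word."""
    for word in record.lower().split():
        yield (word, 1)


def group_sorted(records):
    """Run the three steps over an iterable of text records."""
    # Steps 1 and 2: iterate over the records and extract intermediate pairs.
    intermediate = [pair for record in records for pair in extract(record)]
    # Step 3: create an ordering in the intermediate results by sorting on key,
    # so that all values for the same key end up next to each other.
    intermediate.sort(key=itemgetter(0))
    return {
        key: [value for _, value in group]
        for key, group in groupby(intermediate, key=itemgetter(0))
    }


if __name__ == "__main__":
    # Hypothetical input records, standing in for lines of a large file.
    print(group_sorted(["to be or not", "to be"]))
```

Sorting before grouping matters: `itertools.groupby` only merges adjacent items with equal keys, which is the same reason the intermediate results need an ordering in the first place.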

First NL-HUG: Large-scale data processing at SARA with Apache Hadoop