Large-Scale Data Storage and Processing
for Scientists in The Netherlands




Evert.Lammerts@SARA.nl
January 21, 2011
NBIC BioAssist Programmers' Day
• Supercomputing
• Cloud computing
• Grid computing
• Cluster computing
• GPU computing




Status Quo:
Storage separate from Compute




Case Study:
Virtual Knowledge Studio
 http://simshelf2.virtualknowledgestudio.nl/activities/biggrid-wikipedia-experiment




• How do categories in Wikipedia evolve over time? (And how do they relate to internal links?)
• 2.7 TB of raw text, in a single file
• A Java application that searches the wiki markup for categories such as [[Category:NAME]] (a minimal matching sketch follows below)
• Executed on the Grid




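To illustrate the kind of matching this application performs, here is a minimal Java sketch (hypothetical code, not the actual VKS application) that scans a piece of wiki markup for category tags:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CategoryScan {
    // Matches [[Category:NAME]] as well as [[Category:NAME|sortkey]]
    private static final Pattern CATEGORY =
            Pattern.compile("\\[\\[Category:([^\\]|]+)");

    public static void main(String[] args) {
        String markup = "Some text... [[Category:Bioinformatics]] more text";
        Matcher m = CATEGORY.matcher(markup);
        while (m.find()) {
            System.out.println(m.group(1).trim());   // prints: Bioinformatics
        }
    }
}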
1.1) Copy the file from the local machine to Grid storage
2.1) Stream the file from Grid storage to a single machine
2.2) Cut it into pieces of 10 GB
2.3) Stream the pieces back to Grid storage
3.1) Process all files in parallel: N machines run the Java application, each fetching a 10 GB file as input, processing it, and putting the result back
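Step 2.2 is typically a hand-rolled splitter. A rough sketch of that manual step (hypothetical code; the chunk size and file names are placeholders):

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class SplitFile {
    public static void main(String[] args) throws IOException {
        long chunkSize = 10L * 1024 * 1024 * 1024;      // 10 GB per piece
        InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
        byte[] buffer = new byte[64 * 1024];
        OutputStream out = null;
        long written = chunkSize;                       // forces opening part 0
        int part = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            if (written >= chunkSize) {                 // current piece is full: start a new one
                if (out != null) out.close();
                out = new BufferedOutputStream(
                        new FileOutputStream(args[0] + ".part" + part++));
                written = 0;
            }
            // Note: this cuts blindly, possibly in the middle of an article;
            // exactly the kind of detail that makes the manual approach error prone.
            out.write(buffer, 0, read);
            written += read;
        }
        if (out != null) out.close();
        in.close();
    }
}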
Status Quo:
Arrange your own (Data-)parallelism
• Cut the dataset up into processable chunks. The chunk size depends:
    - on the local disk space on the processing node...
    - on the total processing capacity available...
    - on the smallest unit of work (the highest achievable degree of parallelism)...
    - on the substance (sometimes you don't want many output files, e.g. when building a search index).
• Submit the number of jobs you consider necessary:
    - to a cluster close to your data (270 x 10 GB over the WAN is a bad idea)
    - the number might depend on cluster capacity, the number of chunks, the smallest unit of work, the substance...

When dealing with large data, say 100 GB+, this is
ERROR PRONE, TIME CONSUMING AND NOT FOR NEWBIES!

2002: Nutch*
2004: MapReduce / GFS**
2006: Hadoop

* http://nutch.apache.org/
** http://labs.google.com/papers/mapreduce.html
   http://labs.google.com/papers/gfs.html

2010/2011: A Hype in Production




http://wiki.apache.org/hadoop/PoweredBy
Hadoop Distributed File System (HDFS)
• A very large DFS. Order of magnitude:
    - 10k nodes
    - millions of files
    - petabytes of storage
• Assumes commodity hardware:
    - redundancy through replication
    - failure handling and recovery
• Optimized for batch processing:
    - locations of data exposed to computation
    - high aggregate bandwidth
(Source: http://www.slideshare.net/jhammerb/hdfs-architecture)
HDFS Continued...
• A single namespace for the entire cluster
• Data coherency
    - Write-once-read-many model
    - Only appending is supported for existing files
• Files are broken up into chunks (blocks)
    - Block sizes range from 64 to 256 MB, depending on configuration (see the fragment below)
    - Blocks are distributed over nodes (a single file, consisting of N blocks, is stored on M nodes)
    - Blocks are replicated, and the replicas are distributed
• Clients access the blocks of a file directly at the nodes
    - This creates high aggregate bandwidth!
(Source: http://www.slideshare.net/jhammerb/hdfs-architecture)
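For reference, block size and replication factor are cluster-wide settings (overridable per file). A minimal hdfs-site.xml fragment, assuming Hadoop 0.20-era property names:

<!-- hdfs-site.xml (fragment): 128 MB blocks, 3 replicas per block -->
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>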
HDFS NameNode & DataNodes

NameNode
• Manages the file system namespace
    - Mapping filenames to blocks
    - Mapping blocks to DataNodes
• Cluster configuration
• Replication management

DataNode
• A block server
    - Stores data in the local FS
    - Stores metadata of a block (e.g. checksums)
    - Serves (meta)data to clients
• Facilitates the pipeline to other DNs
http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html

• Metadata operations
    - Communicate with the NN only
    - ls, lsr, df, du, chmod, chown... etc.
• Read/write (block) operations
    - Communicate with the NN and the DNs
    - put, copyFromLocal, copyToLocal, tail... etc.

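A few example invocations of the commands above (paths are made up; depending on the version the launcher is 'hadoop fs' or 'hadoop dfs'):

# Metadata operations: talk to the NameNode only
hadoop fs -ls /user/alice
hadoop fs -du /user/alice/wikipedia
hadoop fs -chmod 644 /user/alice/wikipedia/dump.xml

# Read/write operations: talk to the NameNode and the DataNodes
hadoop fs -put dump.xml /user/alice/wikipedia/dump.xml
hadoop fs -copyToLocal /user/alice/results/part-r-00000 .
hadoop fs -tail /user/alice/results/part-r-00000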
HDFS
Application Programming Interface (API)
• Enables programmers to access any HDFS from their code
• Described at http://hadoop.apache.org/common/docs/r0.20.0/api/index.html
• Written in (and thus available for) Java, but...
• Is also exposed through Apache Thrift, so can be accessed from:
    - C++, Python, PHP, Ruby, and others
    - See http://wiki.apache.org/hadoop/HDFS-APIs
• Has a separate C API (libhdfs)

So: you can enable your services to work with HDFS
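A minimal sketch of the Java side, using org.apache.hadoop.fs.FileSystem (the paths are placeholders; the NameNode address comes from the core-site.xml on the classpath):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();         // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Metadata operation: handled by the NameNode only
        for (FileStatus status : fs.listStatus(new Path("/user/alice"))) {
            System.out.println(status.getPath() + "\t" + status.getLen());
        }

        // Read operation: the data itself is streamed from the DataNodes holding the blocks
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/user/alice/wikipedia/dump.xml"))));
        System.out.println(reader.readLine());
        reader.close();
        fs.close();
    }
}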
MapReduce
• Is a framework for distributed (parallel) processing of large datasets
• Provides a programming model
• Lets users plug in their own code
• Uses a common pattern:
    cat | grep | sort | uniq > file
    input | map | shuffle | reduce > output
• Is useful for large-scale data analytics and processing
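To make the pattern concrete: a minimal map/reduce pair in Java (Hadoop 0.20 "new" API) that counts the category tags from the case study. This is a sketch under those assumptions, not the actual VKS code.

// CategoryMapper.java
// map: emit (category, 1) for every [[Category:...]] tag found in a line of wiki markup
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CategoryMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final Pattern CATEGORY = Pattern.compile("\\[\\[Category:([^\\]|]+)");
    private static final IntWritable ONE = new IntWritable(1);
    private final Text category = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        Matcher m = CATEGORY.matcher(line.toString());
        while (m.find()) {
            category.set(m.group(1).trim());
            context.write(category, ONE);
        }
    }
}

// CategorySumReducer.java
// reduce: after the shuffle, all counts for one category arrive together; sum them
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CategorySumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text category, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        context.write(category, new IntWritable(sum));
    }
}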
MapReduce Continued...
• Is great for processing large datasets!
    - Sends the computation to the data, so little data goes over the lines
    - Uses the blocks stored in the DFS, so no splitting is required (this is a bit more sophisticated depending on your input)
• Handles parallelism for you
    - One map per block, if possible
• Scales basically linearly
    - time_on_cluster = time_on_single_core / total_cores
• Java, but streaming is possible (plus others, see later)
MapReduce
JobTracker & TaskTrackers

JobTracker
• Holds job metadata
    - Status of the job
    - Status of the tasks running on the TTs
• Decides on scheduling
• Delegates creation of 'InputSplits'

TaskTracker
• Requests work from the JT
    - Fetches the code to execute from the DFS
    - Applies job-specific configuration
• Communicates with the JT on tasks:
    - Sending output, killing tasks, task updates, etc.
MapReduce client




MapReduce
Application Programming Interface (API)
• Enables programmers to write MapReduce jobs
    - More info on MR jobs: http://www.slideshare.net/evertlammerts/infodoc-6107350
• Enables programmers to communicate with a JobTracker
    - Submitting jobs, getting statuses, cancelling jobs, etc.
• Described at http://hadoop.apache.org/common/docs/r0.20.0/api/index.html

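And the driver side, which configures a job and submits it to the JobTracker; a sketch using the mapper and reducer from the earlier example (paths and names are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CategoryCountJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // reads mapred-site.xml etc.
        Job job = new Job(conf, "category-count");
        job.setJarByClass(CategoryCountJob.class);

        job.setMapperClass(CategoryMapper.class);             // sketched earlier
        job.setCombinerClass(CategorySumReducer.class);
        job.setReducerClass(CategorySumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/alice/wikipedia"));
        FileOutputFormat.setOutputPath(job, new Path("/user/alice/category-counts"));

        // Submits the job to the JobTracker and polls its status until completion
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}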
Case Study:
Virtual Knowledge Studio

1) Load the file into HDFS
2) Submit the code to MR
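In terms of commands, the two steps might look like this (jar, class, and path names are hypothetical):

# 1) Load the file into HDFS
hadoop fs -put wikipedia-dump.xml /user/alice/wikipedia/

# 2) Submit the code to MapReduce
hadoop jar category-count.jar CategoryCountJob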
What's more on Hadoop?

Lots!
• Apache Pig http://pig.apache.org
    - Analyze datasets in a high-level language, Pig Latin
    - Simple! SQL-like. Extremely fast experiments.
    - N-stage jobs (MR chaining!)
• Apache Hive http://hive.apache.org
    - Data warehousing
    - Hive QL
• Apache HBase http://hbase.apache.org
    - BigTable implementation (Google)
    - In-memory operation
    - Performance good enough for websites (Facebook built its Messaging Platform on top of it)
• Yahoo! Oozie http://yahoo.github.com/oozie/
    - Hadoop workflow engine
• Apache [Avro | Chukwa | Hama | Mahout] and so on
• 3rd party:
    - ElephantBird
    - Cloudera's Distribution for Hadoop
    - Hue
    - Yahoo!'s Distribution for Hadoop
Hadoop @ SARA

A prototype cluster
• Since December 2010
• 20 cores for MR (TTs)
• 110 TB gross for HDFS (DNs) (55 TB net)
• Hue web interface for job submission & management
• SFTP interface to HDFS
• Pig 0.8
• Hive
• Available for scientists / scientific programmers until May / June 2011

Towards a production infrastructure?
• Depending on the results

It's open for you all as well: ask me for an account!
Hadoop for:
• Large-scale data storage and processing
• Fundamental difference: data locality!
• Small files? Don't, but... Hadoop Archives (HAR) can help (see the example after this list)
• Archival? Don't. Use tape storage. (We have lots!)
• Very fast analytics (Pig!)
• For data-parallel applications (not good at cross products; use Huygens or Lisa!)
• Legacy applications are possible through piping / streaming (Weird dependencies? Use the Cloud!)
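For the small-files case, a Hadoop Archive packs many small files into a single HDFS-friendly archive; an example invocation (paths are placeholders):

# Pack /user/alice/manysmallfiles into one archive under /user/alice/archives
hadoop archive -archiveName smallfiles.har -p /user/alice manysmallfiles /user/alice/archives

# The archive is browsable through the har:// filesystem
hadoop fs -ls har:///user/alice/archives/smallfiles.har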

We'll do another Hackathon on Hadoop. Interested? Send me a mail!
