Data model for analysis of scholarly documents in the MapReduce paradigm

 Adam Kawa        Lukasz Bolikowski Artur Czeczko        Piotr Jan Dendek
                            Dominika Tkaczyk

                   Centre for Open Science (CeON), ICM UW


                          Warsaw, July 6, 2012




Agenda




 1   Problem definition
 2   Requirements specification
 3   Exemplary solutions based on Apache Hadoop Ecosystem tools




The data that we possess



Vast collections of scholarly documents to store
     10 million full texts
     (PDF, plain text)

     17 million document metadata records
     (described in the XML-based BWMeta format)

     4 TB of data
     (10 TB including data archives)




The tasks that we perform


Big knowledge to extract and discover
    17 million document metadata records (XML)
          contain title, subtitles, abstract, keywords, references, contributors and
          their affiliations, publishing journal, . . .
    input for many state-of-the-art machine learning algorithms
          relatively simple ones: searching for documents with a given title, finding
          scientific teams, . . .
          quite complex ones: author name disambiguation, bibliometrics,
          classification code assignment, . . .




The requirements that we have specified


Multiple demands regarding storage and processing of large amounts of data:

     scalability and parallelism – easily handle tens of terabytes of data and
     parallelize the computation effectively
     flexible data model – possibility to add or update data, and to enhance its
     content with implicit information discovered by our algorithms
     latency requirements – support batch offline processing as well as
     random, real-time read/write requests
     availability of many clients – accessible by programmers and researchers
     with diverse language preferences and expertise
     reliability and cost-effectiveness – ideally open-source software which
     does not require expensive hardware




Document-related data as linked data
Information about document-related resources can be simply described as a
directed labeled graph

     entities (e.g. documents, contributors, references) are nodes in the graph
     relationships between entities are directed labeled edges in the graph




Linked graph as a collection of RDF triples
A directed labeled graph can be simply represented as a collection of RDF triples
(see the sketch after this list)

     a triple consists of a subject, a predicate and an object
     a triple represents a statement which denotes that a resource (subject) holds
     a value (object) for some attribute (predicate) of that resource
     a triple can represent any statement about any resource
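As a minimal illustration (the class and the example identifiers such as doc-123
are hypothetical, not taken from the CeON data), a triple can be modeled as a
plain value object:

    // A minimal sketch of an RDF triple; the class and the example
    // identifiers (doc-123, person-42) are illustrative only.
    public final class Triple {
        public final String subject;   // the resource the statement is about
        public final String predicate; // an attribute of that resource
        public final String object;    // the value of the attribute

        public Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        public static void main(String[] args) {
            // "Document doc-123 has the title 'Scholarly big data'"
            Triple title = new Triple("doc-123", "hasTitle", "Scholarly big data");
            // "Document doc-123 was written by contributor person-42"
            Triple author = new Triple("doc-123", "hasContributor", "person-42");
            System.out.println(title.subject + " -[" + title.predicate + "]-> " + title.object);
            System.out.println(author.subject + " -[" + author.predicate + "]-> " + author.object);
        }
    }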




Hadoop as a solution for scalability/performance issues
Apache Hadoop is the most commonly used open-source solution for storing and
processing big data in a reliable, high-performance and cost-effective way.

    Scalable storage
    Parallel processing
    Subprojects and many Hadoop-related projects
          HDFS – distributed file system that provides high-throughput access
          to large data
          MapReduce – framework for distributed processing of large data sets
          (Java and e.g. JavaScript, Python, Perl, Ruby via Streaming); see the
          sketch after this list
          HBase – scalable, distributed data store with a flexible schema, random
          read/write access and fast scans
          Pig/Hive – higher-level abstractions on top of MapReduce (simple
          data manipulation languages)
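A minimal, self-contained sketch of the MapReduce model, under the assumption
(ours, for illustration) that triples are stored one per line in HDFS as
tab-separated subject, predicate, object; the job counts how many triples use
each predicate:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PredicateCount {
        public static class PredicateMapper
                extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] spo = line.toString().split("\t");
                if (spo.length == 3) {
                    ctx.write(new Text(spo[1]), ONE); // emit (predicate, 1)
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text predicate, Iterable<LongWritable> counts,
                                  Context ctx) throws IOException, InterruptedException {
                long sum = 0;
                for (LongWritable c : counts) {
                    sum += c.get();
                }
                ctx.write(predicate, new LongWritable(sum)); // total per predicate
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "predicate-count");
            job.setJarByClass(PredicateCount.class);
            job.setMapperClass(PredicateMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }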




Apache Hadoop Ecosystem tools as RDF triple stores

   SHARD [3] – a Hadoop-backed RDF triple store
         stores triples in flat files in HDFS
         data cannot be modified randomly
         less efficient for queries that require the inspection of only a small
         number of triples
   PigSPARQL [6] – translates SPARQL queries to Pig Latin programs and
   runs them on a Hadoop cluster
         stores RDF triples with the same predicate in separate, flat files in
         HDFS
   H2RDF [5] – an RDF store that combines MapReduce with HBase
         stores triples in HBase using three flat-wide tables
   Jena-HBase [4] – an HBase-backed RDF triple store
         provides six different pluggable HBase storage layouts


HBase as a storage layer for RDF triples

Storing RDF triples in Apache HBase has several advantages

     flexible data model – columns can be dynamically added and removed;
     multiple versions of data in a particular cell; data serialized to a byte array
     random read and write – more suitable for semi-structured RDF data
     than HDFS, where files cannot be modified randomly and usually the whole
     file must be read sequentially to find a subset of records
     availability of many clients
          interactive clients – native Java API, REST or Apache Thrift
          batch clients – MapReduce (Java), Pig (Pig Latin) and Hive
          (HiveQL)
     automatically sorted records – quick lookups and partial scans (see the
     sketch after this list); joins as fast (linear) merge-joins
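A sketch of the last point using the HBase Java client API; the table name
triples and the doc- row-key prefix are assumptions for illustration:

    // A sketch of random reads and partial scans over a hypothetical
    // "triples" table whose row keys start with the subject identifier.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TripleLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "triples");

            // Random read: fetch a single row by its exact key.
            Result one = table.get(new Get(Bytes.toBytes("doc-123")));
            System.out.println("columns in row: " + one.size());

            // Partial scan: rows are kept sorted by key, so all rows sharing
            // the "doc-" prefix are physically adjacent; "doc." is the prefix
            // with its last byte incremented, i.e. the exclusive stop row.
            Scan scan = new Scan(Bytes.toBytes("doc-"), Bytes.toBytes("doc."));
            ResultScanner results = table.getScanner(scan);
            for (Result row : results) {
                System.out.println(Bytes.toString(row.getRow()));
            }
            results.close();
            table.close();
        }
    }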



Exemplary HBase schema – Flat-wide layout

   One row per subject; each predicate becomes a column qualifier holding the
   object (sketched below)
   Advantages
        no prior knowledge about the data is required
        colocation of all information about a resource within a single row
        support of multi-valued properties
        support of reified statements (statements about statements)
   Disadvantages
        unlimited number of columns
        increased storage consumption
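A sketch of a write under this layout, assuming a hypothetical triples_flat
table with a single column family p:

    // A sketch of the flat-wide layout: one row per subject, one column
    // per predicate, the object as the cell value.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FlatWideWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "triples_flat");
            // Columns appear and disappear as statements are added or
            // removed, so no schema needs to be declared up front.
            Put put = new Put(Bytes.toBytes("doc-123"));   // row key = subject
            put.add(Bytes.toBytes("p"), Bytes.toBytes("hasTitle"),
                    Bytes.toBytes("Scholarly big data"));
            put.add(Bytes.toBytes("p"), Bytes.toBytes("hasContributor"),
                    Bytes.toBytes("person-42"));
            table.put(put);
            table.close();
        }
    }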




Exemplary HBase schema – Vertically Partitioned layout [1]

    One table (or column family) per predicate; the row key is the subject
    (sketched below)
    Advantages
         support of multi-valued properties
         support of reified statements (statements about statements)
         storage space savings when compared to the previous layout
         first-step (predicate-bound) pairwise joins as fast merge-joins
    Disadvantages
         increased number of joins
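A sketch of the same statement under this layout; the per-predicate table name
pred_hasTitle and the family/qualifier names are assumptions:

    // A sketch of the vertically partitioned layout: one table per
    // predicate, row key = subject, one value column for the object.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class VerticallyPartitionedWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // This table holds only "hasTitle" statements; every other
            // predicate lives in its own table, which multiplies tables
            // but keeps each one small and sorted by subject, so
            // predicate-bound joins run as linear merge-joins.
            HTable titles = new HTable(conf, "pred_hasTitle");
            Put put = new Put(Bytes.toBytes("doc-123"));   // row key = subject
            put.add(Bytes.toBytes("v"), Bytes.toBytes("object"),
                    Bytes.toBytes("Scholarly big data"));
            titles.put(put);
            titles.close();
        }
    }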




Exemplary HBase schema – Hexastore layout [2]

   Every triple is stored under all six orderings of (subject, predicate,
   object), so each lookup pattern maps to a sorted prefix scan (sketched below)
   Advantages
        support of multi-valued properties
        support of reified statements (statements about statements)
        first-step pairwise joins as fast merge-joins
   Disadvantages
        increased number of joins
        increased storage consumption
        more complicated update operations
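A sketch of the six index entries generated for a single triple; the key
encoding and the separator byte are assumptions. It also makes the last two
disadvantages concrete: every insert, update or delete must touch six entries.

    // A sketch of the hexastore idea: one index entry per ordering of
    // (subject, predicate, object).
    public class HexastoreKeys {
        private static final char SEP = '\u0001'; // assumed absent in values

        static String[] rowKeys(String s, String p, String o) {
            return new String[] {
                "SPO" + SEP + s + SEP + p + SEP + o,
                "SOP" + SEP + s + SEP + o + SEP + p,
                "PSO" + SEP + p + SEP + s + SEP + o,
                "POS" + SEP + p + SEP + o + SEP + s,
                "OSP" + SEP + o + SEP + s + SEP + p,
                "OPS" + SEP + o + SEP + p + SEP + s,
            };
        }

        public static void main(String[] args) {
            // Six sorted index entries for one logical triple.
            for (String key : rowKeys("doc-123", "hasTitle", "Scholarly big data")) {
                System.out.println(key.replace(SEP, '|'));
            }
        }
    }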




HBase schema – other layouts




Some derivative and hybrid layouts exist to combine the advantages of the
original layouts
     a combination of the vertically partitioned and the hexastore layouts [4]
     a combination of the flat-wide and the vertically partitioned layouts [4]




Challenges




   a large number of join operations
        relatively expensive
        and practically cannot be avoided (at least for more complex queries)
        but specialized join techniques can be used, e.g. multi join, merge-sort
        join, replicated join, skewed join (see the merge-join sketch after
        this list)
   lack of native support for cross-row atomicity (e.g. in the form of
   transactions)
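A sketch of why sorted storage makes the merge-sort join cheap: two predicate
tables already sorted by subject (the order HBase maintains) can be joined in
one linear pass. The input shape is an assumption, and duplicate subjects per
table are ignored for brevity:

    // A sketch of a linear merge-join over two predicate tables whose
    // (subject, object) rows are already sorted by subject; finds the
    // subjects that carry both predicates.
    import java.util.ArrayList;
    import java.util.List;

    public class MergeJoin {
        // Each input: sorted (subject, object) pairs from one predicate table.
        static List<String[]> join(List<String[]> left, List<String[]> right) {
            List<String[]> out = new ArrayList<String[]>();
            int i = 0, j = 0;
            while (i < left.size() && j < right.size()) {
                int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
                if (cmp < 0) {
                    i++;                     // subject present only on the left
                } else if (cmp > 0) {
                    j++;                     // subject present only on the right
                } else {
                    // Same subject on both sides: emit the joined row.
                    out.add(new String[] {
                        left.get(i)[0], left.get(i)[1], right.get(j)[1] });
                    i++;
                    j++;                     // one match per subject, for brevity
                }
            }
            return out;
        }
    }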




Possible performance optimization techniques



property tables – properties often queried together are stored in the same
     record for quick access [8, 9] (see the sketch after this list)
     materialized path expressions – precomputing and materializing the
     most commonly used paths through an RDF graph [1, 2]
     graph-oriented partitioning scheme [7]
          takes advantage of the spatial locality inherent in graph pattern
          matching
          replicates data on the border of a partition more aggressively
          (however, problematic for a graph that is modified)
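A sketch of the property-table idea in HBase terms: predicates that are
frequently queried together (here, hypothetically, hasTitle and hasYear) live
in the same row, so one Get retrieves both in a single round trip. The table,
family and qualifier names are assumptions:

    // A sketch of reading a property-table row: two commonly co-queried
    // predicates fetched with a single random read.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PropertyTableRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "doc_properties");
            Get get = new Get(Bytes.toBytes("doc-123"));
            get.addColumn(Bytes.toBytes("p"), Bytes.toBytes("hasTitle"));
            get.addColumn(Bytes.toBytes("p"), Bytes.toBytes("hasYear"));
            Result r = table.get(get);
            String title = Bytes.toString(
                r.getValue(Bytes.toBytes("p"), Bytes.toBytes("hasTitle")));
            String year = Bytes.toString(
                r.getValue(Bytes.toBytes("p"), Bytes.toBytes("hasYear")));
            System.out.println(title + " (" + year + ")");
            table.close();
        }
    }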




The ways of processing data from HBase

Many tools are integrated with HBase and can read data from and write
data to HBase tables
     Java MapReduce
          possibility to use our legacy Java code in map and reduce methods
          typically delivers better performance than Apache Pig (see the
          sketch after this list)
     Apache Pig
          provides common data operations (e.g. filters, unions, joins, ordering)
          and nested types (e.g. tuples, bags, maps)
          supports multiple specialized join implementations
          possibility to run MapReduce jobs directly from Pig Latin scripts
          can be embedded in Python code
     Interactive clients (e.g. Java API, REST or Apache Thrift)
          interactive access to a relatively small subset of our data by sending
          API calls on demand, e.g. from a web-based client
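A sketch of the first option: a map-only Java MapReduce job scanning an HBase
table via the TableMapper helper that ships with HBase. The table and column
names are assumptions carried over from the earlier sketches:

    // A sketch of batch processing over HBase with Java MapReduce:
    // counts rows whose "p:hasTitle" column is present.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class TitleCounter {
        static class TitleMapper extends TableMapper<NullWritable, LongWritable> {
            @Override
            protected void map(ImmutableBytesWritable row, Result value, Context ctx)
                    throws IOException, InterruptedException {
                if (value.containsColumn(Bytes.toBytes("p"), Bytes.toBytes("hasTitle"))) {
                    ctx.getCounter("triples", "withTitle").increment(1);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = new Job(conf, "title-counter");
            job.setJarByClass(TitleCounter.class);
            Scan scan = new Scan();
            scan.setCaching(500);            // larger batches for a full scan
            TableMapReduceUtil.initTableMapperJob(
                "triples_flat", scan, TitleMapper.class,
                NullWritable.class, LongWritable.class, job);
            job.setNumReduceTasks(0);        // map-only: counters carry the result
            job.setOutputFormatClass(NullOutputFormat.class);
            if (job.waitForCompletion(true)) {
                long n = job.getCounters().findCounter("triples", "withTitle").getValue();
                System.out.println("rows with a title: " + n);
            }
        }
    }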


Case study: author name disambiguation algorithm




The most complex algorithm that we have run over Apache HBase so far is the
author name disambiguation algorithm.




Thanks!




                      More information about CeON:
                      http://ceon.pl/en/research




    © 2012 Adam Kawa. This document is available under the Creative Commons Attribution 3.0 Poland license.
              The license text is available at: http://creativecommons.org/licenses/by/3.0/pl/




[1] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. Scalable
Semantic Web Data Management Using Vertical Partitioning. In VLDB, pages
411–422, 2007.

[2] C. Weiss, P. Karras, and A. Bernstein. Hexastore: Sextuple Indexing for
Semantic Web Data Management. In VLDB, pages 1008–1019, 2008.

[3] K. Rohloff and R. Schantz. High-Performance, Massively Scalable Distributed
Systems Using the MapReduce Software Framework: The SHARD Triple-Store. In
International Workshop on Programming Support Innovations for Emerging
Distributed Applications, 2010.

[4] V. Khadilkar, M. Kantarcioglu, P. Castagna, and B. Thuraisingham.
Jena-HBase: A Distributed, Scalable and Efficient RDF Triple Store. Technical
report, 2012. http://www.utdallas.edu/~vvk072000/Research/Jena-HBase-
Ext/tech-report.pdf

[5] N. Papailiou, I. Konstantinou, D. Tsoumakos, and N. Koziris. H2RDF:
Adaptive Query Processing on RDF Data in the Cloud. In Proceedings of the
21st International Conference on World Wide Web (WWW demo track),
Lyon, France, 2012.
[6] A. Schätzle, M. Przyjaciel-Zablocki, and G. Lausen. PigSPARQL: Mapping
SPARQL to Pig Latin. In 3rd International Workshop on Semantic Web
Information Management (SWIM 2011), in conjunction with the 2011 ACM
International Conference on Management of Data (SIGMOD 2011), Athens,
Greece.

[7] J. Huang, D. J. Abadi, and K. Ren. Scalable SPARQL Querying of Large RDF
Graphs. PVLDB, Volume 4 (VLDB 2011).

[8] K. Wilkinson, C. Sayers, H. Kuno, and D. Reynolds. Efficient RDF Storage
and Retrieval in Jena2. In SWDB, pages 131–150, 2003.

[9] K. Wilkinson. Jena Property Table Implementation. In SSWS, 2006.
