This document summarizes a presentation on using Apache Hadoop tools to analyze scholarly documents. It discusses storing metadata and text of scholarly documents and extracting knowledge from them. Requirements for scalable storage, parallel processing, and flexible data models are also outlined. Possible solutions for storing document relationship data as linked RDF triples in HBase and performing analytics using MapReduce, Pig, and Hive are presented.
1. Data model for analysis of scholarly documents in the MapReduce paradigm
Adam Kawa, Lukasz Bolikowski, Artur Czeczko, Piotr Jan Dendek, Dominika Tkaczyk
Centre for Open Science (CeON), ICM UW
Warsaw, July 6, 2012
2. Agenda
1 Problem definition
2 Requirements specification
3 Exemplary solutions based on Apache Hadoop Ecosystem tools
3. The data that we possess
Vast collections of scholarly documents to store
10 million full texts
(PDF, plain text)
17 million document metadata records
(described in the XML-based BWMeta format)
4 TB of data
(10 TB including data archives)
4. The tasks that we perform
Big knowledge to extract and discover
17 million document metadata records (XML)
contain title, subtitles, abstract, keywords, references, contributors and their affiliations, publishing journal, ...
input for many state-of-the-art machine learning algorithms
relatively simple ones: searching for documents with a given title, finding scientific teams, ...
quite complex ones: author name disambiguation, bibliometrics, classification code assignment, ...
5. The requirements that we have specified
Multiple demands regarding storage and processing of large amounts of data:
scalability and parallelism: easily handle tens of terabytes of data and parallelize the computation effectively
flexible data model: possibility to add or update data, and to enrich its content with implicit information discovered by our algorithms
latency requirements: support batch offline processing as well as random, real-time read/write requests
availability of many clients: accessible to programmers and researchers with diverse language preferences and expertise
reliability and cost-effectiveness: ideally open-source software that does not require expensive hardware
6. Document-related data as linked data
Information about document-related resources can be naturally described as a directed labeled graph
entities (e.g. documents, contributors, references) are nodes in the graph
relationships between entities are directed labeled edges in the graph
7. Linked graph as a collection of RDF triples
A directed labeled graph can be simply represented as a collection of RDF triples
a triple consists of a subject, a predicate and an object
a triple represents a statement denoting that a resource (subject) holds a value (object) for some attribute (predicate) of that resource
a triple can represent any statement about any resource
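To make this concrete, here is a minimal sketch (not from the original slides) that expresses two such statements about a document using the Apache Jena API; the URIs and the choice of Dublin Core predicates are illustrative assumptions.

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.DC;

public class TripleExample {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        // subject: the document; predicate: dc:title; object: a literal value
        Resource doc = model.createResource("http://example.org/doc/123");
        doc.addProperty(DC.title, "Data model for analysis of scholarly documents");
        // subject: the document; predicate: dc:creator; object: another resource
        doc.addProperty(DC.creator, model.createResource("http://example.org/person/42"));
        // print the graph as one triple per line
        model.write(System.out, "N-TRIPLE");
    }
}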
8. Hadoop as a solution for scalability/performance issues
Apache Hadoop is the most commonly used open-source solution for storing and processing big data in a reliable, high-performance and cost-effective way.
Scalable storage
Parallel processing
Subprojects and many Hadoop-related projects
HDFS: a distributed file system that provides high-throughput access to large data
MapReduce: a framework for distributed processing of large data sets (Java natively; e.g. JavaScript, Python, Perl, Ruby via Streaming)
HBase: a scalable, distributed data store with a flexible schema, random read/write access and fast scans
Pig/Hive: higher-level abstractions on top of MapReduce (simple data manipulation languages)
9. Apache Hadoop Ecosystem tools as RDF triple stores
SHARD [3]: a Hadoop-backed RDF triple store
stores triples in flat files in HDFS
data cannot be modified randomly
less efficient for queries that require the inspection of only a small number of triples
PigSPARQL [6]: translates SPARQL queries to Pig Latin programs and runs them on a Hadoop cluster
stores RDF triples with the same predicate in separate flat files in HDFS
H2RDF [5]: an RDF store that combines MapReduce with HBase
stores triples in HBase using three flat-wide tables
Jena-HBase [4]: an HBase-backed RDF triple store
provides six different pluggable HBase storage layouts
10. HBase as a storage layer for RDF triples
Storing RDF triples in Apache HBase has several advantages
flexible data model: columns can be dynamically added and removed; multiple versions of data in a particular cell; data serialized to a byte array
random read and write: more suitable for semi-structured RDF data than HDFS, where files cannot be modified randomly and usually the whole file must be read sequentially to find a subset of records
availability of many clients
interactive clients: native Java API, REST or Apache Thrift
batch clients: MapReduce (Java), Pig (Pig Latin) and Hive (HiveQL)
automatically sorted records: quick lookups and partial scans; joins as fast (linear) merge-joins (see the sketch below)
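As a minimal sketch of that last point (not from the original slides; 2012-era HBase Java API, with a hypothetical table whose row keys start with the subject URI), a partial scan over the sorted keys fetches all triples of one subject:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SubjectScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "triples");  // hypothetical table name
        // rows are kept sorted by key, so all rows sharing the subject prefix
        // form one contiguous range bounded by a stop key just past the prefix
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("http://example.org/doc/123"));
        scan.setStopRow(Bytes.toBytes("http://example.org/doc/123\uFFFF"));
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result row : scanner) {
                System.out.println(Bytes.toString(row.getRow()));
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}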
11. Exemplary HBase schema - Flat-wide layout
Advantages
no prior knowledge about the data is required
colocation of all information about a resource within a single row
support for multi-valued properties
support for reified statements (statements about statements)
Disadvantages
a potentially unbounded number of columns per row
increased storage space
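A minimal sketch of a write under this layout (not from the original slides; the table and column-family names are hypothetical): the subject is the row key, the predicate is the column qualifier, and the object is the cell value.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class FlatWidePut {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "triples_flat");  // hypothetical table
        // row key = subject; all of the subject's triples live in this one row
        Put put = new Put(Bytes.toBytes("http://example.org/doc/123"));
        // family "p" holds predicates: qualifier = predicate, value = object
        put.add(Bytes.toBytes("p"), Bytes.toBytes("dc:title"),
                Bytes.toBytes("Data model for analysis of scholarly documents"));
        put.add(Bytes.toBytes("p"), Bytes.toBytes("dc:creator"),
                Bytes.toBytes("http://example.org/person/42"));
        table.put(put);
        table.close();
    }
}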
12. Exemplary HBase schema - Vertically Partitioned layout [1]
Advantages
support for multi-valued properties
support for reified statements (statements about statements)
storage space savings compared to the previous layout
first-step (predicate-bound) pairwise joins as fast merge-joins
Disadvantages
an increased number of joins
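For comparison, a minimal sketch of the same write under this layout (not from the original slides; the per-predicate table and family names are hypothetical): each predicate gets its own table keyed by subject, so predicate-bound lookups scan one small sorted table.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class VerticallyPartitionedPut {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // one table per predicate, e.g. a table holding only dc:creator triples
        HTable creatorTable = new HTable(conf, "p_dc_creator");  // hypothetical name
        Put put = new Put(Bytes.toBytes("http://example.org/doc/123"));  // subject
        // a single "object" column; multi-valued properties become cell versions
        put.add(Bytes.toBytes("o"), Bytes.toBytes("value"),
                Bytes.toBytes("http://example.org/person/42"));
        creatorTable.put(put);
        creatorTable.close();
    }
}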
13. Exemplary HBase schema - Hexastore layout [2]
Advantages
support for multi-valued properties
support for reified statements (statements about statements)
first-step pairwise joins as fast merge-joins
Disadvantages
an increased number of joins
increased storage space
more complicated update operations
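A minimal sketch of the key design (not from the original slides; the separator and key format are assumptions): every triple is written under all six orderings of subject, predicate and object, so any bound combination maps to a sorted prefix scan, at the price of six writes per update and roughly sixfold storage.

public class HexastoreKeys {
    private static final char SEP = '\u0001';  // hypothetical component separator

    // the six row keys under which one (s, p, o) triple would be stored
    static String[] rowKeys(String s, String p, String o) {
        return new String[] {
            "spo" + SEP + s + SEP + p + SEP + o,
            "sop" + SEP + s + SEP + o + SEP + p,
            "pso" + SEP + p + SEP + s + SEP + o,
            "pos" + SEP + p + SEP + o + SEP + s,
            "osp" + SEP + o + SEP + s + SEP + p,
            "ops" + SEP + o + SEP + p + SEP + s,
        };
    }

    public static void main(String[] args) {
        for (String key : rowKeys("doc:123", "dc:creator", "person:42")) {
            System.out.println(key.replace(SEP, '|'));  // '|' only for readability
        }
    }
}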
14. HBase schema - other layouts
Some derivative and hybrid layouts exist that combine the advantages of the original layouts
a combination of the vertically partitioned and the hexastore layouts [4]
a combination of the flat-wide and the vertically partitioned layouts [4]
15. Challenges
a large number of join operations
relatively expensive
and practically unavoidable (at least for more complex queries)
but specialized join techniques can be used, e.g. multi-join, merge-sort join, replicated join, skewed join
lack of native support for cross-row atomicity (e.g. in the form of transactions; see the sketch below)
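What HBase does guarantee is atomicity within a single row; a minimal sketch (not from the original slides; table and schema are hypothetical) using an optimistic check-and-put, which cannot be extended across several rows or tables:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SingleRowAtomicity {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "triples_flat");  // hypothetical table
        byte[] row = Bytes.toBytes("http://example.org/doc/123");
        Put put = new Put(row);
        put.add(Bytes.toBytes("p"), Bytes.toBytes("dc:title"),
                Bytes.toBytes("Data model for analysis of scholarly documents"));
        // atomic within this one row only: write the title iff none exists yet
        boolean applied = table.checkAndPut(row,
                Bytes.toBytes("p"), Bytes.toBytes("dc:title"),
                null,   // expected current value: the column is absent
                put);
        System.out.println("applied = " + applied);
        table.close();
    }
}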
16. Possible performance optimization techniques
property tables: properties often queried together are stored in the same record for quick access [8, 9]
materialized path expressions: precalculation and materialization of the most commonly used paths through an RDF graph [1, 2]
graph-oriented partitioning scheme [7]
takes advantage of the spatial locality inherent in graph pattern matching
higher replication of data that lies on the border of any particular partition (however, problematic for a graph that is modified)
17. The ways of processing data from HBase
Various tools are integrated with HBase and can read data from and write data to HBase tables
Java MapReduce (see the sketch after this list)
possibility to reuse our legacy Java code in map and reduce methods
delivers better performance than Apache Pig
Apache Pig
provides common data operations (e.g. filters, unions, joins, ordering) and nested types (e.g. tuples, bags, maps)
supports multiple specialized join implementations
possibility to run MapReduce jobs directly from Pig Latin scripts
can be embedded in Python code
Interactive clients (e.g. Java API, REST or Apache Thrift)
interactive access to a relatively small subset of our data by sending API calls on demand, e.g. from a web-based client
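As an example of the batch path, here is a minimal Java MapReduce sketch (not from the original slides; the table, family and output path are hypothetical) that counts triples per predicate by scanning an HBase table:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;

public class PredicateCount {
    // maps every cell in family "p" to (predicate, 1)
    static class PredicateMapper extends TableMapper<Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context ctx)
                throws IOException, InterruptedException {
            for (KeyValue kv : value.raw()) {
                ctx.write(new Text(Bytes.toString(kv.getQualifier())), ONE);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "predicate-count");
        job.setJarByClass(PredicateCount.class);
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("p"));  // hypothetical predicate family
        TableMapReduceUtil.initTableMapperJob("triples_flat", scan,
                PredicateMapper.class, Text.class, LongWritable.class, job);
        job.setReducerClass(LongSumReducer.class);  // sums the 1s per predicate
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}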
18. Case study: author name disambiguation algorithm
The most complex algorithm that we have run over Apache HBase so far is the author name disambiguation algorithm.
19. Thanks!
More information about CeON:
http://ceon.pl/en/research
© 2012 Adam Kawa. This document is available under the Creative Commons Attribution 3.0 Poland license.
The license text is available at: http://creativecommons.org/licenses/by/3.0/pl/
20. References
[1] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. Scalable Semantic Web Data Management Using Vertical Partitioning. In VLDB, pages 411-422, 2007.
[2] C. Weiss, P. Karras, and A. Bernstein. Hexastore: Sextuple Indexing for Semantic Web Data Management. In VLDB, pages 1008-1019, 2008.
[3] K. Rohloff and R. Schantz. High-Performance, Massively Scalable Distributed Systems Using the MapReduce Software Framework: The SHARD Triple-Store. In International Workshop on Programming Support Innovations for Emerging Distributed Applications, 2010.
[4] V. Khadilkar, M. Kantarcioglu, P. Castagna, and B. Thuraisingham. Jena-HBase: A Distributed, Scalable and Efficient RDF Triple Store. Technical report, 2012. http://www.utdallas.edu/~vvk072000/Research/Jena-HBase-Ext/tech-report.pdf
[5] N. Papailiou, I. Konstantinou, D. Tsoumakos, and N. Koziris. H2RDF: Adaptive Query Processing on RDF Data in the Cloud. In Proceedings of the 21st International Conference on World Wide Web (WWW demo track), Lyon, France, 2012.
21. References (continued)
[6] A. Schätzle, M. Przyjaciel-Zablocki, and G. Lausen. PigSPARQL: Mapping SPARQL to Pig Latin. In 3rd International Workshop on Semantic Web Information Management (SWIM 2011), in conjunction with the 2011 ACM International Conference on Management of Data (SIGMOD 2011), Athens, Greece.
[7] J. Huang, D. Abadi, and K. Ren. Scalable SPARQL Querying of Large RDF Graphs. In VLDB Endowment, Volume 4 (VLDB 2011).
[8] K. Wilkinson, C. Sayers, H. Kuno, and D. Reynolds. Efficient RDF Storage and Retrieval in Jena2. In SWDB, pages 131-150, 2003.
[9] K. Wilkinson. Jena Property Table Implementation. In SSWS, 2006.