This document summarizes a presentation on using Apache Hadoop tools to analyze scholarly documents. It discusses storing metadata and text of scholarly documents and extracting knowledge from them. Requirements for scalable storage, parallel processing, and flexible data models are also outlined. Possible solutions for storing document relationship data as linked RDF triples in HBase and performing analytics using MapReduce, Pig, and Hive are presented.
1. Data model for analysis of scholarly documents in the MapReduce paradigm
Adam Kawa, Lukasz Bolikowski, Artur Czeczko, Piotr Jan Dendek, Dominika Tkaczyk
Centre for Open Science (CeON), ICM UW
Warsaw, July 6, 2012
2. Agenda
1 Problem definition
2 Requirements specification
3 Exemplary solutions based on Apache Hadoop Ecosystem tools
3. The data that we possess
Vast collections of scholarly documents to store
10 million full texts
(PDF, plain text)
17 million document metadata records
(described in the XML-based BWMeta format)
4 TB of data
(10 TB including data archives)
4. The tasks that we perform
Big knowledge to extract and discover
17 million document metadata records (XML)
contain title, subtitles, abstract, keywords, references, contributors and their affiliations, publishing journal, ...
input for many state-of-the-art machine learning algorithms
relatively simple ones: searching for documents with a given title, finding scientific teams, ...
quite complex ones: author name disambiguation, bibliometrics, classification code assignment, ...
5. The requirements that we have specified
Multiple demands regarding storage and processing of large amounts of data:
scalability and parallelism: easily handle tens of terabytes of data and parallelize the computation effectively
flexible data model: possibility to add or update data, and to enrich its content with implicit information discovered by our algorithms
latency requirements: support batch offline processing as well as random, real-time read/write requests
availability of many clients: accessible to programmers and researchers with diverse language preferences and expertise
reliability and cost-effectiveness: ideally open-source software that does not require expensive hardware
6. Document-related data as linked data
Information about document-related resources can be naturally described as a directed labeled graph
entities (e.g. documents, contributors, references) are nodes in the graph
relationships between entities are directed labeled edges in the graph
7. Linked graph as a collection of RDF triples
A directed labeled graph can be simply represented as a collection of RDF triples
a triple consists of a subject, a predicate and an object
a triple represents a statement denoting that a resource (subject) holds a value (object) for some attribute (predicate) of that resource
a triple can represent any statement about any resource
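To make this concrete, here is a minimal sketch (not from the original slides) that expresses two such statements about a document using the Apache Jena API; the URIs and the choice of Dublin Core predicates are illustrative assumptions.

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.DC;

public class TripleExample {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        // subject: the document; predicate: dc:title; object: a literal value
        Resource doc = model.createResource("http://example.org/doc/123");
        doc.addProperty(DC.title, "Data model for analysis of scholarly documents");
        // subject: the document; predicate: dc:creator; object: another resource
        doc.addProperty(DC.creator, model.createResource("http://example.org/person/42"));
        // print the graph as one triple per line
        model.write(System.out, "N-TRIPLE");
    }
}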
8. Hadoop as a solution for scalability/performance issues
Apache Hadoop is the most commonly used open-source solution for storing and processing big data in a reliable, high-performance and cost-effective way.
Scalable storage
Parallel processing
Subprojects and many Hadoop-related projects
HDFS: a distributed file system that provides high-throughput access to large data
MapReduce: a framework for distributed processing of large data sets (Java natively; e.g. JavaScript, Python, Perl, Ruby via Streaming)
HBase: a scalable, distributed data store with a flexible schema, random read/write access and fast scans
Pig/Hive: higher-level abstractions on top of MapReduce (simple data manipulation languages)
9. Apache Hadoop Ecosystem tools as RDF triple stores
SHARD [3]: a Hadoop-backed RDF triple store
stores triples in flat files in HDFS
data cannot be modified randomly
less efficient for queries that require the inspection of only a small number of triples
PigSPARQL [6]: translates SPARQL queries to Pig Latin programs and runs them on a Hadoop cluster
stores RDF triples with the same predicate in separate flat files in HDFS
H2RDF [5]: an RDF store that combines MapReduce with HBase
stores triples in HBase using three flat-wide tables
Jena-HBase [4]: an HBase-backed RDF triple store
provides six different pluggable HBase storage layouts
10. HBase as a storage layer for RDF triples
Storing RDF triples in Apache HBase has several advantages
flexible data model: columns can be dynamically added and removed; multiple versions of data in a particular cell; data serialized to a byte array
random read and write: more suitable for semi-structured RDF data than HDFS, where files cannot be modified randomly and usually the whole file must be read sequentially to find a subset of records
availability of many clients
interactive clients: native Java API, REST or Apache Thrift
batch clients: MapReduce (Java), Pig (Pig Latin) and Hive (HiveQL)
automatically sorted records: quick lookups and partial scans; joins as fast (linear) merge-joins (see the sketch below)
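As a minimal sketch of that last point (not from the original slides; 2012-era HBase Java API, with a hypothetical table whose row keys start with the subject URI), a partial scan over the sorted keys fetches all triples of one subject:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SubjectScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "triples");  // hypothetical table name
        // rows are kept sorted by key, so all rows sharing the subject prefix
        // form one contiguous range bounded by a stop key just past the prefix
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("http://example.org/doc/123"));
        scan.setStopRow(Bytes.toBytes("http://example.org/doc/123\uFFFF"));
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result row : scanner) {
                System.out.println(Bytes.toString(row.getRow()));
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}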
11. Exemplary HBase schema - Flat-wide layout
Advantages
no prior knowledge about the data is required
colocation of all information about a resource within a single row
support for multi-valued properties
support for reified statements (statements about statements)
Disadvantages
a potentially unbounded number of columns per row
increased storage space
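A minimal sketch of a write under this layout (not from the original slides; the table and column-family names are hypothetical): the subject is the row key, the predicate is the column qualifier, and the object is the cell value.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class FlatWidePut {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "triples_flat");  // hypothetical table
        // row key = subject; all of the subject's triples live in this one row
        Put put = new Put(Bytes.toBytes("http://example.org/doc/123"));
        // family "p" holds predicates: qualifier = predicate, value = object
        put.add(Bytes.toBytes("p"), Bytes.toBytes("dc:title"),
                Bytes.toBytes("Data model for analysis of scholarly documents"));
        put.add(Bytes.toBytes("p"), Bytes.toBytes("dc:creator"),
                Bytes.toBytes("http://example.org/person/42"));
        table.put(put);
        table.close();
    }
}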
12. Exemplary HBase schema - Vertically Partitioned layout [1]
Advantages
support for multi-valued properties
support for reified statements (statements about statements)
storage space savings compared to the previous layout
first-step (predicate-bound) pairwise joins as fast merge-joins
Disadvantages
an increased number of joins
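For comparison, a minimal sketch of the same write under this layout (not from the original slides; the per-predicate table and family names are hypothetical): each predicate gets its own table keyed by subject, so predicate-bound lookups scan one small sorted table.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class VerticallyPartitionedPut {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // one table per predicate, e.g. a table holding only dc:creator triples
        HTable creatorTable = new HTable(conf, "p_dc_creator");  // hypothetical name
        Put put = new Put(Bytes.toBytes("http://example.org/doc/123"));  // subject
        // a single "object" column; multi-valued properties become cell versions
        put.add(Bytes.toBytes("o"), Bytes.toBytes("value"),
                Bytes.toBytes("http://example.org/person/42"));
        creatorTable.put(put);
        creatorTable.close();
    }
}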
13. Exemplary HBase schema - Hexastore layout [2]
Advantages
support for multi-valued properties
support for reified statements (statements about statements)
first-step pairwise joins as fast merge-joins
Disadvantages
an increased number of joins
increased storage space
more complicated update operations
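A minimal sketch of the key design (not from the original slides; the separator and key format are assumptions): every triple is written under all six orderings of subject, predicate and object, so any bound combination maps to a sorted prefix scan, at the price of six writes per update and roughly sixfold storage.

public class HexastoreKeys {
    private static final char SEP = '\u0001';  // hypothetical component separator

    // the six row keys under which one (s, p, o) triple would be stored
    static String[] rowKeys(String s, String p, String o) {
        return new String[] {
            "spo" + SEP + s + SEP + p + SEP + o,
            "sop" + SEP + s + SEP + o + SEP + p,
            "pso" + SEP + p + SEP + s + SEP + o,
            "pos" + SEP + p + SEP + o + SEP + s,
            "osp" + SEP + o + SEP + s + SEP + p,
            "ops" + SEP + o + SEP + p + SEP + s,
        };
    }

    public static void main(String[] args) {
        for (String key : rowKeys("doc:123", "dc:creator", "person:42")) {
            System.out.println(key.replace(SEP, '|'));  // '|' only for readability
        }
    }
}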
14. HBase schema - other layouts
Some derivative and hybrid layouts exist that combine the advantages of the original layouts
a combination of the vertically partitioned and the hexastore layouts [4]
a combination of the flat-wide and the vertically partitioned layouts [4]
15. Challenges
a large number of join operations
relatively expensive
and practically unavoidable (at least for more complex queries)
but specialized join techniques can be used, e.g. multi-join, merge-sort join, replicated join, skewed join
lack of native support for cross-row atomicity (e.g. in the form of transactions; see the sketch below)
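What HBase does guarantee is atomicity within a single row; a minimal sketch (not from the original slides; table and schema are hypothetical) using an optimistic check-and-put, which cannot be extended across several rows or tables:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SingleRowAtomicity {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "triples_flat");  // hypothetical table
        byte[] row = Bytes.toBytes("http://example.org/doc/123");
        Put put = new Put(row);
        put.add(Bytes.toBytes("p"), Bytes.toBytes("dc:title"),
                Bytes.toBytes("Data model for analysis of scholarly documents"));
        // atomic within this one row only: write the title iff none exists yet
        boolean applied = table.checkAndPut(row,
                Bytes.toBytes("p"), Bytes.toBytes("dc:title"),
                null,   // expected current value: the column is absent
                put);
        System.out.println("applied = " + applied);
        table.close();
    }
}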
16. Possible performance optimization techniques
property tables: properties often queried together are stored in the same record for quick access [8, 9]
materialized path expressions: precalculation and materialization of the most commonly used paths through an RDF graph [1, 2]
graph-oriented partitioning scheme [7]
takes advantage of the spatial locality inherent in graph pattern matching
higher replication of data that lies on the border of any particular partition (however, problematic for a graph that is modified)
17. The ways of processing data from HBase
Various tools are integrated with HBase and can read data from and write data to HBase tables
Java MapReduce (see the sketch after this list)
possibility to reuse our legacy Java code in map and reduce methods
delivers better performance than Apache Pig
Apache Pig
provides common data operations (e.g. filters, unions, joins, ordering) and nested types (e.g. tuples, bags, maps)
supports multiple specialized join implementations
possibility to run MapReduce jobs directly from Pig Latin scripts
can be embedded in Python code
Interactive clients (e.g. Java API, REST or Apache Thrift)
interactive access to a relatively small subset of our data by sending API calls on demand, e.g. from a web-based client
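As an example of the batch path, here is a minimal Java MapReduce sketch (not from the original slides; the table, family and output path are hypothetical) that counts triples per predicate by scanning an HBase table:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;

public class PredicateCount {
    // maps every cell in family "p" to (predicate, 1)
    static class PredicateMapper extends TableMapper<Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context ctx)
                throws IOException, InterruptedException {
            for (KeyValue kv : value.raw()) {
                ctx.write(new Text(Bytes.toString(kv.getQualifier())), ONE);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "predicate-count");
        job.setJarByClass(PredicateCount.class);
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("p"));  // hypothetical predicate family
        TableMapReduceUtil.initTableMapperJob("triples_flat", scan,
                PredicateMapper.class, Text.class, LongWritable.class, job);
        job.setReducerClass(LongSumReducer.class);  // sums the 1s per predicate
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}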
18. Case study: author name disambiguation algorithm
The most complex algorithm that we have run over Apache HBase so far is the author name disambiguation algorithm.
19. Thanks!
More information about CeON:
http://ceon.pl/en/research
© 2012 Adam Kawa. This document is available under the Creative Commons Attribution 3.0 Poland license.
The license text is available at: http://creativecommons.org/licenses/by/3.0/pl/
20. References
[1] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. Scalable Semantic Web Data Management Using Vertical Partitioning. In VLDB, pages 411-422, 2007.
[2] C. Weiss, P. Karras, and A. Bernstein. Hexastore: Sextuple Indexing for Semantic Web Data Management. In VLDB, pages 1008-1019, 2008.
[3] K. Rohloff and R. Schantz. High-Performance, Massively Scalable Distributed Systems Using the MapReduce Software Framework: The SHARD Triple-Store. In International Workshop on Programming Support Innovations for Emerging Distributed Applications, 2010.
[4] V. Khadilkar, M. Kantarcioglu, P. Castagna, and B. Thuraisingham. Jena-HBase: A Distributed, Scalable and Efficient RDF Triple Store. Technical report, 2012. http://www.utdallas.edu/~vvk072000/Research/Jena-HBase-Ext/tech-report.pdf
[5] N. Papailiou, I. Konstantinou, D. Tsoumakos, and N. Koziris. H2RDF: Adaptive Query Processing on RDF Data in the Cloud. In Proceedings of the 21st International Conference on World Wide Web (WWW demo track), Lyon, France, 2012.
21. References (continued)
[6] A. Schätzle, M. Przyjaciel-Zablocki, and G. Lausen. PigSPARQL: Mapping SPARQL to Pig Latin. In 3rd International Workshop on Semantic Web Information Management (SWIM 2011), in conjunction with the 2011 ACM International Conference on Management of Data (SIGMOD 2011), Athens, Greece.
[7] J. Huang, D. Abadi, and K. Ren. Scalable SPARQL Querying of Large RDF Graphs. In VLDB Endowment, Volume 4 (VLDB 2011).
[8] K. Wilkinson, C. Sayers, H. Kuno, and D. Reynolds. Efficient RDF Storage and Retrieval in Jena2. In SWDB, pages 131-150, 2003.
[9] K. Wilkinson. Jena Property Table Implementation. In SSWS, 2006.