Part of the Semantic Web, Ontologies and the Cloud class at The University of Texas at Austin's Computer Science department during the Spring 2010 term.
Review: Scalable Semantic Web Data Management Using Vertical Partitioning
1. Abadi, Marcus, Madden, Hollenbach
VLDB 2007
Presented by: {Gui}llermo Cabrera
The University of Texas at Austin
2. Problem
Storage Goal
RDBMS use
RDF Physical Organization
Column store vs. Row Store
Materialized Path Expressions
Experiment & Results
Discussion
3. Performance: Self-joins
Many triples
4. Achieve scalability & performance in triple storage
Survey approaches in RDBMS
Benefits of vertical partition and column store
5. 1 table with 3 indexed columns?
Multi layer architecture
Translate -> Optimize -> Execute
Mapping tables for long URI and literals
Jena, Oracle, Sesame, 3store (Hyunjun),
Hexastore (Donghyuk)
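The "mapping tables for long URI and literals" idea can be sketched as a small dictionary encoder. This is a minimal illustration, not the scheme of any listed system: the `Dictionary` class, `encode_triples` helper, and the sample URIs are all hypothetical names chosen for the example.

```python
# Hypothetical sketch: replace long URIs and literals with small integer ids,
# so the triples table holds three compact integers per row.
class Dictionary:
    def __init__(self):
        self.id_of = {}      # string -> integer id
        self.string_of = []  # integer id -> string

    def encode(self, value):
        # Assign the next free id the first time a string is seen.
        if value not in self.id_of:
            self.id_of[value] = len(self.string_of)
            self.string_of.append(value)
        return self.id_of[value]

    def decode(self, ident):
        return self.string_of[ident]


def encode_triples(triples, dictionary):
    # Each (subject, property, object) becomes a tuple of three ids.
    return [tuple(dictionary.encode(term) for term in t) for t in triples]


# Made-up sample data for illustration only.
raw_triples = [
    ("http://example.org/book1", "http://example.org/title", "XYZ"),
    ("http://example.org/book1", "http://example.org/author", "Fox"),
    ("http://example.org/book2", "http://example.org/title", "ABC"),
]
mapping = Dictionary()
encoded_triples = encode_triples(raw_triples, mapping)
```

Queries would then run against `encoded_triples`, and ids are turned back into strings with `mapping.decode` only for the final result.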
11. Vertical Partition
n two-column tables, n = # of unique properties
Table sorted by subject
Merge join
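The layout above can be sketched in a few lines: one two-column (subject, object) table per property, each sorted by subject, joined with a linear merge join. This is a simplified in-memory illustration; the function names and the sample triples are assumptions, not taken from the paper.

```python
from collections import defaultdict


def vertical_partition(triples):
    """Split (s, p, o) triples into one (subject, object) table per property,
    with each table sorted by subject."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return {p: sorted(rows) for p, rows in tables.items()}


def merge_join(left, right):
    """Subject-subject merge join of two subject-sorted (s, o) tables."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        ls, rs = left[i][0], right[j][0]
        if ls < rs:
            i += 1
        elif ls > rs:
            j += 1
        else:
            # Find the run of rows sharing this subject on each side
            # (multi-valued properties produce runs longer than one).
            i_end = i
            while i_end < len(left) and left[i_end][0] == ls:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == ls:
                j_end += 1
            for _, left_obj in left[i:i_end]:
                for _, right_obj in right[j:j_end]:
                    result.append((ls, left_obj, right_obj))
            i, j = i_end, j_end
    return result


# Made-up sample data for illustration only.
triples = [
    ("book1", "title", "XYZ"),
    ("book1", "author", "Fox"),
    ("book2", "title", "ABC"),
    ("book1", "author", "Lee"),
    ("book3", "author", "Orr"),
]
tables = vertical_partition(triples)
joined = merge_join(tables["title"], tables["author"])
```

Because both tables are already sorted by subject, the join needs no extra sort step and reads each table once.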
13. Advantage
Multi valued attributes supported
No clustering algorithm (Property tables)
Only accessed properties are read
Disadvantage
Use of multiple properties (table joins)
Inserts expensive
14. Triple Store
Property Table
Vertical Partition (Row Store)
Vertical Partition Store (Column Store)
15. Why?
Projection is free
Tuple headers (metadata on row)
35 bytes in Postgres vs. 8 bytes in C-Store
Column oriented compression
Run-length encoding (ex. 1,1,1,2,2 -> 1x3, 2x2)
Optimized merge join
Prefetching
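The run-length encoding example on this slide can be reproduced with a short sketch. It only illustrates the idea on a Python list and is not C-Store's actual implementation; `rle_encode` and `rle_decode` are hypothetical helper names.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


column = [1, 1, 1, 2, 2]
encoded = rle_encode(column)   # [(1, 3), (2, 2)], i.e. 1x3, 2x2
decoded = rle_decode(encoded)
```

Sorted columns, such as the subject column of a vertically partitioned table, tend to contain long runs, which is where this encoding pays off.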
25. Great for reads, writes not considered
What about load times?
Using another benchmark (ex. LUBM)?
Native XML databases for RDF/XML?
Test triple store in Sesame
Editor's Notes
RDF as a series of triples (S, P, O). Performance: self-joins, low speed (# triples > memory). Need to manage a large number of triples. Billion Triple Challenge (semanticweb.org).
Self-joins become more PROBLEMATIC the LESS selective the predicates are. Mapping table: 1 clustered (identifiers) and 1 unclustered index.
Jena2 was the first to propose this. The basic idea is to cluster properties that tend to be DEFINED together (type, title, and copyright date). Also, LEFT-OVER triples. Why fewer joins? Self-joins on the subject column can be eliminated. Tradeoff: narrow tables = less sparse = more tables used; wide tables = more space = fewer joins.
A property may exist in MULTIPLE property-class tables. Good for reified statements.
Exploit the Type property. Reified statements.
Object Relational Bag structure
Tuple header dominates the size of the actual data in the resulting table.
Multi-valued subjects as multiple rows. No clustering algorithm.
Postgres has a 27-byte tuple header; compare 8 bytes to 35 bytes. Merge join uses prefetching to avoid seeks between columns.
Why? Row store has too much overhead on vertical partition.
For VP, not merge joins. PRECALCULATE these expressions as a 2-column table. Good: inference queries (of the form: x part of y, y part of z, then x part of z). Bad: many tables.
Convert from RDF/XML to triples using REDLAND. 50 million triples, 221 unique properties, multivalued.
Average of 3 runs of the queries. VP and PT are a factor of 2-3 faster than triple store. C-Store is 32 times faster than triple store. Q1: PT and VP identical because of the use of idealized property tables. Q2: Avoids subject-subject joins. Q3: Multiple sequential scans. Q4: High selectivity. Q5:
Involves all triples of property TYPE and a count of object values. No join for triple store. PT and VP have the same schema. {Type: subject, object}
1 million to 50 million, run only query 6. Scales linearly except triple store. All joins for this query are linear for vertical partitioning. Triple store sorts the intermediate results after performing the three selections and before performing the merge join.
For PT, add a new column with the MPE. For VP, add a table containing a subject column and a Records:Type object column.
What is purpose of test???
LUBM: universities, departments, students, etc. 15 MILLION triples.
Display list of PROPERTIES defined for resources of "Type -> Text". Multiple sequential scans.