This document presents SPLENDID, a system for federated querying across linked data sources. It uses Vocabulary of Interlinked Datasets (VoiD) descriptions to select relevant sources and optimize query planning and execution. The system applies techniques from distributed database systems to federated SPARQL querying, including dynamic programming for join ordering and statistics-based cost estimation. An evaluation using the FedBench suite found it efficiently selects sources and executes queries, outperforming state-of-the-art federated querying systems by leveraging VoiD descriptions and statistics. Future work includes integrating it with other systems and improving its cost models.
1. Institute for Web Science and Technologies
University of Koblenz-Landau, Germany
SPLENDID: SPARQL Endpoint Federation
Exploiting VOID Descriptions
Olaf Görlitz, Steffen Staab
2. Motivation
How to access a large number of linked data sources?
WeST Institute, People and Knowledge Networks
Olaf Görlitz, COLD 2011, Bonn, Germany
3. Data Integration Approaches
Data Warehouse
+ Efficient query execution
+ Complete results
- Data copies
- Inflexible
Link Traversal
+ Live data access
+ Flexible / on demand
- Incomplete results
- Biased by starting point
4. Our Approach
Data Federation
Live data access
Flexible source integration
Effective query planning
Complete results
Hypothesis:
Efficient query federation is possible using core Semantic
Web technology (i.e. SPARQL endpoints, VoiD descriptions)
5. VoiD: "Vocabulary of Interlinked Datasets"
General information
Basic statistics: triples = 732744
Type statistics: chebi:Compound = 50477
Predicate statistics: bio:formula = 39555
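The statistics above can be held in a small per-dataset record and used for a first, index-based source lookup. The following is a minimal sketch, not SPLENDID's implementation: the class `VoidStats`, the function `candidate_sources`, the dataset labels, the endpoint URLs, and the second dataset with its counts are all hypothetical. Only the three numbers for the first dataset come from the slide, and labelling that dataset "chebi" is an assumption based on the `chebi:` prefix.

```python
from dataclasses import dataclass, field

RDF_TYPE = "rdf:type"


@dataclass
class VoidStats:
    """Per-dataset statistics of the kind a VoiD description carries."""
    endpoint: str                                     # hypothetical endpoint URL
    triples: int                                      # total number of triples
    type_counts: dict = field(default_factory=dict)   # class -> instance count
    pred_counts: dict = field(default_factory=dict)   # predicate -> triple count


def candidate_sources(catalog, predicate, obj=None):
    """Return the datasets whose statistics mention the triple pattern.

    rdf:type patterns with a bound class are matched against the type
    statistics; all other patterns are matched by predicate only.
    """
    hits = []
    for name, stats in catalog.items():
        if predicate == RDF_TYPE and obj is not None:
            if obj in stats.type_counts:
                hits.append(name)
        elif predicate in stats.pred_counts:
            hits.append(name)
    return sorted(hits)


catalog = {
    # Numbers taken from the slide; the dataset label is an assumption.
    "chebi": VoidStats(
        endpoint="http://example.org/chebi/sparql",
        triples=732744,
        type_counts={"chebi:Compound": 50477},
        pred_counts={"bio:formula": 39555},
    ),
    # Entirely made-up second dataset, for contrast only.
    "other": VoidStats(
        endpoint="http://example.org/other/sparql",
        triples=1000,
        pred_counts={"purl:title": 1000},
    ),
}

by_predicate = candidate_sources(catalog, "bio:formula")
by_type = candidate_sources(catalog, RDF_TYPE, "chebi:Compound")
unknown = candidate_sources(catalog, "drugbank:drugCategory")
print(by_predicate, by_type, unknown)
```

A pattern whose predicate appears in no description yields an empty candidate list, which is one reason missing or stale VoiD descriptions matter for this approach.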
6. Distributed Query Processing
Contribution:
Apply Best Practices of RDBMS for RDF Federation
http://code.google.com/p/rdffederator/
7. Query Example
Which drugs are categorized as micronutrients?
SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}
8. Query Processing
Source Selection -> Join Optimization -> Query Execution
SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}
10. Query Processing
Source Selection -> Join Optimization -> Query Execution
Step 2: Refinement with ASK Queries
SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}
No index for subject / object values
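Because the statistics hold no index for subject or object values, a pattern with a bound subject or object can be checked against each candidate endpoint with a SPARQL ASK query. The sketch below illustrates that refinement step under stated assumptions: `needs_ask`, `ask_query`, `refine`, and the `fake_ask` stand-in for a real endpoint call are hypothetical names, the candidate lists are invented, and the rule that `rdf:type` objects are left to the type statistics is an assumption of this sketch.

```python
def is_var(term):
    """A term is a variable if it starts with '?'."""
    return term.startswith("?")


def needs_ask(pattern):
    """True if the pattern has a bound subject or object not covered by statistics."""
    s, p, o = pattern
    if not is_var(s):
        return True
    # Assumption: bound classes of rdf:type patterns are covered by type statistics.
    return not is_var(o) and p != "rdf:type"


def ask_query(pattern):
    """Build the ASK query text (PREFIX declarations omitted for brevity)."""
    s, p, o = pattern
    return f"ASK {{ {s} {p} {o} }}"


def refine(pattern, candidates, ask):
    """Keep only candidates that confirm the pattern; ask(source, query) -> bool."""
    if not needs_ask(pattern):
        return list(candidates)
    query = ask_query(pattern)
    return [src for src in candidates if ask(src, query)]


def fake_ask(source, query):
    """Stand-in for sending the ASK query to an endpoint; only 'drugbank' says yes."""
    return source == "drugbank"


category_pattern = ("?drug", "drugbank:drugCategory", "category:micronutrient")
title_pattern = ("?keggDrug", "purl:title", "?title")

refined_category = refine(category_pattern, ["dbpedia", "drugbank", "kegg"], fake_ask)
refined_title = refine(title_pattern, ["Chebi", "dbpedia", "kegg"], fake_ask)
print(ask_query(category_pattern))
print(refined_category, refined_title)
```

Patterns with only variables in subject and object position are passed through unchanged, so the extra round trips are spent only where the statistics cannot decide.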
11. Query Processing
Source Selection -> Join Optimization -> Query Execution
Step 3: Grouping Triple Patterns
SELECT ?drug ?title WHERE {
  # group: drugbank
  ?drug drugbank:drugCategory category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  # group: kegg
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  # sources: kegg, dbpedia, Chebi
  ?keggDrug purl:title ?title .
}
+ grouping sameAs patterns
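The grouping shown on this slide can be sketched as follows: patterns that were assigned to exactly one and the same source are collected into one subquery for that source, while patterns with several sources stay separate. This is an illustrative sketch only. The function name `group_patterns` and the data layout are hypothetical, the source assignments are copied from the slide, and the grouping of sameAs patterns is not covered.

```python
def group_patterns(selected):
    """Group triple patterns by source.

    selected: list of (pattern, sources) pairs in query order.
    Returns (exclusive, shared) where exclusive maps a source to the
    patterns only it can answer, and shared lists the remaining
    patterns together with all of their sources.
    """
    exclusive = {}
    shared = []
    for pattern, sources in selected:
        if len(sources) == 1:
            only = next(iter(sources))
            exclusive.setdefault(only, []).append(pattern)
        else:
            shared.append((pattern, sorted(sources)))
    return exclusive, shared


# Source assignments as annotated on the slide.
selected = [
    ("?drug drugbank:drugCategory category:micronutrient", {"drugbank"}),
    ("?drug drugbank:casRegistryNumber ?id", {"drugbank"}),
    ("?keggDrug rdf:type kegg:Drug", {"kegg"}),
    ("?keggDrug bio2rdf:xRef ?id", {"kegg"}),
    ("?keggDrug purl:title ?title", {"kegg", "dbpedia", "Chebi"}),
]

exclusive, shared = group_patterns(selected)
for source, patterns in exclusive.items():
    print(source, "->", " . ".join(patterns))
print("shared:", shared)
```

Sending a whole group as one subquery lets the endpoint evaluate the join between those patterns locally, so fewer intermediate results have to be transferred.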
12. Join Order Optimization
Source Selection -> Join Optimization -> Query Execution
Dynamic Programming with statistics-based cost estimation
bind join / hash join
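To make the idea concrete, here is a small sketch of dynamic programming over subsets of subqueries that picks, for each join, the cheaper of a hash join and a bind join. It is a toy under stated assumptions, not SPLENDID's cost model: the cost formulas, `REQUEST_COST`, the cardinalities, and the selectivities are all invented for illustration, and a bind join is only considered when the right-hand side is a single subquery that can receive bindings.

```python
from itertools import combinations

REQUEST_COST = 1.0  # hypothetical cost of shipping one binding to an endpoint


def best_plan(cards, sel):
    """Return (cost, cardinality, plan) for joining all subqueries.

    cards: subquery name -> estimated result size
    sel:   frozenset({a, b}) -> join selectivity (missing pair = cross product)
    """
    names = sorted(cards)
    # best[subset] = (cost, cardinality, plan); fetching a subquery costs its size
    best = {frozenset([n]): (float(cards[n]), float(cards[n]), n) for n in names}
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            subset = frozenset(combo)
            for k in range(1, size):
                for left_names in combinations(combo, k):
                    left = frozenset(left_names)
                    right = subset - left
                    lcost, lcard, lplan = best[left]
                    rcost, rcard, rplan = best[right]
                    s = 1.0
                    for a in left:
                        for b in right:
                            s *= sel.get(frozenset((a, b)), 1.0)
                    out = lcard * rcard * s
                    # Hash join: both inputs are fetched in full, then joined locally.
                    options = [(lcost + rcost + out, f"hash({lplan}, {rplan})")]
                    if len(right) == 1:
                        # Bind join: ship left bindings, receive only matching rows.
                        bind_cost = lcost + lcard * REQUEST_COST + out
                        options.append((bind_cost, f"bind({lplan}, {rplan})"))
                    cost, plan = min(options)
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, out, plan)
    return best[frozenset(names)]


# Invented estimates for the three subqueries of the example query.
cards = {"drugbank": 50, "kegg": 10000, "title": 40000}
sel = {
    frozenset(("drugbank", "kegg")): 1e-4,
    frozenset(("kegg", "title")): 2.5e-5,
}
cost, card, plan = best_plan(cards, sel)
print(plan, round(cost), round(card))
```

With these made-up numbers the small drugbank result drives two bind joins, because shipping 50 bindings is cheaper than fetching the large inputs in full. With a larger left input the hash join would win under the same formulas.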
13. Evaluation
FedBench Evaluation Suite
- Life Science + Cross Domain Data
- different query characteristics
Measuring
- #data sources selected
- query execution time
Orthogonal state-of-the-art approaches:
                    DARQ                     AliBaba      FedX                          SPLENDID
Statistics          ServiceDesc              -            -                             VoiD
Source Selection    Statistics (predicates)  All sources  ASK queries                   Statistics + ASK queries
Query Optimization  DynProg                  Heuristics   Heuristics                    DynProg
Query Execution     Bind join                Bind join    Bound Join + parallelization  Bind Join + Hash Join
14. Evaluation: Source Selection
Source Selection -> Join Optimization -> Query Execution
[Figure: source selection results; chart annotations: owl:sameAs, rdf:type]
15. Evaluation: Query Optimization
Source Selection -> Join Optimization -> Query Execution
16. Conclusion
Publish more VoiD descriptions!
VoiD-based query federation is efficient
What next?
- Combination with FedX
- Improving estimation and cost model
- Integrating SPARQL 1.1 features