The document discusses strategies for distributed query processing and explanation on linked data. It describes techniques for optimizing queries across multiple linked data sources, including parsing queries into subqueries, source selection, evaluation methods, and optimization approaches like exclusive grouping and bound joins. It also discusses explaining both the query plan by showing how a query was processed and the results by providing why and where provenance.
Strategies for Processing and Explaining Distributed Queries on Linked Data
1. Strategies for Processing and
Explaining Distributed Queries on
Linked Data
Rakebul Hasan
Wimmics
Inria Sophia Antipolis
2. Research Theme
Distributed Query Processing
Optimization techniques for querying Linked Data
Distributed Query Explanation
Query plan explanation
How query solving works
Query result explanation
Why
Where
1
4. Querying Linked Data
Bottom-up strategies
Discover sources during query processing by
following links between sources
Top-down strategies
Sources are known
3
5. Querying Linked Data
FedBench Query CD5: Find the director and the
genre of movies directed by Italians
SELECT ?film ?director ?genre WHERE {
?film dbpedia-owl:director ?director .
?director dbpedia-owl:nationality dbpedia:Italy .
?x owl:sameAs ?film .
?x linkedMDB:genre ?genre
}
4
6. Querying Linked Data
FedBench Query CD5: Find the director and the genre of movies
directed by Italians
SELECT ?film ?director ?genre WHERE {
SERVICE <http://dbpedia.org/sparql> {
?film dbpedia-owl:director ?director .
?director dbpedia-owl:nationality dbpedia:Italy .
?x owl:sameAs ?film .
}
SERVICE <http://data.linkedmdb.org/sparql> {
?x linkedMDB:genre ?genre
}
}
SPARQL 1.1 SERVICE clause
Requires knowledge of which part of
the query should be solved by which endpoint
5
8. Querying Linked Data
Data warehousing approach
Collect the data in a central triple store
Process queries on that triple store
Disadvantages
Expensive preprocessing (data collection + integration)
and maintenance
Data not up to date
7
9. Querying Linked Data
Virtual integration approach
A query federation middleware
Parse the query and split it into subqueries
Source selection for subqueries
Evaluate the subqueries against corresponding sources
directly
Merge the results
Advantages
No preprocessing or maintenance
Up-to-date data
8
10. Running Example
Query LS6: Find KEGG drug names of all drugs in
Drugbank belonging to category Micronutrient
drugbank
http://www4.wiwiss.fu-berlin.de/drugbank/sparql
SELECT ?drug ?title WHERE {
?drug drugbank:drugCategory drugbank-category:micronutrient .
?drug drugbank:casRegistryNumber ?id .
?keggDrug rdf:type kegg:Drug .
?keggDrug bio2rdf:xRef ?id .
?keggDrug purl:title ?title .
}
KEGG
http://cu.bioportal.bio2rdf.org/sparql
DBpedia
http://data.linkedmdb.org/sparql
9
11. Parsing and Source Selecting
drugbank
Query LS6: Find KEGG drug names of all drugs in
Drugbank belonging to category Micronutrient
http://www4.wiwiss.fu-berlin.de/drugbank/sparql
?drug drugbank:drugCategory drugbank-category:micronutrient
?drug drugbank:casRegistryNumber ?id
SELECT ?drug ?title WHERE {
?drug drugbank:drugCategory drugbank-category:micronutrient .
?drug drugbank:casRegistryNumber ?id .
?keggDrug rdf:type kegg:Drug .
?keggDrug bio2rdf:xRef ?id .
?keggDrug purl:title ?title .
}
KEGG
http://cu.bioportal.bio2rdf.org/sparql
?keggDrug rdf:type kegg:Drug
?keggDrug bio2rdf:xRef ?id
Send ASK queries to all the sources
ASK { ?drug drugbank:drugCategory drugbank-category:micronutrient }
ASK { ?drug drugbank:casRegistryNumber ?id }
ASK { ?keggDrug rdf:type kegg:Drug }
ASK { ?keggDrug bio2rdf:xRef ?id }
ASK { ?keggDrug purl:title ?title }
DBpedia
http://data.linkedmdb.org/sparql
?keggDrug purl:title ?title
10
12. Evaluating subqueries
Two options
All triple patterns are evaluated individually and the results are joined
Nested loop join (NLJ): evaluate the patterns iteratively, pattern
by pattern, feeding intermediate bindings into the next pattern
11
13. Evaluating subqueries
Example of NLJ
SELECT ?drug ?title WHERE {
?drug drugbank:drugCategory drugbank-category:micronutrient .
?drug drugbank:casRegistryNumber ?id .
?keggDrug rdf:type kegg:Drug .
?keggDrug bio2rdf:xRef ?id .
?keggDrug purl:title ?title .
}
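A minimal sketch of how NLJ would proceed for this query, assuming the first Drugbank pattern is evaluated first; the drugbank-drug: prefix and the DB00126 binding below are hypothetical and only illustrate the substitution step:
# Step 1: evaluate the first triple pattern at the Drugbank endpoint
SELECT ?drug WHERE { ?drug drugbank:drugCategory drugbank-category:micronutrient }
# Step 2: for each ?drug binding from step 1, evaluate the next pattern
# with ?drug replaced by that binding, e.g. a hypothetical drugbank-drug:DB00126
SELECT ?id WHERE { drugbank-drug:DB00126 drugbank:casRegistryNumber ?id }
# ...and so on, pattern by pattern, joining the intermediate results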
12
16. Optimization techniques
Bound join
Effective for NLJ
Variable bindings from the intermediate result
are passed into the next subquery
SPARQL 1.1 VALUES clause
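A minimal sketch of a bound join for the KEGG subquery, assuming the ?id values produced by the Drugbank subquery are plain literals; the three CAS registry numbers below are hypothetical placeholders:
# A single remote request carries a block of intermediate ?id bindings
# via the VALUES clause, instead of one request per binding
SELECT ?keggDrug ?id WHERE {
?keggDrug bio2rdf:xRef ?id .
VALUES ?id { "50-81-7" "59-30-3" "68-19-9" }
}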
15
20. Optimization techniques
Join order
Selectivity estimation: join order heuristic based
on the number of bound and free variables [1]
Statistics: based on cardinalities of the triple
patterns [2]
[1] M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and D. Reynolds. SPARQL basic graph pattern optimization using selectivity estimation.
In Proceedings of the 17th International Conference on World Wide Web (WWW '08)
[2] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K. Sattler, and J. Umbrich. Data summaries for on-demand queries over linked data.
In Proceedings of the 19th International Conference on World Wide Web (WWW '10)
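Applied to the running example, a selectivity-based heuristic would evaluate the pattern with a constant object first; the ordering below is an illustrative sketch, not the order chosen by any particular system:
# evaluated first: bound object, expected to be most selective
?drug drugbank:drugCategory drugbank-category:micronutrient .
# then patterns that join on variables already bound by earlier patterns
?drug drugbank:casRegistryNumber ?id .
?keggDrug bio2rdf:xRef ?id .
?keggDrug rdf:type kegg:Drug .
?keggDrug purl:title ?title .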
19
21. Selectivity Estimation
Ideal for the Linked Data scenario
No need for statistics about the underlying data
However, estimations can be wrong, as in the example below
BGP                        Number of results   Selectivity estimation
?v1 p1 ?v2 . ?v1 p2 ?v3    5                   2
?v4 p3 c1 . ?v4 p4 ?v1     10                  1
20
22. Query Performance Prediction
Learn query performance from already
executed queries
Statistics about the underlying data not
required
Applications
Query optimization: join order
Workload management/scheduling
21
23. Regression
Learn a mapping function
f(X) = y
X = feature vector, a vector representation of the SPARQL query
y = performance metric (e.g. latency, number of results)
Support vector machine with nu-SVR
k-nearest neighbors regression
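For instance, in its simplest unweighted form, k-NN regression predicts the performance metric of a new query as the average over its k nearest training queries in feature space (a standard textbook formulation, not specific to this work):
y^(X) = (1/k) * sum of y_i over the k training queries X_i closest to X under the chosen distance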
22
29. Graph Pattern Features
Clustering Training Queries
k-medoids clustering algorithm with
approximated graph edit distance as the distance function
Selects actual data points as cluster centers
Works with an arbitrary distance function
28
30. Graph Pattern Features
Graph Edit Distance
Minimum amount of distortion needed to
transform one graph to another
Compute similarity by inverting the distance
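One common choice for such an inversion (the exact transformation used here is not stated) is sim(g1, g2) = 1 / (1 + GED(g1, g2)), so identical graphs get similarity 1 and similarity decreases as the edit distance grows.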
29
31. Graph Pattern Features
Graph Edit Distance
Usually computed using A* search
Exponential running time
Bipartite-matching-based approximate graph edit
distance
Previous research shows accurate results on
classification problems
30
32. Experiment Setup
Triple store and hardware
Jena TDB 1.0.0
16 GB memory
2.53 GHz CPU
48 GB system RAM
Linux 2.6.32 operating system
31
33. Experiment Setup
Datasets
Training, validation, test datasets generated from
25 DBPSB query templates
1260 training queries
420 validation queries
420 test queries
RDF data: DBpedia 3.5.1 with 100% scaling factor
from the DBPSB framework
32
34. Prediction Models
Support Vector Machine (SVM) with nu-SVR for regression
k-nearest neighbors (k-NN) regression
Euclidean distance as the distance function
k-dimensional tree (k-d tree) data structure to
compute the nearest neighbors
33
What's Next
Apply query performance prediction (QPP) to join order optimization
Benchmarking
FedBench: a benchmarking framework for
federated query processing
Systematic generation of training queries
Bootstrapping
Refining training queries from query logs
37
39. Statistical Analysis of Query Logs
An approach to systematic generation of training queries
Mario Arias, Javier D. Fernández, Miguel A. Martínez-Prieto, Pablo de la Fuente: An Empirical Study of Real-World SPARQL Queries,
1st International Workshop on Usage Analysis and the Web of Data,
co-located with the 20th International World Wide Web Conference (WWW2011)
38
42. Query Explanation
Query plan explanation
Query result explanation
Motivation
Understanding
Transparency
Trust
41
43. Query Explanation in the Semantic
Web
Query plan explanation
Query result explanation
Jena
Sesame
Virtuoso *
BigOWLIM +
* Explanation of the SPARQL-to-SQL translation
+ A debugging feature on the query engine side
42
44. Distributed Query Explanation in the
Semantic Web
Query plan explanation
Query result explanation
FedX
DARQ
SemWIQ
ADERIS
Anapsid
43
46. Related Work
Database: Why, how, where provenance
Semantic Web:
Inference explanation
Provenance
Generating justifications
45
47. Distributed Query Explanation
What We Provide
Query plan explanation (work in progress)
Prior to query execution
Includes predicted performance metrics
After query execution with performance metrics
Query result explanation
Why provenance
Where provenance
46
48. Query Result Explanation
A Result Explainer plug-in takes (RDF model, Query, Result) as input
The model can be any RDF model (RDF graph, SPARQL endpoint)
We generate result explanations by querying this model for why provenance
47
49. Why Provenance
Triples in the virtual model from which the result
is derived
<http://example.org/book/book8>
<http://purl.org/dc/elements/1.1/creator>
<http://www-sop.inria.fr/members/Alice> ;
<http://purl.org/dc/elements/1.1/title>
"Distributed Query Processing for Linked Data" .
<http://www-sop.inria.fr/members/Charlie>
<http://xmlns.com/foaf/0.1/name>
"Charlie" .
<http://www-sop.inria.fr/members/Alice>
<http://xmlns.com/foaf/0.1/knows>
<http://www-sop.inria.fr/members/Charlie> ;
<http://xmlns.com/foaf/0.1/name>
"Alice" .
48
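A hedged sketch of obtaining these why-provenance triples by querying the virtual model: assuming the original query asked for books, their titles, and the names of people the creator knows, re-evaluating its graph pattern with the result row's book fixed and CONSTRUCTing the matched triples returns exactly the triples above (dc: and foaf: prefixes assumed for the URIs shown):
CONSTRUCT {
?book dc:creator ?author ; dc:title ?title .
?author foaf:knows ?friend ; foaf:name ?authorName .
?friend foaf:name ?friendName .
}
WHERE {
?book dc:creator ?author ; dc:title ?title .
?author foaf:knows ?friend ; foaf:name ?authorName .
?friend foaf:name ?friendName .
VALUES ?book { <http://example.org/book/book8> }
}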
50. Where Provenance
Keep the provenance of sources in the virtual model
fedqld:source1 {
<http://example.org/book/book8>
<http://purl.org/dc/elements/1.1/creator>
<http://www-sop.inria.fr/members/Alice> ;
<http://purl.org/dc/elements/1.1/title>
"Distributed Query Processing for Linked Data" .
}
fedqld:source2 {
<http://www-sop.inria.fr/members/Charlie>
<http://xmlns.com/foaf/0.1/name>
"Charlie" .
<http://www-sop.inria.fr/members/Alice>
<http://xmlns.com/foaf/0.1/knows>
<http://www-sop.inria.fr/members/Charlie> ;
<http://xmlns.com/foaf/0.1/name>
"Alice" .
}
The fedqld:prov graph records where the triples in each source graph come from:
fedqld:prov {
fedqld:source1 void:sparqlEndpoint <http://localhost:3030/books/query> .
fedqld:source2 void:sparqlEndpoint <http://localhost:3031/person/query> .
}
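A hedged sketch of querying this model for where provenance: ask which named graph a supporting triple appears in, and which endpoint the fedqld:prov graph records for that graph (fedqld: and void: prefixes as above):
SELECT ?graph ?endpoint WHERE {
GRAPH ?graph {
<http://example.org/book/book8>
<http://purl.org/dc/elements/1.1/creator>
?author .
}
GRAPH fedqld:prov {
?graph void:sparqlEndpoint ?endpoint .
}
}
For the data above, this would return fedqld:source1 and <http://localhost:3030/books/query>.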
49
52. References
Andreas Schwarte, Peter Haase, Katja Hose, Ralf Schenkel, and Michael Schmidt. FedX: A federation layer for
distributed query processing on linked open data. In ESWC, 2011
Shinji Kikuchi, Shelly Sachdeva, Subhash Bhalla, Steven Lynden, Isao Kojima, Akiyoshi Matono, and Yusuke Tanimura.
Adaptive integration of distributed semantic web data. Databases in Networked Information Systems
Michael Schmidt, Olaf Görlitz, Peter Haase, Günter Ladwig, Andreas Schwarte, and Thanh Tran. 2011. FedBench: a
benchmark suite for federated semantic data query processing. In Proceedings of the 10th International
Conference on The Semantic Web - Volume Part I (ISWC'11)
51
Editor's Notes
#2: Overview of distributed query processing, extracting interesting features from SPARQL queries for machine learning algorithms, and explaining SPARQL queries
#30: The graph edit distance between two graphs is the minimum amount of distortion needed to transform one graph to another. The minimum amount of distortion is the sequence of edit operations with minimum cost. The edit operations are deletions, insertions, and substitutions of nodes and edges.
#35: RMSD represents the sample standard deviation of the differences between predicted values and observed values; these differences are called prediction errors when computed out-of-sample
#36: As the figures show, predictions for queries from templates 15, 22, 24, and 25 have large errors. Execution times of the queries from template 15 range from 1 ms to 21 ms, those from template 22 from 1 ms to 22 ms, and those from template 25 from 1 ms to 24 ms. For this reason, many data points in the range of 1 ms to 24 ms are distant from the perfect prediction line in the overall predictions figure (a). Execution times of the queries from template 24 range from 86 ms to 110 ms. Many data points in this interval are even more distant from the perfect prediction line than those in the 1 ms to 24 ms interval. This explains the highest error for queries belonging to template 25.
#39: Refining training queries from query logs by considering the statistically significant characteristics. Bootstrapping: starting with an initial set of properties, resources, and literals, and then generating training queries by permutations and combinations of the statistically significant features. Simplifying the pattern features: join features, triple pattern features, and pattern graph features to represent query patterns.
#46: Our goal is not to find the minimal parts of axioms in a justification that are required to hold an entailment.
#49: Triples in the virtual model for the query result are the causes.