Institute for Web Science and Technologies
                       University of Koblenz-Landau, Germany




                 Systematic Generation of
              SPARQL Benchmark Queries
                    for Linked Open Data



Olaf Görlitz, Matthias Thimm, Steffen Staab
Linked Data Federation


            SPARQL Queries on the Linked Data Cloud




                                                 Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/



ISWC'12, Boston, 11/15/2012   SPLODGE: Systematic LOD Benchmark Query Generation
Slide 2                       Olaf Görlitz, Matthias Thimm, Steffen Staab
The Problem



              Why not use
              benchmark
              queries?




              distributed                                       federation
                queries                                       implementation


RDF Benchmarks


       LUBM, BSBM, SP²B, ...                                FedBench (ISWC'11)

        Synthetic datasets                                   10 Linked Data sets
        Domain-specific                                       (~170M triples)
        Highly structured                                    25 handpicked
        Sophisticated queries                                 distributed queries

                Centralized                                              Fixed



                              Scalable, Flexible, Expressive
                                Linked Data Benchmark


Overview

 Benchmark Idea
 Methodology
 Evaluation




Linked Data Benchmark Features



         Scalability                      Flexibility                      Expressiveness

  Real Linked Data Sets                  Customization               Typical+Complex Queries




       Systematic SPARQL Benchmark Query Generator
                     for Linked Open Data




Requirements



          What we want:

          1. Define Query                                   Customize Benchmark
             Characteristics
          2. Automatic Query                                Random Queries
             Generation
          3. Query Validation                               #results > 0




Contribution

                              Methodology and toolset for
                              systematic query generation


                                          Linked Data




                 Config                                                      Benchmark
                                                                              Queries




        Parameterization              Query Generation                 Query Validation



Overview

 Benchmark Idea
 Methodology
 Evaluation




SPLODGE Methodology

                     Query                       Query                    Query
                 Parameterization              Generation               Validation



                Define typical + challenging distributed queries

               No federation query                                 Analyze queries
                 logs available                                    of benchmarks

                 SELECT ?drug ?keggUrl ?chebiImage WHERE {
                   ?drug rdf:type drugbank:drugs .
                   ?drug drugbank:keggCompoundId ?keggDrug .
                   ?keggDrug bio2rdf:url ?keggUrl .
                   ?drug drugbank:genericName ?drugBankName .
                   ?chebiDrug purl:title ?drugBankName .
                   ?chebiDrug chebi:image ?chebiImage .
                 }
                FedBench/LifeScience#5
SPLODGE Methodology

                     Query                       Query                    Query
                 Parameterization              Generation               Validation




    Algebra                               Structure                          Cardinality
   Query Form                           Variable Patterns                  # Data Sources
    (Select, Construct, ...)              (s, o, s+o, ...)
   Join Type                            Join Patterns                      # Joins/ Patterns
    (conj. / disj. / left-join)           (star, path)
   Result Modifiers                     Cross Product                      # Results
    (limit, offset, order by)
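
The three parameter groups above (algebra, structure, cardinality) could be captured in a single configuration object. The following is a minimal sketch, not the SPLODGE configuration format: the class name `QueryConfig`, all field names, and the default values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class QueryConfig:
    """Hypothetical container for the query parameters listed on the slide."""
    # Algebra
    query_form: str = "SELECT"            # e.g. SELECT, CONSTRUCT
    join_type: str = "conjunctive"        # conjunctive / disjunctive / left-join
    result_modifiers: Tuple[str, ...] = ()  # e.g. ("limit", "order by")
    # Structure
    join_pattern: str = "path"            # "path" or "star"
    variable_pattern: str = "s+o"         # which triple positions are variables
    allow_cross_product: bool = False
    # Cardinality
    num_sources: int = 2
    num_patterns: int = 3
    min_results: int = 1

    def validate(self) -> None:
        """Reject settings that cannot describe a generated query."""
        if self.join_pattern not in ("path", "star"):
            raise ValueError(f"unknown join pattern: {self.join_pattern}")
        if self.num_patterns < 1:
            raise ValueError("a query needs at least one triple pattern")
        # Assumption: each triple pattern is answered by one source,
        # so a query cannot span more sources than it has patterns.
        if not 1 <= self.num_sources <= self.num_patterns:
            raise ValueError("num_sources must be between 1 and num_patterns")
```

Calling `QueryConfig(join_pattern="star", num_patterns=4).validate()` passes silently, while an inconsistent setting such as five sources for three patterns raises `ValueError`.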




SPLODGE Methodology

                     Query                       Query                    Query
                 Parameterization              Generation               Validation



      Main query parameter: join structure
                                                                  path join




    FedBench queries                                              star join


SPLODGE Methodology

                     Query                       Query                    Query
                 Parameterization              Generation               Validation



      Additional query parameters: # triple patterns
                                   # data sources
                                   result size
                                   ...


     Path-join: n triple patterns,                      Star-join:        n triple patterns,
                m sources (m ≤ n)                                        anchor node (s/o)
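
To make the two join shapes concrete, here is a small sketch that turns a list of bound predicates into either a path-join or a star-join query. The function names `path_join` and `star_join` and the variable naming scheme are illustrative assumptions, not part of the SPLODGE toolset.

```python
from typing import List


def _wrap(patterns: List[str]) -> str:
    """Wrap triple patterns in a SELECT * query."""
    return "SELECT * WHERE {\n" + "\n".join(patterns) + "\n}"


def path_join(predicates: List[str]) -> str:
    """Chain patterns: the object of pattern i is the subject of pattern i+1."""
    patterns = [
        f"  ?var{i} <{p}> ?var{i + 1} ."
        for i, p in enumerate(predicates, start=1)
    ]
    return _wrap(patterns)


def star_join(predicates: List[str], anchor: str = "s") -> str:
    """Share one anchor variable, in subject ("s") or object ("o") position."""
    if anchor not in ("s", "o"):
        raise ValueError("anchor must be 's' or 'o'")
    patterns = []
    for i, p in enumerate(predicates, start=1):
        if anchor == "s":
            patterns.append(f"  ?anchor <{p}> ?var{i} .")
        else:
            patterns.append(f"  ?var{i} <{p}> ?anchor .")
    return _wrap(patterns)
```

A path-join over n predicates uses n + 1 variables, whereas a star-join over the same predicates joins every pattern on the single `?anchor` variable.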




SPLODGE Methodology

                     Query                        Query                      Query
                 Parameterization               Generation                 Validation

                 [Figure: example RDF graph with edges labeled owl:sameAs, rdf:type, rdfs:label, foaf:knows]


        Iteratively add random triple pattern                                  #results > 0 ?

             Need background knowledge                                         level of detail?

                  Predicate combinations                                       how provided?

SPLODGE Methodology

                     Query                        Query                        Query
                 Parameterization               Generation                   Validation

                 [Figure: example RDF graph with edges labeled owl:sameAs, rdf:type, rdfs:label, foaf:knows]


    Linked Predicates                                      Characteristic Sets*
    (owl:sameAs → rdf:type)                                {rdfs:label, foaf:knows, ...}
    DBpedia → geonames (43, 58)                            DBpedia (322), rdfs:label (437)
    freebase → DBpedia (86, 72)                                            foaf:knows (322)
    ...                                                    ...
                                                           *[Neumann, Moerkotte, ICDE 2011]
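
Both kinds of background statistics can be derived in one pass over the data. The sketch below is a simplified in-memory version under stated assumptions: a characteristic set is taken to be the set of predicates that occur with the same subject, and a linked predicate pair (p1, p2) is counted whenever the object of a p1 triple is the subject of a p2 triple. The per-source breakdown shown on the slide is omitted, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import FrozenSet, Iterable, Tuple

Triple = Tuple[str, str, str]


def characteristic_sets(triples: Iterable[Triple]) -> "Counter[FrozenSet[str]]":
    """Count how many subjects share each set of outgoing predicates."""
    preds_by_subject = defaultdict(set)
    for s, p, _o in triples:
        preds_by_subject[s].add(p)
    return Counter(frozenset(ps) for ps in preds_by_subject.values())


def linked_predicates(triples: Iterable[Triple]) -> "Counter[Tuple[str, str]]":
    """Count predicate pairs (p1, p2) joined via object(p1) == subject(p2)."""
    triples = list(triples)
    out_preds = defaultdict(list)  # subject -> predicates of outgoing triples
    for s, p, _o in triples:
        out_preds[s].append(p)
    counts = Counter()
    for _s, p1, o in triples:
        for p2 in out_preds.get(o, ()):
            counts[(p1, p2)] += 1
    return counts
```

For real datasets these statistics would be built from sorted, indexed data rather than Python dictionaries; the sketch only shows what is being counted.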
SPLODGE Methodology

                     Query                       Query                    Query
                 Parameterization              Generation               Validation




                               p1               p2              p3

                              p4


    Linked Predicates                                   Characteristics Sets

   (p1 → p2), (p2 → p3),                              {p1, p4}
                      (p3 → pi)                         {p1, p4, ...}
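
The chaining of linked predicates (p1 → p2, p2 → p3, ...) can be sketched as a random walk over the linked-predicate statistics. This is an illustrative assumption about how the iteration might look, not the authors' algorithm: the function name `generate_path`, the count-weighted choice of the next predicate, and the dead-end handling are all hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple


def generate_path(
    linked: Dict[Tuple[str, str], int],
    length: int,
    rng: random.Random,
) -> Optional[List[str]]:
    """Pick `length` predicates so that every consecutive pair is linked."""
    # Index the statistics: predicate -> [(successor, co-occurrence count)].
    successors: Dict[str, List[Tuple[str, int]]] = {}
    for (p1, p2), count in linked.items():
        successors.setdefault(p1, []).append((p2, count))

    starts = sorted(successors)
    if length < 1 or not starts:
        return None

    path = [rng.choice(starts)]
    while len(path) < length:
        options = successors.get(path[-1])
        if not options:
            return None  # dead end: no known predicate links onward
        preds, weights = zip(*options)
        # Assumption: favor frequently linked pairs by weighting with counts.
        path.append(rng.choices(preds, weights=weights, k=1)[0])
    return path
```

A caller would simply retry on `None`. The resulting predicate list can then be rendered as a path-join query with one triple pattern per predicate.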


SPLODGE Methodology

                     Query                       Query                    Query
                 Parameterization              Generation               Validation



                Verify generated queries (#results >0)

                How to evaluate?                                      Compute
                                                                   confidence value


                              minimum join selectivity > ε
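
One way to read the threshold test is sketched below. It assumes the textbook definition of join selectivity, the number of joined pairs divided by the product of the two input cardinalities, and estimates it from predicate counts and linked-predicate counts. Whether SPLODGE computes its confidence value exactly this way is not stated on the slide, and all names here are hypothetical.

```python
from typing import Dict, List, Tuple


def min_join_selectivity(
    path: List[str],
    pred_count: Dict[str, int],
    linked_count: Dict[Tuple[str, str], int],
) -> float:
    """Smallest estimated selectivity over all consecutive joins in a path."""
    selectivities = []
    for p1, p2 in zip(path, path[1:]):
        joined = linked_count.get((p1, p2), 0)
        # Assumed definition: |p1 join p2| / (|p1| * |p2|)
        selectivities.append(joined / (pred_count[p1] * pred_count[p2]))
    # A single-pattern query has no join, so nothing can filter it out.
    return min(selectivities) if selectivities else 1.0


def accept(
    path: List[str],
    pred_count: Dict[str, int],
    linked_count: Dict[Tuple[str, str], int],
    eps: float,
) -> bool:
    """Keep a candidate query only if its weakest join exceeds the threshold."""
    return min_join_selectivity(path, pred_count, linked_count) > eps
```

Raising `eps` discards more candidates, but the surviving queries are less likely to return empty results.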




Overview

 Benchmark Idea
 Methodology
 Evaluation




Evaluation Objective

 Verify generation of valid queries (#results >0)
 Compare variations of query generation algorithms


               Baseline                SPLODGElite                        SPLODGE
             random                    background                      + minimum
             predicate                    knowledge                     join selectivity
                                                                        (> 10^-4 / 10^-3 / 10^-2)

 Metrics:
   #queries with non-empty results
   #results per query
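
The two metrics can be computed per batch from the result counts returned by the triple store. This is a minimal sketch; the function name `batch_metrics` is hypothetical, and summarizing the results per query by their median is an assumption made here, not something the slides specify.

```python
from statistics import median
from typing import Dict, List


def batch_metrics(result_counts: List[int]) -> Dict[str, float]:
    """Summarize one batch of executed queries by their result counts."""
    non_empty = [c for c in result_counts if c > 0]
    return {
        "queries": len(result_counts),
        "non_empty": len(non_empty),
        # Assumption: report the median over queries that returned results.
        "median_results": median(non_empty) if non_empty else 0,
    }
```

For example, a batch with result counts `[0, 3, 10, 0, 5]` has three non-empty queries.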


Evaluation Setup

 Real Linked Data                                  Billion Triple Challenge Dataset
 Random queries
 Triple Store                                       Path-joins across data sources
                                                     3-6 patterns, bound predicates
                                                     100 queries per batch
                                  RDF3X



    SELECT * WHERE {
       ?var1 <http://dbpedia.org/property/description> ?var2 .
       ?var2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?var3 .
       ?var3 <http://www.w3.org/2002/07/owl#disjointWith> ?var4 .
       ?var4 <http://www.w3.org/2002/07/owl#disjointWith> ?var5 .
       ?var5 <http://semantic-mediawiki.org/swivt/1.0#wikiPageModificationDate> ?var6
    }


Evaluation Results
[Plot: #queries (y-axis) vs. joined triple patterns (x-axis)]


Evaluation Results
[Plot: #results (y-axis) vs. joined triple patterns (x-axis)]


Estimated vs. actual results size
[Plot: actual result size (y-axis) vs. estimated result size (x-axis)]


Predicate Occurrence in Queries




Conclusion

SPLODGE provides
 Flexible query characterization + parameterization
 Methodology for Systematic & Scalable Query Generation
 Toolset as Open Source (http://code.google.com/p/splodge/)

Future Work:
 Create a LOD Federation Benchmark
 Interactive SPARQL query construction


                                    Questions?


SPLODGE Evaluation Setup

BTC 2011 dataset in RDF3X
 pure triples, no context
 160 GB repository file
  (14h loading, 200 GB tmp mem)




SPLODGE Pre-Processing for BTC data


                               Identify common domains
Data size     Step                                                          Time
17 GB gzip    Identify common domains (e.g. jane08.lifejournal.com/home)    3.0 h
              Replace quad context (reduce number of sources)               4.4 h
              Sort quads + remove duplicates                                8.5 h
<1 MB gzip    Build predicate/context dictionary                            1.0 h
1.7 GB gzip   Create resource in/out-link index                             9.7 h
              Create linked predicate stats; compute characteristic sets    1.6 h
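
The context-replacement and de-duplication steps might look like the sketch below. It assumes that a "common domain" is simply the last two labels of the context URI's host name, which is a naive rule that mishandles suffixes such as `co.uk`; the function names are hypothetical, and the actual preprocessing works on sorted files rather than in-memory sets.

```python
from typing import Iterable, List, Tuple
from urllib.parse import urlsplit

Quad = Tuple[str, str, str, str]  # subject, predicate, object, context


def common_domain(context_uri: str) -> str:
    """Map a quad context to a coarse source name (naive two-label rule)."""
    host = urlsplit(context_uri).hostname
    if not host:
        return context_uri  # not a parseable URI: keep the context unchanged
    return ".".join(host.split(".")[-2:])


def normalize_quads(quads: Iterable[Quad]) -> List[Quad]:
    """Replace each context by its common domain, then sort and de-duplicate."""
    return sorted({(s, p, o, common_domain(c)) for s, p, o, c in quads})
```

Collapsing contexts this way reduces the number of distinct sources, and quads that differed only in their original context become duplicates that the set removes.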

