Institute for Web Science and Technologies
                        University of Koblenz-Landau, Germany




SPLENDID: SPARQL Endpoint Federation
     Exploiting VOID Descriptions


          Olaf Görlitz, Steffen Staab
Motivation



    How to access a large number of linked data sources?




WeST Institute                  Olaf Görlitz
People and Knowledge Networks   COLD 2011, Bonn, Germany
Data Integration Approaches

           Data Warehouse                                  Link Traversal




+   Efficient query execution                      +   Live data access
+   Complete results                               +   Flexible / on demand
-   Data copies                                    -   Incomplete results
-   Inflexible                                     -   Biased by starting point

Our Approach

                                Data Federation

                                                        Live data access
                                                        Flexible source integration
                                                        Effective query planning
                                                        Complete results


Hypothesis:
Efficient query federation is possible using core Semantic
Web technology (i.e. SPARQL endpoints, VoiD descriptions)


VoiD: "Vocabulary of Interlinked Datasets"




    General information
    Basic statistics:       triples = 732744
    Type statistics:        chebi:Compound = 50477
    Predicate statistics:   bio:formula = 39555
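
To illustrate how such statistics could feed a query planner, here is a minimal Python sketch that estimates the number of matches for a single triple pattern from the three kinds of counts listed on this slide. The dictionary layout and the function names are assumptions made for illustration; they are not taken from the SPLENDID code base, and the estimator is deliberately cruder than a real cost model.

```python
# Hypothetical in-memory form of the VoiD figures shown on the slide.
void_stats = {
    "triples": 732744,                         # basic statistics
    "classes": {"chebi:Compound": 50477},      # type statistics
    "properties": {"bio:formula": 39555},      # predicate statistics
}


def is_var(term):
    """A term is a variable if it starts with '?' (SPARQL convention)."""
    return term.startswith("?")


def estimate_cardinality(pattern, stats):
    """Rough upper bound on the number of triples matching one pattern.

    Assumes the statistics are complete: a predicate or class that is
    not listed is taken to have no instances in the dataset.
    """
    _, predicate, obj = pattern
    if predicate == "rdf:type" and not is_var(obj):
        return stats["classes"].get(obj, 0)
    if not is_var(predicate):
        return stats["properties"].get(predicate, 0)
    # Unbound predicate: every triple in the dataset may match.
    return stats["triples"]


typed = estimate_cardinality(("?c", "rdf:type", "chebi:Compound"), void_stats)
formula = estimate_cardinality(("?c", "bio:formula", "?f"), void_stats)
anything = estimate_cardinality(("?s", "?p", "?o"), void_stats)
```

Bound subjects and objects are ignored here because the statistics shown on the slide carry no per-value information.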




Distributed Query Processing




Contribution:
Apply Best Practices of RDBMS for RDF Federation

                                                           http://code.google.com/p/rdffederator/
Query Example



        Which drugs are categorized as micronutrients?




       SELECT ?drug ?title WHERE {
         ?drug drugbank:drugCategory category:micronutrient .
         ?drug drugbank:casRegistryNumber ?id .
         ?keggDrug rdf:type kegg:Drug .
         ?keggDrug bio2rdf:xRef ?id .
         ?keggDrug purl:title ?title .
       }




Query Processing


          Source Selection             Join Optimization   Query Execution




       SELECT ?drug ?title WHERE {
         ?drug drugbank:drugCategory category:micronutrient .
         ?drug drugbank:casRegistryNumber ?id .
         ?keggDrug rdf:type kegg:Drug .
         ?keggDrug bio2rdf:xRef ?id .
         ?keggDrug purl:title ?title .
       }




Query Processing


          Source Selection             Join Optimization       Query Execution



       Step 1: Index-based source mapping

       SELECT ?drug ?title WHERE {
         ?drug drugbank:drugCategory category:micronutrient .    → drugbank
         ?drug drugbank:casRegistryNumber ?id .                  → drugbank
         ?keggDrug rdf:type kegg:Drug .                          → kegg
         ?keggDrug bio2rdf:xRef ?id .                            → kegg
         ?keggDrug purl:title ?title .                           → kegg, dbpedia, Chebi
       }

         predicate-index                                   type-index
         drugbank:drugCategory → drugbank                  kegg:Drug → kegg
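
The index lookup on this slide can be sketched in a few lines of Python. Only the two index entries printed on the slide are given by the source; the remaining entries are inferred from the source annotations next to the query, and the data layout and function name are assumptions for illustration only.

```python
ALL_SOURCES = {"drugbank", "kegg", "dbpedia", "Chebi"}

# Predicate index: predicate -> data sources that contain it.
# Entries other than drugbank:drugCategory are inferred from the
# annotations beside the query, not read from a real VoiD file.
predicate_index = {
    "drugbank:drugCategory": {"drugbank"},
    "drugbank:casRegistryNumber": {"drugbank"},
    "bio2rdf:xRef": {"kegg"},
    "purl:title": {"kegg", "dbpedia", "Chebi"},
}

# Type index: class -> data sources that contain instances of it.
type_index = {"kegg:Drug": {"kegg"}}


def is_var(term):
    return term.startswith("?")


def map_sources(pattern):
    """Return the candidate sources for one (subject, predicate, object) pattern."""
    _, predicate, obj = pattern
    if predicate == "rdf:type" and not is_var(obj):
        return set(type_index.get(obj, set()))
    if is_var(predicate):
        # Nothing to look up: every source remains a candidate.
        return set(ALL_SOURCES)
    return set(predicate_index.get(predicate, set()))


query = [
    ("?drug", "drugbank:drugCategory", "category:micronutrient"),
    ("?drug", "drugbank:casRegistryNumber", "?id"),
    ("?keggDrug", "rdf:type", "kegg:Drug"),
    ("?keggDrug", "bio2rdf:xRef", "?id"),
    ("?keggDrug", "purl:title", "?title"),
]
mapping = [map_sources(p) for p in query]
```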




Query Processing


          Source Selection             Join Optimization   Query Execution



       Step 2: Refinement with ASK queries

       SELECT ?drug ?title WHERE {
         ?drug drugbank:drugCategory category:micronutrient .
         ?drug drugbank:casRegistryNumber ?id .
         ?keggDrug rdf:type kegg:Drug .
         ?keggDrug bio2rdf:xRef ?id .
         ?keggDrug purl:title ?title .
       }


        No index for subject / object values
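
Because the indexes hold no subject or object values, a pattern with a bound subject or object can be checked against each candidate source with a SPARQL ASK query. The Python sketch below shows one way this refinement might look. The `ask` callable is a hypothetical stand-in for an HTTP request to a SPARQL endpoint, and the PREFIX declarations a real query would need are omitted.

```python
def is_var(term):
    return term.startswith("?")


def to_ask_query(pattern):
    """Build an ASK query for a single triple pattern (prefixes omitted)."""
    return "ASK { %s %s %s }" % pattern


def refine_sources(pattern, candidates, ask):
    """Drop candidate sources that cannot answer a pattern with bound terms.

    `ask(source, query)` is a hypothetical callable returning True or
    False; in practice it would send the query to the source's endpoint.
    """
    subject, _, obj = pattern
    if is_var(subject) and is_var(obj):
        # No bound subject or object: an ASK query adds no information
        # beyond what the predicate index already provides.
        return set(candidates)
    query = to_ask_query(pattern)
    return {source for source in candidates if ask(source, query)}


def fake_ask(source, query):
    """Stand-in endpoint for the example: only drugbank answers True."""
    return source == "drugbank"


pattern = ("?drug", "drugbank:drugCategory", "category:micronutrient")
refined = refine_sources(pattern, {"drugbank", "kegg"}, fake_ask)
```

Each ASK query costs one round trip per candidate source, so the refinement pays off mainly when it removes a source from an otherwise expensive subquery.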



Query Processing


          Source Selection             Join Optimization   Query Execution



       Step 3: Grouping triple patterns

       SELECT ?drug ?title WHERE {
         ?drug drugbank:drugCategory category:micronutrient .    } drugbank
         ?drug drugbank:casRegistryNumber ?id .                  }
         ?keggDrug rdf:type kegg:Drug .                          } kegg
         ?keggDrug bio2rdf:xRef ?id .                            }
         ?keggDrug purl:title ?title .                           } kegg, dbpedia, Chebi
       }


        + grouping sameAs patterns
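
The grouping of triple patterns can be sketched as follows: patterns that map to the same single source are bundled into one subquery, while a pattern with several candidate sources stays on its own. This is a simplified Python illustration under assumed data structures; the additional grouping of sameAs patterns mentioned on the slide is not covered.

```python
def group_patterns(assignments):
    """Group patterns by exclusive source.

    `assignments` is a list of (pattern, sources) pairs in query order.
    Returns a list of (sources, patterns) groups: one group per source
    that exclusively answers some patterns, plus one singleton group for
    every pattern that has several candidate sources.
    """
    exclusive = {}   # source -> pattern list of its (shared) group
    groups = []
    for pattern, sources in assignments:
        if len(sources) == 1:
            (source,) = sources
            if source not in exclusive:
                exclusive[source] = []
                groups.append((frozenset(sources), exclusive[source]))
            exclusive[source].append(pattern)
        else:
            groups.append((frozenset(sources), [pattern]))
    return groups


p1 = ("?drug", "drugbank:drugCategory", "category:micronutrient")
p2 = ("?drug", "drugbank:casRegistryNumber", "?id")
p3 = ("?keggDrug", "rdf:type", "kegg:Drug")
p4 = ("?keggDrug", "bio2rdf:xRef", "?id")
p5 = ("?keggDrug", "purl:title", "?title")

groups = group_patterns([
    (p1, {"drugbank"}),
    (p2, {"drugbank"}),
    (p3, {"kegg"}),
    (p4, {"kegg"}),
    (p5, {"kegg", "dbpedia", "Chebi"}),
])
```

Sending a whole group as one subquery lets the remote endpoint evaluate the join locally, so fewer intermediate results cross the network.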



Join Order Optimization


          Source Selection             Join Optimization   Query Execution



    Dynamic Programming with statistics-based cost estimation

                                     bind join /
                                     hash join
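
As a rough illustration of dynamic programming over join orders, the Python sketch below enumerates all subsets of the subqueries and keeps the cheapest plan for each. The cost model (sum of estimated intermediate result sizes), the fixed join selectivity, and the input cardinalities are all assumptions made up for this example. In the approach described here the estimates would come from VoiD statistics, and the choice between bind join and hash join would also be part of the plan, which this sketch leaves out.

```python
from itertools import combinations


def best_join_plan(relations, selectivity=0.001):
    """Find the cheapest join tree by dynamic programming over subsets.

    `relations` maps a subquery name to (estimated cardinality, variables).
    Returns (cost, cardinality, variables, plan), where plan is a nested
    tuple of subquery names.
    """
    names = sorted(relations)
    best = {}
    for name in names:
        cardinality, variables = relations[name]
        best[frozenset([name])] = (0.0, float(cardinality), frozenset(variables), name)

    for size in range(2, len(names) + 1):
        for subset in combinations(names, size):
            key = frozenset(subset)
            # Try every split of the subset into two non-empty parts.
            for k in range(1, size):
                for left in combinations(subset, k):
                    left_key = frozenset(left)
                    right_key = key - left_key
                    l_cost, l_card, l_vars, l_plan = best[left_key]
                    r_cost, r_card, r_vars, r_plan = best[right_key]
                    # Without a shared variable the join is a cross product.
                    factor = selectivity if l_vars & r_vars else 1.0
                    card = l_card * r_card * factor
                    cost = l_cost + r_cost + card
                    if key not in best or cost < best[key][0]:
                        best[key] = (cost, card, l_vars | r_vars, (l_plan, r_plan))
    return best[frozenset(names)]


# Invented cardinalities for the three groups of the example query.
relations = {
    "drugbank_group": (10, {"drug", "id"}),
    "kegg_group": (1000, {"keggDrug", "id"}),
    "title_pattern": (100000, {"keggDrug", "title"}),
}
cost, cardinality, variables, plan = best_join_plan(relations)
```

With these numbers the cheapest plan joins the two small groups first and only then brings in the large title subquery. Full subset enumeration grows exponentially with the number of subqueries, which is acceptable for the handful of groups in a typical federated query.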




Evaluation


   FedBench Evaluation Suite                                  Measuring
    • Life Science + Cross Domain Data                        • number of data sources selected
    • different query characteristics                         • query execution time

Orthogonal State-of-the-Art approaches:

                      DARQ                      AliBaba        FedX                           SPLENDID
 Statistics           ServiceDesc               –              –                              VoiD
 Source Selection     Statistics (predicates)   All sources    ASK queries                    Statistics + ASK queries
 Query Optimization   DynProg                   Heuristics     Heuristics                     DynProg
 Query Execution      Bind join                 Bind join      Bound join + parallelization   Bind join + hash join
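
The two join operators listed for SPLENDID in the Query Execution row differ in where the matching happens. The following Python sketch contrasts them on small in-memory binding lists. The `fetch_right` callable is a hypothetical stand-in for sending the right-hand subquery to a remote endpoint with the join variable already bound; neither function is taken from the SPLENDID implementation.

```python
def hash_join(left, right, var):
    """Fetch both inputs completely, then join them locally on `var`."""
    table = {}
    for row in left:
        table.setdefault(row[var], []).append(row)
    return [{**l, **r} for r in right for l in table.get(r[var], [])]


def bind_join(left, fetch_right, var):
    """For each left binding, ask the right source only for matching rows.

    `fetch_right(var, value)` is a hypothetical callable that evaluates
    the right-hand subquery with `var` bound to `value`.
    """
    results = []
    for l in left:
        for r in fetch_right(var, l[var]):
            results.append({**l, **r})
    return results


left = [{"drug": "d1", "id": "1"}, {"drug": "d2", "id": "2"}]
right = [{"keggDrug": "k1", "id": "1"}, {"keggDrug": "k3", "id": "3"}]


def fetch_right(var, value):
    """Simulated remote endpoint: filters the right-hand rows by the binding."""
    return [row for row in right if row[var] == value]


hashed = hash_join(left, right, "id")
bound = bind_join(left, fetch_right, "id")
```

A bind join tends to pay off when the left input is small and the right source is large, since only matching rows are transferred. A hash join avoids the many small requests when both inputs are of moderate size.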


Evaluation: Source Selection


          Source Selection                Join Optimization      Query Execution




          [Chart; labels: owl:sameAs, rdf:type]


Evaluation: Query Optimization


          Source Selection             Join Optimization   Query Execution




Conclusion



                           Publish more VoiD descriptions!



                   VoiD-based query federation is efficient



What next?
• Combination with FedX
• Improving estimation and cost model
• Integrating SPARQL 1.1 features

Editor's Notes

  • #3: Pre-selected linked datasets Transparent query federation