際際滷

際際滷Share a Scribd company logo
An early look at the
LDBC Social Network Benchmarks
Business Intelligence workload
G叩bor Sz叩rnyas, Arnau Prat-P辿rez, Alex Averbuch, J坦zsef Marton, Marcus Paradies,
Moritz Kaufmann, Orri Erling, Peter Boncz, Vlad Haprian, J叩nos Benjamin Antal
GRADES-NDA @ SIGMOD
Houston, TX
2
Linked Data Benchmark Council
LDBC is a non-profit organization
dedicated to establishing benchmarks,
benchmark practices and benchmark
results for graph data management SW.
LDBCs Social Network Benchmark is
an industrial and academic initiative,
formed by principal actors in the field of
graph-like data management.
3
LDBC timeline
2012 2013 2014 2015 2016 2017 2018
30+ papers in total,
including G-CORE
(presented on Thursday)
4
Graph processing landscape
Interactive
Graphalytics
BI
local queries
global queries
computations
5
BI global queries
Graph processing landscape
Interactive
Graphalytics
local queries
computations
Example: Recently liked by friends
MATCH
(u:User {id: $uID})-[:FRIEND]-(f:User)-[l:LIKES]->(p:Post)
RETURN f, p
ORDER BY l.timestamp DESC
LIMIT 10
frequent upd.limited data
6
Graph processing landscape
Interactive
Graphalytics
local queries
computations
BI global queries
frequent upd.limited data
Example: One-sided friendships
MATCH (u1:User)-[:FRIEND]-(u2:User)-[l:LIKES]->(p:Post),
(u1)-[:AUTHOR_OF]->(p)
WITH u1, u2, count(l) AS likes
WHERE likes > 10
AND NOT (u1)-[:LIKES]->(:Post)<-[:AUTHOR_OF]-(u2)
RETURN u1, u2
lots of data infreq. upd.
7
Graph processing landscape
Interactive
Graphalytics
BI
local queries
global queries
computations
frequent upd.limited data
lots of data infreq. upd.
Example: Find the most central individuals.
BFS breadth-first search LCC local clustering coefficient
PR PageRank SSSP single-source shortest path
CDLP community detection by label propagation
WCC weakly connected components
all data no upd.
8
Graph processing landscape
all data no upd.
lots of upd.limited data
lots of data few upd.
Interactive
Graphalytics
BI
local queries
global queries
computations
9
Business
Intelligence
amount of data accessed
expectedexecutiontime
Interactive
LDBC benchmarks at a glance
Graphalytics
10
Business
Intelligence
amount of data accessed
expectedexecutiontime
Interactive
LDBC benchmarks at a glance
Graphalytics
Social Network
Benchmark
11
Social network graph
DATAGEN:
 Generate realistic graphs
 Multiple scale factors (SFs)
Nodes:
 Collection attributes
 Type inheritance
Edges:
 With attributes
 Edges between similar nodes
 Network of Persons
 Reply tree of Posts/Comments
12
Business
Intelligence
amount of data accessed
expectedexecutiontime
Interactive
LDBC benchmarks at a glance
Graphalytics
13
Detailed query specifications
14 design starts here
15
Choke points
 = a challenging aspect of query processing, a well-chosen difficulty
 Allows systematic benchmark design
Peter Boncz, Thomas Neumann, Orri Erling,
TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark,
TPCTC 2013
16
Q5: Top posters in a country
1. Find the top 100 Forums by members in a given Country.
2. For each member of the top 100 Forums, count their Posts in the top 100 Forums.
1 2 3 4
4 3 2 1
Forum to Country
Country to Forum
57 s
0.3 s
CP-2.1 Rich join order optimization
Sparksee
SF10
17
Choke points: optimizations
 Top-k pushdown optimization
 New in LDBC (not covered in TPC-H choke points)
18
Q22: International dialog
For each p1-p2 pair, calculate score and get top pair (w/ tie-break)
+ 4 if p1 replied to p2
+ 1 if p2 replied to p1
+15 if p1 knows p2
+10 if p1 liked p2s msg
+ 1 if p2 liked p1s msg
= max. 31 in total
19
Q22: International dialog
Avoiding full Cartesian product with Top-k pushdown:
Example #1:
 There are k pairs with maximum points (31).
 A pair cannot possibly achieve max. points  prune
Example #2:
 There are k pairs with at least 20 points.
 A pair fails the condition for 15 points  prune
+ 4 if p1 replied
+ 1 if p2 replied
+15 if p1 knows p2
+10 if p1 liked
+ 1 if p2 liked
= max. 31 in total
20
Q16: Experts in social circle
 CP-1.3 [QOPT] Top-k pushdown
 CP-7.1 [QEXE] Path pattern reuse
 CP-7.2 [QOPT] Cardinality estimation of transitive paths
 CP-7.3 [QEXE] Execution of a transitive step
Baseline 29 s
Top-k 27 s
Top-k + Path pattern reuse 15 s
Sparksee
run times
on SF10
21
Language choke points
New choke points to cover language features.
 CP-8.1: Complex patterns
 CP-8.2: Complex aggregations
 CP-8.3: Ranking-style queries
 CP-8.4: Query composition
 CP-8.5: Dates and times
 CP-8.6: Handling paths
22
Language choke points
New choke points to cover language features.
 CP-8.1: Complex patterns
 CP-8.2: Complex aggregations
 CP-8.3: Ranking-style queries
 CP-8.4: Query composition
 CP-8.5: Dates and times
 CP-8.6: Handling paths
Q22: select top pair for each city1
LIMIT 1 not sufficient
PostgreSQL: rank()
23
Language choke points
New choke points to cover language features.
 CP-8.1: Complex patterns
 CP-8.2: Complex aggregations
 CP-8.3: Ranking-style queries
 CP-8.4: Query composition
 CP-8.5: Dates and times
 CP-8.6: Handling paths
Q5: top 100 forums
(Important feature of G-CORE.)
24
Language choke points
New choke points to cover language features.
 CP-8.1: Complex patterns
 CP-8.2: Complex aggregations
 CP-8.3: Ranking-style queries
 CP-8.4: Query composition
 CP-8.5: Dates and times
 CP-8.6: Handling paths
Q1: aggregate for each month
Datetime features:
 SQL 
 SPARQL 
 Cypher: recently added
25
Language choke points
New choke points to cover language features.
 CP-8.1: Complex patterns
 CP-8.2: Complex aggregations
 CP-8.3: Ranking-style queries
 CP-8.4: Query composition
 CP-8.5: Dates and times
 CP-8.6: Handling paths
26
Q25: Weighted interaction paths
1. Given two Persons, get all shortest paths on knows edges.
2. For each path, for each edge on the path, calculate a weight.
3. For each path, summarize weights.
4. Return paths, ordered by weights (desc).
(Q25 covers 15 CPs, incl. all language-related ones  its SQL impl. is ~2500 chars)
27
CP-8.6 Handling paths
1. Path unwinding: higher-order queries
Q25: weights based on additional pattern matching on path elements.
2. Matching semantics for paths
 Homomorphism-based (walks)
 Isomorphism-based
 No-repeated-anything
 No-repeated-edge semantics (trails)
 No-repeated-node semantics (simple path)
Q16: social circle  persons connected by an edge-unique paths of [x, y] hops
3. Regular path queries (RPQs)
R. Angles et al.,
Foundations of Modern Query Languages for Graph Databases,
ACM Computing Surveys, 2017
28
CP-8.6 Handling paths
29
Implementing the BI workload
High-level process
1. Generate data set
2. Implement loader
3. Implement driver adapter
4. Implement queries and validate
Very time consuming process, but
 after 2 validated tools  still bugs in both implementations
 after 3 validated tools  still ambiguities in the spec
Validation
1. Generate validation set
2. Cross-validate for multiple SFs
3. Failure  fix issues and go to 2
30
Implementing the BI workload
Cross-validation for implementations
Cypher Neo4j 25/25
SQL PostgreSQL 25/25
Imperative Sparksee 25/25
SPARQL Stardog 24/25
PGQL Oracle Labs PGX 10/25
In progress: Spark SQL
31
Roadmap
1. Help industry adoption  get more benchmark results
2. Define updates on the graph.
Necessitates complex dependency handling (SIGMOD15 paper), and
raises many design choices:
 Affected types which nodes/edges? what distribution?
 Nature of changes append-only vs. insert and delete
 Granularity nodes/edges vs. attributes
 Frequency of changes streaming vs. batch
3. Publish as a conference paper
32
Acknowledgements
G叩bor Sz叩rnyas was partially supported by NSERC RGPIN-04573-16 (Canada) and
the MTA-BME Lend端let Cyber-Physical Systems Research Group (Hungary).
DAMA-UPC research was supported by the grant TIN2017-89244-R from MINECO
(Ministerio de Economia, Industria y Competitividad) and the recognition 2017SGR-856
(MACDA) from AGAUR (Generalitat de Catalunya).
Sparsity thanks the EU H2020 for funding the Uniserver project (ICT-04-2015-688540).
MTA-BME Lend端let
Cyber-Physical Systems Research Group
Department of Measurement
and Information Systems
Department of Electrical and
Computer Engineering

More Related Content

An early look at the LDBC Social Network Benchmark's Business Intelligence workload

  • 1. An early look at the LDBC Social Network Benchmarks Business Intelligence workload G叩bor Sz叩rnyas, Arnau Prat-P辿rez, Alex Averbuch, J坦zsef Marton, Marcus Paradies, Moritz Kaufmann, Orri Erling, Peter Boncz, Vlad Haprian, J叩nos Benjamin Antal GRADES-NDA @ SIGMOD Houston, TX
  • 2. 2 Linked Data Benchmark Council LDBC is a non-profit organization dedicated to establishing benchmarks, benchmark practices and benchmark results for graph data management SW. LDBCs Social Network Benchmark is an industrial and academic initiative, formed by principal actors in the field of graph-like data management.
  • 3. 3 LDBC timeline 2012 2013 2014 2015 2016 2017 2018 30+ papers in total, including G-CORE (presented on Thursday)
  • 5. 5 BI global queries Graph processing landscape Interactive Graphalytics local queries computations Example: Recently liked by friends MATCH (u:User {id: $uID})-[:FRIEND]-(f:User)-[l:LIKES]->(p:Post) RETURN f, p ORDER BY l.timestamp DESC LIMIT 10 frequent upd.limited data
  • 6. 6 Graph processing landscape Interactive Graphalytics local queries computations BI global queries frequent upd.limited data Example: One-sided friendships MATCH (u1:User)-[:FRIEND]-(u2:User)-[l:LIKES]->(p:Post), (u1)-[:AUTHOR_OF]->(p) WITH u1, u2, count(l) AS likes WHERE likes > 10 AND NOT (u1)-[:LIKES]->(:Post)<-[:AUTHOR_OF]-(u2) RETURN u1, u2 lots of data infreq. upd.
  • 7. 7 Graph processing landscape Interactive Graphalytics BI local queries global queries computations frequent upd.limited data lots of data infreq. upd. Example: Find the most central individuals. BFS breadth-first search LCC local clustering coefficient PR PageRank SSSP single-source shortest path CDLP community detection by label propagation WCC weakly connected components all data no upd.
  • 8. 8 Graph processing landscape all data no upd. lots of upd.limited data lots of data few upd. Interactive Graphalytics BI local queries global queries computations
  • 9. 9 Business Intelligence amount of data accessed expectedexecutiontime Interactive LDBC benchmarks at a glance Graphalytics
  • 10. 10 Business Intelligence amount of data accessed expectedexecutiontime Interactive LDBC benchmarks at a glance Graphalytics Social Network Benchmark
  • 11. 11 Social network graph DATAGEN: Generate realistic graphs Multiple scale factors (SFs) Nodes: Collection attributes Type inheritance Edges: With attributes Edges between similar nodes Network of Persons Reply tree of Posts/Comments
  • 12. 12 Business Intelligence amount of data accessed expectedexecutiontime Interactive LDBC benchmarks at a glance Graphalytics
  • 15. 15 Choke points = a challenging aspect of query processing, a well-chosen difficulty Allows systematic benchmark design Peter Boncz, Thomas Neumann, Orri Erling, TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark, TPCTC 2013
  • 16. 16 Q5: Top posters in a country 1. Find the top 100 Forums by members in a given Country. 2. For each member of the top 100 Forums, count their Posts in the top 100 Forums. 1 2 3 4 4 3 2 1 Forum to Country Country to Forum 57 s 0.3 s CP-2.1 Rich join order optimization Sparksee SF10
  • 17. 17 Choke points: optimizations Top-k pushdown optimization New in LDBC (not covered in TPC-H choke points)
  • 18. 18 Q22: International dialog For each p1-p2 pair, calculate score and get top pair (w/ tie-break) + 4 if p1 replied to p2 + 1 if p2 replied to p1 +15 if p1 knows p2 +10 if p1 liked p2s msg + 1 if p2 liked p1s msg = max. 31 in total
  • 19. 19 Q22: International dialog Avoiding full Cartesian product with Top-k pushdown: Example #1: There are k pairs with maximum points (31). A pair cannot possibly achieve max. points prune Example #2: There are k pairs with at least 20 points. A pair fails the condition for 15 points prune + 4 if p1 replied + 1 if p2 replied +15 if p1 knows p2 +10 if p1 liked + 1 if p2 liked = max. 31 in total
  • 20. 20 Q16: Experts in social circle CP-1.3 [QOPT] Top-k pushdown CP-7.1 [QEXE] Path pattern reuse CP-7.2 [QOPT] Cardinality estimation of transitive paths CP-7.3 [QEXE] Execution of a transitive step Baseline 29 s Top-k 27 s Top-k + Path pattern reuse 15 s Sparksee run times on SF10
  • 21. 21 Language choke points New choke points to cover language features. CP-8.1: Complex patterns CP-8.2: Complex aggregations CP-8.3: Ranking-style queries CP-8.4: Query composition CP-8.5: Dates and times CP-8.6: Handling paths
  • 22. 22 Language choke points New choke points to cover language features. CP-8.1: Complex patterns CP-8.2: Complex aggregations CP-8.3: Ranking-style queries CP-8.4: Query composition CP-8.5: Dates and times CP-8.6: Handling paths Q22: select top pair for each city1 LIMIT 1 not sufficient PostgreSQL: rank()
  • 23. 23 Language choke points New choke points to cover language features. CP-8.1: Complex patterns CP-8.2: Complex aggregations CP-8.3: Ranking-style queries CP-8.4: Query composition CP-8.5: Dates and times CP-8.6: Handling paths Q5: top 100 forums (Important feature of G-CORE.)
  • 24. 24 Language choke points New choke points to cover language features. CP-8.1: Complex patterns CP-8.2: Complex aggregations CP-8.3: Ranking-style queries CP-8.4: Query composition CP-8.5: Dates and times CP-8.6: Handling paths Q1: aggregate for each month Datetime features: SQL SPARQL Cypher: recently added
  • 25. 25 Language choke points New choke points to cover language features. CP-8.1: Complex patterns CP-8.2: Complex aggregations CP-8.3: Ranking-style queries CP-8.4: Query composition CP-8.5: Dates and times CP-8.6: Handling paths
  • 26. 26 Q25: Weighted interaction paths 1. Given two Persons, get all shortest paths on knows edges. 2. For each path, for each edge on the path, calculate a weight. 3. For each path, summarize weights. 4. Return paths, ordered by weights (desc). (Q25 covers 15 CPs, incl. all language-related ones its SQL impl. is ~2500 chars)
  • 27. 27 CP-8.6 Handling paths 1. Path unwinding: higher-order queries Q25: weights based on additional pattern matching on path elements. 2. Matching semantics for paths Homomorphism-based (walks) Isomorphism-based No-repeated-anything No-repeated-edge semantics (trails) No-repeated-node semantics (simple path) Q16: social circle persons connected by an edge-unique paths of [x, y] hops 3. Regular path queries (RPQs) R. Angles et al., Foundations of Modern Query Languages for Graph Databases, ACM Computing Surveys, 2017
  • 29. 29 Implementing the BI workload High-level process 1. Generate data set 2. Implement loader 3. Implement driver adapter 4. Implement queries and validate Very time consuming process, but after 2 validated tools still bugs in both implementations after 3 validated tools still ambiguities in the spec Validation 1. Generate validation set 2. Cross-validate for multiple SFs 3. Failure fix issues and go to 2
  • 30. 30 Implementing the BI workload Cross-validation for implementations Cypher Neo4j 25/25 SQL PostgreSQL 25/25 Imperative Sparksee 25/25 SPARQL Stardog 24/25 PGQL Oracle Labs PGX 10/25 In progress: Spark SQL
  • 31. 31 Roadmap 1. Help industry adoption get more benchmark results 2. Define updates on the graph. Necessitates complex dependency handling (SIGMOD15 paper), and raises many design choices: Affected types which nodes/edges? what distribution? Nature of changes append-only vs. insert and delete Granularity nodes/edges vs. attributes Frequency of changes streaming vs. batch 3. Publish as a conference paper
  • 32. 32 Acknowledgements G叩bor Sz叩rnyas was partially supported by NSERC RGPIN-04573-16 (Canada) and the MTA-BME Lend端let Cyber-Physical Systems Research Group (Hungary). DAMA-UPC research was supported by the grant TIN2017-89244-R from MINECO (Ministerio de Economia, Industria y Competitividad) and the recognition 2017SGR-856 (MACDA) from AGAUR (Generalitat de Catalunya). Sparsity thanks the EU H2020 for funding the Uniserver project (ICT-04-2015-688540). MTA-BME Lend端let Cyber-Physical Systems Research Group Department of Measurement and Information Systems Department of Electrical and Computer Engineering