Presented at the Montr¨¦al Spark meetup @ https://www.meetup.com/Montreal-Apache-Spark-Meetup/events/245707751/
1 of 57
Download to read offline
More Related Content
Compiling openCypher graph queries with Spark Catalyst
1. Compiling openCypher graph queries
with Spark Catalyst
G¨¢bor Sz¨¢rnyas
pre-holiday Spark des f¨ºtes @ Montr¨¦al
2. BACKGROUND
? PhD student @ Budapest Univ. of Tech. and Econ., Hungary
? Visiting researcher @ McGill University
3. RESEARCH TOPIC
Problem statement
? Large graph (100M+ nodes)
? Complex global graph queries
? Evaluate them in <1sec
Approach
? We ¡°cheat¡±
? Build a huge cache
? Maintain results:
incremental views
4. RESEARCH OBJECTIVES
Create a scalable graph query engine with incremental views
1. graph queries
2. incremental views
3. making it scale
6. PROPERTY GRAPH DATABASES
NoSQL family
Data model:
vertices, edges
and properties
#1 query approach: graph pattern matching
Note. Spark GraphX is an engine for graph analytics.
7. CYPHER AND OPENCYPHER
Cypher: query language of the Neo4j graph database.
?Cypher is a declarative, SQL-inspired language for describing
patterns in graphs visually using an ascii-art syntax.¡±
MATCH
(p:Person)-[:PRESENTER_OF]->(:Presentation)-[:AT]->(m:Meetup)
WHERE m.date = 'Monday, December 18, 2017'
RETURN p
?The openCypher project aims to deliver a full and open
specification of the industry¡¯s most widely adopted graph
database query language: Cypher.¡± (late 2015)
8. OPENCYPHER SYSTEMS
? Increasing adoption
? Relational databases:
o SAP HANA
o AGENS Graph
? Research prototypes:
o Graphflow (Univesity of Waterloo)
o ingraph (incremental graph engine)
(Source: Keynote talk @ GraphConnect NYC 2017)
9. LINKED DATA BENCHMARK COUNCIL
LDBC is a non-profit organization dedicated to establishing
benchmarks, benchmark practices and benchmark results for
graph data management software.
LDBC¡¯s Social Network Benchmark is an industrial and academic
initiative, formed by principal actors in the field of graph-like
data management.
10. OVERVIEW OF GRAPH PROCESSING
OLTP
analytics
OLAP
local queries
global queries
global computations
11. OLAP global queries
OVERVIEW OF GRAPH PROCESSING
OLTP
analytics
local queries
global computations
Example: ?Friends¡¯ recent likes¡±
MATCH (u:User {id: $userId})-[:FRIEND]-
(f:User)-[l:LIKES]->(p:Post)
RETURN f, p
ORDER BY l.timestamp DESC
LIMIT 10
12. OLAP global queries
OVERVIEW OF GRAPH PROCESSING
OLTP
analytics
local queries
global computations
Orri Erling et al.,
The LDBC Social Network Benchmark: Interactive Workload,
SIGMOD 2015
14 queries and 8 updates
13. OVERVIEW OF GRAPH PROCESSING
OLTP
analytics
local queries
global computations
OLAP global queries
Example: ?One-sided friendships¡±
MATCH (u1:User)-[:FRIEND]-(u2:User)-[l:LIKES]->(p:Post),
(u1)-[:AUTHOR_OF]->(p)
WITH u1, u2, count(l) AS likes
WHERE likes > 10
AND NOT (u1)-[:LIKES]->(:Post)<-[:AUTHOR_OF]-(u2)
RETURN u1, u2
14. OVERVIEW OF GRAPH PROCESSING
OLTP
analytics
local queries
global computations
Arnau Prat, G¨¢bor Sz¨¢rnyas, Alex Averbuch et al.,
The LDBC Social Network Benchmark: BI Workload,
Technical report available, peer-reviewed paper in 2018
OLAP global queries
25 queries with infrequent executions
15. OVERVIEW OF GRAPH PROCESSING
OLTP
analytics
OLAP
local queries
global queries
global computations
? PageRank
? Shortest paths
? Clustering coefficient
Example: ?Find the most central individuals.¡±
Spark: GraphX | Flink: Gelly | Neo4j: Graph Algorithms library
16. OVERVIEW OF GRAPH PROCESSING
OLTP
analytics
OLAP
local queries
global queries
global computations
Alexandru Iosup et al.,
LDBC Graphalytics: A Benchmark for Large-Scale Graph Analysis on
Parallel and Distributed Platforms,
VLDB 2016
One-time execution
17. OVERVIEW OF GRAPH PROCESSING
OLTP
analytics
OLAP
local queries
global queries
global computations
22. TRAILING THE SWITCH
seg div
t
STRAIGHT
ON
MATCH (t:Train)-[:ON]->(seg:Segment)
<-[:STRAIGHT]-(sw:Switch)
WHERE sw.position = 'diverging'
RETURN t.number, sw
Evaluate
continuously
23. INCREMENTAL QUERIES
? Register a set of standing queries
? Continuously evaluate queries on changes
? Approach: build a cache and maintain its content
? First publication: 1974, the Rete algorithm
ingraphclient
register queries
query results
change notifications
update graph
24. ¦Ðt.number, sw
¦Òsw.position = ¡ädiverging¡ä
?
STRAIGHTON
c d e
g
fdiv
2
NEXT NEXT
STRAIGHT TOP
a b
1
NEXT NEXT
ON
ON
div
STRAIGHT
Trailing the switch
ON
25. ¦Ðt.number, sw
¦Òsw.position = ¡ädiverging¡ä
?
STRAIGHTON e2
ON
a1
ON
c d e
g
fdiv
2
NEXT NEXT
STRAIGHT TOP
a b
1
NEXT NEXT
ON
ON
div
STRAIGHT
Trailing the switch
ON
a
1
ON
e
2
ON
26. ¦Ðt.number, sw
¦Òsw.position = ¡ädiverging¡ä
?
STRAIGHTON e2
ON
a1
ON
c d e
g
fdiv
2
NEXT NEXT
STRAIGHT TOP
a b
1
NEXT NEXT
ON
ON
div
STRAIGHT
Trailing the switch
ON
e div
STRAIGHT
e div
STRAIGHT
27. ¦Ðt.number, sw
¦Òsw.position = ¡ädiverging¡ä
?
STRAIGHTON e div
STRAIGHT
e2
ON
a1
ON
e div
STRAIGHT
2
ON
c d e
g
fdiv
2
NEXT NEXT
STRAIGHT TOP
a b
1
NEXT NEXT
ON
ON
div
STRAIGHT
Trailing the switch
ON
e div
STRAIGHT
e2
ON
e div
2
STRAIGHT
ON
28. ¦Ðt.number, sw
¦Òsw.position = ¡ädiverging¡ä
?
STRAIGHTON e div
STRAIGHT
e2
ON
a1
ON
div
STRAIGHTON
e div
STRAIGHT
2
ON
c d e
g
fdiv
2
NEXT NEXT
STRAIGHT TOP
a b
1
NEXT NEXT
ON
ON
div
STRAIGHT
Trailing the switch
ON
e2 div
STRAIGHTON
e2
29. ¦Ðt.number, sw
¦Òsw.position = ¡ädiverging¡ä
?
STRAIGHTON e div
STRAIGHT
e2
ON
a1
ON
e div
STRAIGHT
2
ON
e div
STRAIGHT
2
ON
div2
c d e
g
fdiv
2
NEXT NEXT
STRAIGHT TOP
a b
1
NEXT NEXT
ON
ON
div
STRAIGHT
Trailing the switch
ON
div
2
30. ¦Ðt.number, sw
¦Òsw.position = ¡ädiverging¡ä
?
STRAIGHTON e div
STRAIGHT
e2
ON
a1
ON
e div
STRAIGHT
2
ON
e div
STRAIGHT
2
ON
div2
c d e
g
fdiv
2
NEXT NEXT
STRAIGHT TOP
a b
1
NEXT NEXT
ON
ON
div
STRAIGHT
Trailing the switch
ON
div
2
31. ¦Ðt.number, sw
¦Òsw.position = ¡ädiverging¡ä
?
STRAIGHTON e div
STRAIGHT
e2
ON
a1
ON
e div
STRAIGHT
2
ON
e div
STRAIGHT
2
ON
div2
c e
g
fdiv
NEXT NEXT
STRAIGHT TOP
a b
1
NEXT NEXT
ON
div
STRAIGHT
Trailing the switch
ON
div
ON
2
d
32. ¦Ðt.number, sw
¦Òsw.position = ¡ädiverging¡ä
?
STRAIGHTON e div
STRAIGHT
d2
ON
a1
ON
e div
STRAIGHT
2
ON
e div
STRAIGHT
2
ON
div2
c e
g
fdiv
NEXT NEXT
STRAIGHT TOP
a b
1
NEXT NEXT
ON
div
STRAIGHT
Trailing the switch
ON
div
ON
2
d
33. GRAPH RELATIONAL ALGEBRA
? Basic relational algebra
o projection, selection, join, left outer join, antijoin, union
? Common extensions
o aggregation (?), duplicate-elimination (?), sort (?), top (?)
? Graph-specific extensions
o get-vertices (?)
o expand-out (¡ü), expand-in (¡ý), expand-both (?)
J. Marton, G. Sz¨¢rnyas, D. Varr¨®:
Formalising openCypher Graph Queries in Relational Algebra,
ADBIS, Springer, 2017
36. THEORETICAL ISSUES WITH INCREMENTAL CYPHER
? Graph RA is not incrementally maintainable
o Expand operators
o Property access needs nested data structures (0NF)
o Ordering
o Weak schema
? Most incremental approaches work on flat relational algebra:
o Transform graph relational algebra to a flat one
o Optimize query
G. Sz¨¢rnyas:
Incremental View Maintenance for Property Graph Queries,
arXiv preprint, 2017
39. PROPERTY ACCESS
Assuming that x is a column of a
graph relation, we use the notation
¡°x.a¡± in selection conditions to
express the access to the
corresponding value of property a in
the property graph.
J. H?lsch, M. Grossniklaus:
An algebra and equivalences to transform graph patterns in Neo4j,
GraphQ @ EDBT 2016
40. t, seg
t, seg, t.number
sw, seg
sw, seg, sw.position
t.number, sw.position
¦Ðt.number, sw
¦Òsw.position = ¡ädiverging¡ä
?
(sw:Switch)?[:STRAIGHT]?>(seg:Segment)(t:Train)?[:ON]?>(seg:Segment)
t.number, sw
t.number, sw
t, seg, sw
t, seg, t.number, sw, sw.position
t, seg, sw
t, seg, t.number, sw, sw.position
t.number
t.number, sw.position
sw.positiont.number
2
1. external schema
2. extra attributes
3. internal schema
This is the current
implementation
SCHEMA
INFERENCING
44. SPARK SQL
? ¡°Spark SQL lets you query structured data inside Spark
programs, using either SQL or a familiar DataFrame API.¡±
http://www.gatorsmile.io/sparksqloverview/
http://www.gatorsmile.io/sparksqloverview/
45. SPARK CATALYST
? Tree Manipulation Framework
o ¡°Catalyst is an execution-agnostic framework to represent and
manipulate a dataflow graph, i.e. trees of relational operators and
expressions.¡±
? Optimizer (both cost-based and rule-based)
Catalyst
46. SPARK CATALYST: OBSERVATIONS
? Strong community
? Well-written in general, but noisy here and there (Hive)
? Nice API docs¡ but not much else
49. CATALYST FEATURES: CODE GENERATION
? Generates bytecode for performance
H. Karau, R. Warren:
High Performance Spark: Best Practices for Scaling and
Optimizing Apache Spark
O'Reilly Media, Inc., May 25, 2017
51. MAKING IT SCALE ¦Ðt.number, sw
¦Òsw.position = ¡ädiverging¡ä
?
STRAIGHTON e div
STRAIGHT
d2
ON
a1
ON
div
STRAIGHT
ON
Actors
Async messages
G. Sz¨¢rnyas et al.,
IncQuery-D: A distributed incremental model query framework in the cloud.
ACM/IEEE MODELS, 2014
54. CAPS: CYPHER FOR APACHE SPARK
? An openCypher project
? ¡°CAPS is built on top of the Spark DataFrames API and uses
features such as the Catalyst optimizer.¡±
? Approach
o Compiles to operations to a custom dataflow graph
o Transforms the dataflow graph to queries on the DataFrames API
(backed by Catalyst)
55. LESSONS LEARNT
? Simply extending the SQL model is insufficient
? Implemented new components from scratch
o Logical plans
o Attribute resolver
? Still reused a lot of components
o Data model
o Expressions
o Transformations
o Built-in methods: toString, output, etc.
56. FUTURE DIRECTIONS
? Cost-based optimizer
? Experiment with the LDBC Social Network Benchmark
? Transform queries to SQL
? Integrate engine to Spark
G. Sz¨¢rnyas, A. Prat, A. Averbuch et al.:
The LDBC Social Network Benchmark: BI Workload.
Technical report, peer-reviewed paper in 2018
57. RELATED RESOURCES
Ingraph github.com/ftsrg/ingraph
Cypher for Apache Spark github.com/opencypher/cypher-for-apache-spark
Slizaa openCypher github.com/slizaa/slizaa-opencypher-xtext
Mastering Apache Spark jaceklaskowski.gitbooks.io/mastering-apache-spark
Scala Days presentation people.apache.org/¡ | youtu.be/6bCpISym_0w
Deep dive blogpost databricks.com/blog/2015/04/13/deep-dive-¡
Thanks for the contributions to the ingraph team.