ºÝºÝߣ

ºÝºÝߣShare a Scribd company logo
Compiling openCypher graph queries
with Spark Catalyst
G¨¢bor Sz¨¢rnyas
pre-holiday Spark des f¨ºtes @ Montr¨¦al
BACKGROUND
? PhD student @ Budapest Univ. of Tech. and Econ., Hungary
? Visiting researcher @ McGill University
RESEARCH TOPIC
Problem statement
? Large graph (100M+ nodes)
? Complex global graph queries
? Evaluate them in <1sec
Approach
? We ¡°cheat¡±
? Build a huge cache
? Maintain results:
incremental views
RESEARCH OBJECTIVES
Create a scalable graph query engine with incremental views
1. graph queries
2. incremental views
3. making it scale
Graph queries
PROPERTY GRAPH DATABASES
NoSQL family
Data model:
vertices, edges
and properties
#1 query approach: graph pattern matching
Note. Spark GraphX is an engine for graph analytics.
CYPHER AND OPENCYPHER
Cypher: query language of the Neo4j graph database.
?Cypher is a declarative, SQL-inspired language for describing
patterns in graphs visually using an ascii-art syntax.¡±
MATCH
(p:Person)-[:PRESENTER_OF]->(:Presentation)-[:AT]->(m:Meetup)
WHERE m.date = 'Monday, December 18, 2017'
RETURN p
?The openCypher project aims to deliver a full and open
specification of the industry¡¯s most widely adopted graph
database query language: Cypher.¡± (late 2015)
OPENCYPHER SYSTEMS
? Increasing adoption
? Relational databases:
o SAP HANA
o AGENS Graph
? Research prototypes:
o Graphflow (Univesity of Waterloo)
o ingraph (incremental graph engine)
(Source: Keynote talk @ GraphConnect NYC 2017)
LINKED DATA BENCHMARK COUNCIL
LDBC is a non-profit organization dedicated to establishing
benchmarks, benchmark practices and benchmark results for
graph data management software.
LDBC¡¯s Social Network Benchmark is an industrial and academic
initiative, formed by principal actors in the field of graph-like
data management.
OVERVIEW OF GRAPH PROCESSING
OLTP
analytics
OLAP
local queries
global queries
global computations
OLAP global queries
OVERVIEW OF GRAPH PROCESSING
OLTP
analytics
local queries
global computations
Example: ?Friends¡¯ recent likes¡±
MATCH (u:User {id: $userId})-[:FRIEND]-
(f:User)-[l:LIKES]->(p:Post)
RETURN f, p
ORDER BY l.timestamp DESC
LIMIT 10
OLAP global queries
OVERVIEW OF GRAPH PROCESSING
OLTP
analytics
local queries
global computations
Orri Erling et al.,
The LDBC Social Network Benchmark: Interactive Workload,
SIGMOD 2015
14 queries and 8 updates
OVERVIEW OF GRAPH PROCESSING
OLTP
analytics
local queries
global computations
OLAP global queries
Example: ?One-sided friendships¡±
MATCH (u1:User)-[:FRIEND]-(u2:User)-[l:LIKES]->(p:Post),
(u1)-[:AUTHOR_OF]->(p)
WITH u1, u2, count(l) AS likes
WHERE likes > 10
AND NOT (u1)-[:LIKES]->(:Post)<-[:AUTHOR_OF]-(u2)
RETURN u1, u2
OVERVIEW OF GRAPH PROCESSING
OLTP
analytics
local queries
global computations
Arnau Prat, G¨¢bor Sz¨¢rnyas, Alex Averbuch et al.,
The LDBC Social Network Benchmark: BI Workload,
Technical report available, peer-reviewed paper in 2018
OLAP global queries
25 queries with infrequent executions
OVERVIEW OF GRAPH PROCESSING
OLTP
analytics
OLAP
local queries
global queries
global computations
? PageRank
? Shortest paths
? Clustering coefficient
Example: ?Find the most central individuals.¡±
Spark: GraphX | Flink: Gelly | Neo4j: Graph Algorithms library
OVERVIEW OF GRAPH PROCESSING
OLTP
analytics
OLAP
local queries
global queries
global computations
Alexandru Iosup et al.,
LDBC Graphalytics: A Benchmark for Large-Scale Graph Analysis on
Parallel and Distributed Platforms,
VLDB 2016
One-time execution
OVERVIEW OF GRAPH PROCESSING
OLTP
analytics
OLAP
local queries
global queries
global computations
Incremental view maintenance
CYBER-PHYSICAL SYSTEMS: LIVE RAILWAY MODEL
Trailing the switch
Proximity detection
CYBER-PHYSICAL SYSTEMS: LIVE RAILWAY MODEL
c d e
g
fdiv
2
NEXT NEXT
STRAIGHT TOP
ON
a b
1
NEXT
ON
NEXT
PROXIMITY DETECTION
seg
1
NEXT: 1..2
t1
ON
MATCH
(t1:Train)-[:ON]->(seg1:Segment)
-[:NEXT*1..2]->(seg2:Segment)
<-[:ON]-(t2:Train)
RETURN t1, t2, seg1, seg2
seg
2
t2
ON
¡Ü ? segments
TRAILING THE SWITCH
seg div
t
STRAIGHT
ON
MATCH (t:Train)-[:ON]->(seg:Segment)
<-[:STRAIGHT]-(sw:Switch)
WHERE sw.position = 'diverging'
RETURN t.number, sw
Evaluate
continuously
INCREMENTAL QUERIES
? Register a set of standing queries
? Continuously evaluate queries on changes
? Approach: build a cache and maintain its content
? First publication: 1974, the Rete algorithm
ingraphclient
register queries
query results
change notifications
update graph
¦Ðt.number, sw
¦Òsw.position = ¡ädiverging¡ä
?
STRAIGHTON
c d e
g
fdiv
2
NEXT NEXT
STRAIGHT TOP
a b
1
NEXT NEXT
ON
ON
div
STRAIGHT
Trailing the switch
ON
¦Ðt.number, sw
¦Òsw.position = ¡ädiverging¡ä
?
STRAIGHTON e2
ON
a1
ON
c d e
g
fdiv
2
NEXT NEXT
STRAIGHT TOP
a b
1
NEXT NEXT
ON
ON
div
STRAIGHT
Trailing the switch
ON
a
1
ON
e
2
ON
¦Ðt.number, sw
¦Òsw.position = ¡ädiverging¡ä
?
STRAIGHTON e2
ON
a1
ON
c d e
g
fdiv
2
NEXT NEXT
STRAIGHT TOP
a b
1
NEXT NEXT
ON
ON
div
STRAIGHT
Trailing the switch
ON
e div
STRAIGHT
e div
STRAIGHT
¦Ðt.number, sw
¦Òsw.position = ¡ädiverging¡ä
?
STRAIGHTON e div
STRAIGHT
e2
ON
a1
ON
e div
STRAIGHT
2
ON
c d e
g
fdiv
2
NEXT NEXT
STRAIGHT TOP
a b
1
NEXT NEXT
ON
ON
div
STRAIGHT
Trailing the switch
ON
e div
STRAIGHT
e2
ON
e div
2
STRAIGHT
ON
¦Ðt.number, sw
¦Òsw.position = ¡ädiverging¡ä
?
STRAIGHTON e div
STRAIGHT
e2
ON
a1
ON
div
STRAIGHTON
e div
STRAIGHT
2
ON
c d e
g
fdiv
2
NEXT NEXT
STRAIGHT TOP
a b
1
NEXT NEXT
ON
ON
div
STRAIGHT
Trailing the switch
ON
e2 div
STRAIGHTON
e2
¦Ðt.number, sw
¦Òsw.position = ¡ädiverging¡ä
?
STRAIGHTON e div
STRAIGHT
e2
ON
a1
ON
e div
STRAIGHT
2
ON
e div
STRAIGHT
2
ON
div2
c d e
g
fdiv
2
NEXT NEXT
STRAIGHT TOP
a b
1
NEXT NEXT
ON
ON
div
STRAIGHT
Trailing the switch
ON
div
2
¦Ðt.number, sw
¦Òsw.position = ¡ädiverging¡ä
?
STRAIGHTON e div
STRAIGHT
e2
ON
a1
ON
e div
STRAIGHT
2
ON
e div
STRAIGHT
2
ON
div2
c d e
g
fdiv
2
NEXT NEXT
STRAIGHT TOP
a b
1
NEXT NEXT
ON
ON
div
STRAIGHT
Trailing the switch
ON
div
2
¦Ðt.number, sw
¦Òsw.position = ¡ädiverging¡ä
?
STRAIGHTON e div
STRAIGHT
e2
ON
a1
ON
e div
STRAIGHT
2
ON
e div
STRAIGHT
2
ON
div2
c e
g
fdiv
NEXT NEXT
STRAIGHT TOP
a b
1
NEXT NEXT
ON
div
STRAIGHT
Trailing the switch
ON
div
ON
2
d
¦Ðt.number, sw
¦Òsw.position = ¡ädiverging¡ä
?
STRAIGHTON e div
STRAIGHT
d2
ON
a1
ON
e div
STRAIGHT
2
ON
e div
STRAIGHT
2
ON
div2
c e
g
fdiv
NEXT NEXT
STRAIGHT TOP
a b
1
NEXT NEXT
ON
div
STRAIGHT
Trailing the switch
ON
div
ON
2
d
GRAPH RELATIONAL ALGEBRA
? Basic relational algebra
o projection, selection, join, left outer join, antijoin, union
? Common extensions
o aggregation (?), duplicate-elimination (?), sort (?), top (?)
? Graph-specific extensions
o get-vertices (?)
o expand-out (¡ü), expand-in (¡ý), expand-both (?)
J. Marton, G. Sz¨¢rnyas, D. Varr¨®:
Formalising openCypher Graph Queries in Relational Algebra,
ADBIS, Springer, 2017
Compiling openCypher graph queries with Spark Catalyst
Compiling openCypher graph queries with Spark Catalyst
THEORETICAL ISSUES WITH INCREMENTAL CYPHER
? Graph RA is not incrementally maintainable
o Expand operators
o Property access needs nested data structures (0NF)
o Ordering
o Weak schema
? Most incremental approaches work on flat relational algebra:
o Transform graph relational algebra to a flat one
o Optimize query
G. Sz¨¢rnyas:
Incremental View Maintenance for Property Graph Queries,
arXiv preprint, 2017
PROPOSED WORKFLOW
? Parse
? Compile
? Evaluate query
openCypher
query
Magic
Deployed
query
AST
QUERY ¡°TRAILING THE ³§°Â±õ°Õ°ä±á¡±
PROPERTY ACCESS
Assuming that x is a column of a
graph relation, we use the notation
¡°x.a¡± in selection conditions to
express the access to the
corresponding value of property a in
the property graph.
J. H?lsch, M. Grossniklaus:
An algebra and equivalences to transform graph patterns in Neo4j,
GraphQ @ EDBT 2016
t, seg
t, seg, t.number
sw, seg
sw, seg, sw.position
t.number, sw.position
¦Ðt.number, sw
¦Òsw.position = ¡ädiverging¡ä
?
(sw:Switch)?[:STRAIGHT]?>(seg:Segment)(t:Train)?[:ON]?>(seg:Segment)
t.number, sw
t.number, sw
t, seg, sw
t, seg, t.number, sw, sw.position
t, seg, sw
t, seg, t.number, sw, sw.position
t.number
t.number, sw.position
sw.positiont.number
2
1. external schema
2. extra attributes
3. internal schema
This is the current
implementation
SCHEMA
INFERENCING
Compiling openCypher graph queries with Spark Catalyst
MATCH (t:Train)-[:ON]->(seg:Segment)
<-[:STRAIGHT]-(sw:Switch)
WHERE sw.position = 'diverging'
RETURN t.number, sw
openCypher
query
AST Graph RA
Graph RA Flat RANested RA
Deployed
query
SPARK SQL
? ¡°Spark SQL lets you query structured data inside Spark
programs, using either SQL or a familiar DataFrame API.¡±
http://www.gatorsmile.io/sparksqloverview/
http://www.gatorsmile.io/sparksqloverview/
SPARK CATALYST
? Tree Manipulation Framework
o ¡°Catalyst is an execution-agnostic framework to represent and
manipulate a dataflow graph, i.e. trees of relational operators and
expressions.¡±
? Optimizer (both cost-based and rule-based)
Catalyst
SPARK CATALYST: OBSERVATIONS
? Strong community
? Well-written in general, but noisy here and there (Hive)
? Nice API docs¡­ but not much else
CATALYST EXAMPLES: TREE TRANSFORMATION
CATALYST EXAMPLES: ATTRIBUTE RESOLVER
CATALYST FEATURES: CODE GENERATION
? Generates bytecode for performance
H. Karau, R. Warren:
High Performance Spark: Best Practices for Scaling and
Optimizing Apache Spark
O'Reilly Media, Inc., May 25, 2017
Scalable graph queries
MAKING IT SCALE ¦Ðt.number, sw
¦Òsw.position = ¡ädiverging¡ä
?
STRAIGHTON e div
STRAIGHT
d2
ON
a1
ON
div
STRAIGHT
ON
Actors
Async messages
G. Sz¨¢rnyas et al.,
IncQuery-D: A distributed incremental model query framework in the cloud.
ACM/IEEE MODELS, 2014
openCypher
query
AST Graph RA
Graph RA Flat RANested RA
Deployed
query
ARCHITECTURE
Related work and summary
CAPS: CYPHER FOR APACHE SPARK
? An openCypher project
? ¡°CAPS is built on top of the Spark DataFrames API and uses
features such as the Catalyst optimizer.¡±
? Approach
o Compiles to operations to a custom dataflow graph
o Transforms the dataflow graph to queries on the DataFrames API
(backed by Catalyst)
LESSONS LEARNT
? Simply extending the SQL model is insufficient
? Implemented new components from scratch
o Logical plans
o Attribute resolver
? Still reused a lot of components
o Data model
o Expressions
o Transformations
o Built-in methods: toString, output, etc.
FUTURE DIRECTIONS
? Cost-based optimizer
? Experiment with the LDBC Social Network Benchmark
? Transform queries to SQL
? Integrate engine to Spark
G. Sz¨¢rnyas, A. Prat, A. Averbuch et al.:
The LDBC Social Network Benchmark: BI Workload.
Technical report, peer-reviewed paper in 2018
RELATED RESOURCES
Ingraph github.com/ftsrg/ingraph
Cypher for Apache Spark github.com/opencypher/cypher-for-apache-spark
Slizaa openCypher github.com/slizaa/slizaa-opencypher-xtext
Mastering Apache Spark jaceklaskowski.gitbooks.io/mastering-apache-spark
Scala Days presentation people.apache.org/¡­ | youtu.be/6bCpISym_0w
Deep dive blogpost databricks.com/blog/2015/04/13/deep-dive-¡­
Thanks for the contributions to the ingraph team.

More Related Content

Compiling openCypher graph queries with Spark Catalyst

  • 1. Compiling openCypher graph queries with Spark Catalyst G¨¢bor Sz¨¢rnyas pre-holiday Spark des f¨ºtes @ Montr¨¦al
  • 2. BACKGROUND ? PhD student @ Budapest Univ. of Tech. and Econ., Hungary ? Visiting researcher @ McGill University
  • 3. RESEARCH TOPIC Problem statement ? Large graph (100M+ nodes) ? Complex global graph queries ? Evaluate them in <1sec Approach ? We ¡°cheat¡± ? Build a huge cache ? Maintain results: incremental views
  • 4. RESEARCH OBJECTIVES Create a scalable graph query engine with incremental views 1. graph queries 2. incremental views 3. making it scale
  • 6. PROPERTY GRAPH DATABASES NoSQL family Data model: vertices, edges and properties #1 query approach: graph pattern matching Note. Spark GraphX is an engine for graph analytics.
  • 7. CYPHER AND OPENCYPHER Cypher: query language of the Neo4j graph database. ?Cypher is a declarative, SQL-inspired language for describing patterns in graphs visually using an ascii-art syntax.¡± MATCH (p:Person)-[:PRESENTER_OF]->(:Presentation)-[:AT]->(m:Meetup) WHERE m.date = 'Monday, December 18, 2017' RETURN p ?The openCypher project aims to deliver a full and open specification of the industry¡¯s most widely adopted graph database query language: Cypher.¡± (late 2015)
  • 8. OPENCYPHER SYSTEMS ? Increasing adoption ? Relational databases: o SAP HANA o AGENS Graph ? Research prototypes: o Graphflow (Univesity of Waterloo) o ingraph (incremental graph engine) (Source: Keynote talk @ GraphConnect NYC 2017)
  • 9. LINKED DATA BENCHMARK COUNCIL LDBC is a non-profit organization dedicated to establishing benchmarks, benchmark practices and benchmark results for graph data management software. LDBC¡¯s Social Network Benchmark is an industrial and academic initiative, formed by principal actors in the field of graph-like data management.
  • 10. OVERVIEW OF GRAPH PROCESSING OLTP analytics OLAP local queries global queries global computations
  • 11. OLAP global queries OVERVIEW OF GRAPH PROCESSING OLTP analytics local queries global computations Example: ?Friends¡¯ recent likes¡± MATCH (u:User {id: $userId})-[:FRIEND]- (f:User)-[l:LIKES]->(p:Post) RETURN f, p ORDER BY l.timestamp DESC LIMIT 10
  • 12. OLAP global queries OVERVIEW OF GRAPH PROCESSING OLTP analytics local queries global computations Orri Erling et al., The LDBC Social Network Benchmark: Interactive Workload, SIGMOD 2015 14 queries and 8 updates
  • 13. OVERVIEW OF GRAPH PROCESSING OLTP analytics local queries global computations OLAP global queries Example: ?One-sided friendships¡± MATCH (u1:User)-[:FRIEND]-(u2:User)-[l:LIKES]->(p:Post), (u1)-[:AUTHOR_OF]->(p) WITH u1, u2, count(l) AS likes WHERE likes > 10 AND NOT (u1)-[:LIKES]->(:Post)<-[:AUTHOR_OF]-(u2) RETURN u1, u2
  • 14. OVERVIEW OF GRAPH PROCESSING OLTP analytics local queries global computations Arnau Prat, G¨¢bor Sz¨¢rnyas, Alex Averbuch et al., The LDBC Social Network Benchmark: BI Workload, Technical report available, peer-reviewed paper in 2018 OLAP global queries 25 queries with infrequent executions
  • 15. OVERVIEW OF GRAPH PROCESSING OLTP analytics OLAP local queries global queries global computations ? PageRank ? Shortest paths ? Clustering coefficient Example: ?Find the most central individuals.¡± Spark: GraphX | Flink: Gelly | Neo4j: Graph Algorithms library
  • 16. OVERVIEW OF GRAPH PROCESSING OLTP analytics OLAP local queries global queries global computations Alexandru Iosup et al., LDBC Graphalytics: A Benchmark for Large-Scale Graph Analysis on Parallel and Distributed Platforms, VLDB 2016 One-time execution
  • 17. OVERVIEW OF GRAPH PROCESSING OLTP analytics OLAP local queries global queries global computations
  • 19. CYBER-PHYSICAL SYSTEMS: LIVE RAILWAY MODEL Trailing the switch Proximity detection
  • 20. CYBER-PHYSICAL SYSTEMS: LIVE RAILWAY MODEL c d e g fdiv 2 NEXT NEXT STRAIGHT TOP ON a b 1 NEXT ON NEXT
  • 22. TRAILING THE SWITCH seg div t STRAIGHT ON MATCH (t:Train)-[:ON]->(seg:Segment) <-[:STRAIGHT]-(sw:Switch) WHERE sw.position = 'diverging' RETURN t.number, sw Evaluate continuously
  • 23. INCREMENTAL QUERIES ? Register a set of standing queries ? Continuously evaluate queries on changes ? Approach: build a cache and maintain its content ? First publication: 1974, the Rete algorithm ingraphclient register queries query results change notifications update graph
  • 24. ¦Ðt.number, sw ¦Òsw.position = ¡ädiverging¡ä ? STRAIGHTON c d e g fdiv 2 NEXT NEXT STRAIGHT TOP a b 1 NEXT NEXT ON ON div STRAIGHT Trailing the switch ON
  • 25. ¦Ðt.number, sw ¦Òsw.position = ¡ädiverging¡ä ? STRAIGHTON e2 ON a1 ON c d e g fdiv 2 NEXT NEXT STRAIGHT TOP a b 1 NEXT NEXT ON ON div STRAIGHT Trailing the switch ON a 1 ON e 2 ON
  • 26. ¦Ðt.number, sw ¦Òsw.position = ¡ädiverging¡ä ? STRAIGHTON e2 ON a1 ON c d e g fdiv 2 NEXT NEXT STRAIGHT TOP a b 1 NEXT NEXT ON ON div STRAIGHT Trailing the switch ON e div STRAIGHT e div STRAIGHT
  • 27. ¦Ðt.number, sw ¦Òsw.position = ¡ädiverging¡ä ? STRAIGHTON e div STRAIGHT e2 ON a1 ON e div STRAIGHT 2 ON c d e g fdiv 2 NEXT NEXT STRAIGHT TOP a b 1 NEXT NEXT ON ON div STRAIGHT Trailing the switch ON e div STRAIGHT e2 ON e div 2 STRAIGHT ON
  • 28. ¦Ðt.number, sw ¦Òsw.position = ¡ädiverging¡ä ? STRAIGHTON e div STRAIGHT e2 ON a1 ON div STRAIGHTON e div STRAIGHT 2 ON c d e g fdiv 2 NEXT NEXT STRAIGHT TOP a b 1 NEXT NEXT ON ON div STRAIGHT Trailing the switch ON e2 div STRAIGHTON e2
  • 29. ¦Ðt.number, sw ¦Òsw.position = ¡ädiverging¡ä ? STRAIGHTON e div STRAIGHT e2 ON a1 ON e div STRAIGHT 2 ON e div STRAIGHT 2 ON div2 c d e g fdiv 2 NEXT NEXT STRAIGHT TOP a b 1 NEXT NEXT ON ON div STRAIGHT Trailing the switch ON div 2
  • 30. ¦Ðt.number, sw ¦Òsw.position = ¡ädiverging¡ä ? STRAIGHTON e div STRAIGHT e2 ON a1 ON e div STRAIGHT 2 ON e div STRAIGHT 2 ON div2 c d e g fdiv 2 NEXT NEXT STRAIGHT TOP a b 1 NEXT NEXT ON ON div STRAIGHT Trailing the switch ON div 2
  • 31. ¦Ðt.number, sw ¦Òsw.position = ¡ädiverging¡ä ? STRAIGHTON e div STRAIGHT e2 ON a1 ON e div STRAIGHT 2 ON e div STRAIGHT 2 ON div2 c e g fdiv NEXT NEXT STRAIGHT TOP a b 1 NEXT NEXT ON div STRAIGHT Trailing the switch ON div ON 2 d
  • 32. ¦Ðt.number, sw ¦Òsw.position = ¡ädiverging¡ä ? STRAIGHTON e div STRAIGHT d2 ON a1 ON e div STRAIGHT 2 ON e div STRAIGHT 2 ON div2 c e g fdiv NEXT NEXT STRAIGHT TOP a b 1 NEXT NEXT ON div STRAIGHT Trailing the switch ON div ON 2 d
  • 33. GRAPH RELATIONAL ALGEBRA ? Basic relational algebra o projection, selection, join, left outer join, antijoin, union ? Common extensions o aggregation (?), duplicate-elimination (?), sort (?), top (?) ? Graph-specific extensions o get-vertices (?) o expand-out (¡ü), expand-in (¡ý), expand-both (?) J. Marton, G. Sz¨¢rnyas, D. Varr¨®: Formalising openCypher Graph Queries in Relational Algebra, ADBIS, Springer, 2017
  • 36. THEORETICAL ISSUES WITH INCREMENTAL CYPHER ? Graph RA is not incrementally maintainable o Expand operators o Property access needs nested data structures (0NF) o Ordering o Weak schema ? Most incremental approaches work on flat relational algebra: o Transform graph relational algebra to a flat one o Optimize query G. Sz¨¢rnyas: Incremental View Maintenance for Property Graph Queries, arXiv preprint, 2017
  • 37. PROPOSED WORKFLOW ? Parse ? Compile ? Evaluate query openCypher query Magic Deployed query AST
  • 38. QUERY ¡°TRAILING THE ³§°Â±õ°Õ°ä±á¡±
  • 39. PROPERTY ACCESS Assuming that x is a column of a graph relation, we use the notation ¡°x.a¡± in selection conditions to express the access to the corresponding value of property a in the property graph. J. H?lsch, M. Grossniklaus: An algebra and equivalences to transform graph patterns in Neo4j, GraphQ @ EDBT 2016
  • 40. t, seg t, seg, t.number sw, seg sw, seg, sw.position t.number, sw.position ¦Ðt.number, sw ¦Òsw.position = ¡ädiverging¡ä ? (sw:Switch)?[:STRAIGHT]?>(seg:Segment)(t:Train)?[:ON]?>(seg:Segment) t.number, sw t.number, sw t, seg, sw t, seg, t.number, sw, sw.position t, seg, sw t, seg, t.number, sw, sw.position t.number t.number, sw.position sw.positiont.number 2 1. external schema 2. extra attributes 3. internal schema This is the current implementation SCHEMA INFERENCING
  • 42. MATCH (t:Train)-[:ON]->(seg:Segment) <-[:STRAIGHT]-(sw:Switch) WHERE sw.position = 'diverging' RETURN t.number, sw openCypher query AST Graph RA
  • 43. Graph RA Flat RANested RA Deployed query
  • 44. SPARK SQL ? ¡°Spark SQL lets you query structured data inside Spark programs, using either SQL or a familiar DataFrame API.¡± http://www.gatorsmile.io/sparksqloverview/ http://www.gatorsmile.io/sparksqloverview/
  • 45. SPARK CATALYST ? Tree Manipulation Framework o ¡°Catalyst is an execution-agnostic framework to represent and manipulate a dataflow graph, i.e. trees of relational operators and expressions.¡± ? Optimizer (both cost-based and rule-based) Catalyst
  • 46. SPARK CATALYST: OBSERVATIONS ? Strong community ? Well-written in general, but noisy here and there (Hive) ? Nice API docs¡­ but not much else
  • 47. CATALYST EXAMPLES: TREE TRANSFORMATION
  • 49. CATALYST FEATURES: CODE GENERATION ? Generates bytecode for performance H. Karau, R. Warren: High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark O'Reilly Media, Inc., May 25, 2017
  • 51. MAKING IT SCALE ¦Ðt.number, sw ¦Òsw.position = ¡ädiverging¡ä ? STRAIGHTON e div STRAIGHT d2 ON a1 ON div STRAIGHT ON Actors Async messages G. Sz¨¢rnyas et al., IncQuery-D: A distributed incremental model query framework in the cloud. ACM/IEEE MODELS, 2014
  • 52. openCypher query AST Graph RA Graph RA Flat RANested RA Deployed query ARCHITECTURE
  • 53. Related work and summary
  • 54. CAPS: CYPHER FOR APACHE SPARK ? An openCypher project ? ¡°CAPS is built on top of the Spark DataFrames API and uses features such as the Catalyst optimizer.¡± ? Approach o Compiles to operations to a custom dataflow graph o Transforms the dataflow graph to queries on the DataFrames API (backed by Catalyst)
  • 55. LESSONS LEARNT ? Simply extending the SQL model is insufficient ? Implemented new components from scratch o Logical plans o Attribute resolver ? Still reused a lot of components o Data model o Expressions o Transformations o Built-in methods: toString, output, etc.
  • 56. FUTURE DIRECTIONS ? Cost-based optimizer ? Experiment with the LDBC Social Network Benchmark ? Transform queries to SQL ? Integrate engine to Spark G. Sz¨¢rnyas, A. Prat, A. Averbuch et al.: The LDBC Social Network Benchmark: BI Workload. Technical report, peer-reviewed paper in 2018
  • 57. RELATED RESOURCES Ingraph github.com/ftsrg/ingraph Cypher for Apache Spark github.com/opencypher/cypher-for-apache-spark Slizaa openCypher github.com/slizaa/slizaa-opencypher-xtext Mastering Apache Spark jaceklaskowski.gitbooks.io/mastering-apache-spark Scala Days presentation people.apache.org/¡­ | youtu.be/6bCpISym_0w Deep dive blogpost databricks.com/blog/2015/04/13/deep-dive-¡­ Thanks for the contributions to the ingraph team.