ºÝºÝߣ

ºÝºÝߣShare a Scribd company logo
Andra Lungu
Flink committer
andra.lungu
@campus.tu-berlin.de
Large-Scale Graph
Processing with
Apache Flink
What is Gelly?
 Large-scale graph processing API
 On top of Flink’s Java API
 Official release: Flink 0.9
 Off-the shelf library methods
 Supports record and graph analysis
applications; iterative algorithms
2
The Growing Flink Stack
3
How to use Gelly?
4
Graph Creation
5
Graph Properties
 getVertices()
 getEdges()
 getVertexIds()
 getEdgeIds()
 inDegrees()
 outDegrees()
 getDegrees()
 numberOfVertices()
 numberOfEdges()
 getTriplets()
6
Graph Transformations
 Map
• mapVertices(final MapFunction<Vertex<K, VV>,
NV> mapper)
• mapEdges(final MapFunction<Edge<K, EV>, NV>
mapper)
 Filter
• filterOnVertices(FilterFunction<Vertex<K, VV>>
vertexFilter)
• filterOnEdges(FilterFunction<Edge<K, EV>>
edgeFilter)
• subgraph(FilterFunction<Vertex<K, VV>>
vertexFilter, FilterFunction<Edge<K, EV>>
edgeFilter)
7
Filter Functions
8
Graph Transformations
 Join
• joinWithVertices(DataSet<Tuple2<K, T>>
inputDataSet, final MapFunction<Tuple2<VV,
T>, VV> mapper)
• joinWithEdges(DataSet<Tuple3<K, K, T>>
inputDataSet, final MapFunction<Tuple2<EV,
T>, EV> mapper)
• joinWithEdgesOnSource /
joinWithEdgesOnTarget
 Reverse
 Undirected
9
Union
10
Graph Mutations
 addVertex(final Vertex<K, VV> vertex)
 addVertices(List<Vertex<K, VV>>
verticesToAdd)
 addEdge(Vertex<K, VV> source, Vertex<K, VV>
target, EV edgeValue)
 addEdges(List<Edge<K, EV>> newEdges)
 removeVertex(Vertex<K, VV> vertex)
 removeVertices(List<Vertex<K, VV>>
verticesToBeRemoved)
 removeEdge(Edge<K, EV> edge)
 removeEdges(List<Edge<K, EV>>
edgesToBeRemoved)
11
Neighborhood Methods
 reduceOnNeighbors(reduceNeighborsFunc
tion, direction)
 reduceOnEdges
 groupReduceOnNeighbors;
groupReduceOnEdges
12
Graph Validation
 Given criteria:
• Edge IDs correspond to vertex IDs
13
Vertex-centric Iterations
 Pregel [BSP] Execution Model
 UDFs:
• Messaging Function
• VertexUpdateFunction
 S-1: receive messages from neighbors
 S: update vertex values
 S+1: send new value to neighbors
14
Single Source Shortest Paths
15
SSSP – Second Superstep
16
SSSP - Result
17
SSSP – code snippet
18
Gather-Sum-Apply Iterations
 UDFs:
• GatherFunction
• SumFunction
• ApplyFunction
 Back to SSSP:
• Gather: neighbor value + edge weight
• Sum/Accumulate: choose min
• Apply: compare computed min and update
19
SSSP – Superstep 1
20
SSSP – Superstep 2
21
SSSP - Result
22
SSSP – code snippet
23
Vertex-centric or GSA?
 Messaging = Gather + Sum
 Update = Apply
 Skewed graphs? – GSA (parallel
gather)
 coGroup vs. reduce
 GSA gathers from immediate neighbors;
 Vertex-centric send to any vertex
24
Library of Algorithms
 Weakly Connected Components
 Community Detection
 Page Rank
 Single Source Shortest Paths
 Label Propagation
25
Music Profiles Example
26
Input Data
 <user-id, song-id, play-count>
 Set of bad records [IDs]
27
Filter out Bad Records
28
Compute Top Songs/User
29
Compute Top Songs/User
30
Create a user-user Graph
31
Create a user-user Graph
32
Cluster Similar Users
33
Coming up Next
 Gelly Blog Post
 Scala API
 More Library Methods
 Flink Streaming Integration
 Graph Partitioning Techniques
 Specialized Operators for Highly Skewed
Graphs
 Bipartite Graph Support
Curious? Gelly Roadmap
34
flink.apache.org
@ApacheFlink
user@flink.apache.org
dev@flink.apache.org

More Related Content

Flink Gelly - Karlsruhe - June 2015