GraphLab under the hood

Zuhair Khayyat

12/10/12
GraphLab overview: GraphLab 1.0
• GraphLab: A New Framework for Parallel Machine Learning
  – High-level abstractions for machine-learning problems
  – Shared-memory multiprocessor
  – Assumes no fault tolerance is needed
  – Concurrent-access processing models with sequential-consistency guarantees
GraphLab overview: GraphLab 1.0
• How does GraphLab 1.0 work?
  – Represents the user's data as a directed graph
  – Each block of data is represented by a vertex and a directed edge
  – Shared data table
  – User functions (sketched below):
    • Update: modifies the vertex and edge state; read-only access to the shared table
    • Fold: sequential aggregation into a key entry in the shared table; can modify vertex data
    • Merge: parallelizes the Fold function
    • Apply: finalizes the key entry in the shared table
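
A minimal sketch of how these four functions fit together, in Python pseudocode rather than GraphLab's actual C++ API; the dict-based graph layout and the PageRank-style update are illustrative assumptions:

    # Hedged sketch; GraphLab exposes these as C++ functors, not Python.
    # `graph` maps each vertex id to a dict with 'value' and 'in_nbrs'.
    def update(v, graph, shared):
        # Update: modify vertex/edge state; `shared` is read-only here.
        total = sum(graph[u]["value"] for u in graph[v]["in_nbrs"])
        graph[v]["value"] = 0.15 + 0.85 * total

    def fold(acc, v, graph):
        # Fold: sequentially aggregate vertex values into one shared-table key.
        return acc + graph[v]["value"]

    def merge(acc_a, acc_b):
        # Merge: combine partial Fold results computed in parallel.
        return acc_a + acc_b

    def apply(acc, num_vertices):
        # Apply: finalize the shared-table entry, e.g. store the mean value.
        return acc / num_vertices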
GraphLab overview: GraphLab 1.0

[Figure]
GraphLab overview: Distributed GraphLab 1.0
• Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud
  – Fault tolerance using a snapshot algorithm
  – Improved distributed parallel processing
  – Two-stage partitioning (a ghost-synchronization sketch follows below):
    • Atoms generated by ParMETIS
    • Ghosts generated by the intersection of the atoms
  – Finalize() function for vertex synchronization
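
To make the ghosting idea concrete, here is a hedged sketch of how a worker might keep read-only ghost copies of boundary vertices owned by a neighboring atom in sync; all names are illustrative, not the Distributed GraphLab implementation:

    # Each worker owns its atom's vertices and caches versioned "ghost"
    # copies of boundary vertices owned by other workers.
    class Worker:
        def __init__(self, owned):
            self.owned = owned      # vertex_id -> (value, version)
            self.ghosts = {}        # vertex_id -> (value, version), read-only

        def push_boundary(self, peer, boundary_ids):
            # The owner pushes newer versions of boundary vertices so the
            # peer's ghosts stay consistent before the next update phase.
            for vid in boundary_ids:
                value, version = self.owned[vid]
                cached = peer.ghosts.get(vid)
                if cached is None or cached[1] < version:
                    peer.ghosts[vid] = (value, version)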
GraphLab overview: Distributed GraphLab 1.0

[Figure]
GraphLab overview: Distributed GraphLab 1.0

[Figure: Worker 1 and Worker 2 with ghost vertices replicated between them]
PowerGraph: Introduction
• GraphLab 2.1
• Problems with highly skewed power-law graphs:
  – Workload imbalance, leading to performance degradation
  – Limited scalability
  – Hard to partition if the graph is too large
  – Storage overhead
  – Non-parallel computation
PowerGraph: New Abstraction
• Original functions:
  – Update
  – Finalize
  – Fold
  – Merge
  – Apply: the synchronization apply
• Introduces the GAS model (sketched below):
  – Gather: over in-, out-, or all neighbors
  – Apply: the GAS-model apply
  – Scatter
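
A minimal sketch of a GAS-style vertex program (a PageRank-like example in Python pseudocode; PowerGraph's real interface is C++, and the names here are illustrative assumptions):

    # Hedged GAS sketch. Vertices are dicts: {'rank': float, 'out_degree': int}.
    def gather(src):                 # runs per in-edge, possibly on
        return src["rank"] / src["out_degree"]   # different machines

    def gather_sum(a, b):            # gathered values are combined with a
        return a + b                 # commutative, associative sum

    def apply(vertex, total):        # runs exactly once per vertex
        vertex["rank"] = 0.15 + 0.85 * total

    def scatter(vertex, neighbors, signal):   # runs per out-edge; may
        for n in neighbors:                   # signal neighbors to recompute
            signal(n)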
PowerGraph: Gather

[Figure: gather phase spanning Worker 1 and Worker 2]
PowerGraph: Apply

[Figure: apply phase spanning Worker 1 and Worker 2]
PowerGraph: Scatter

[Figure: scatter phase spanning Worker 1 and Worker 2]
PowerGraph: Vertex Cut

[Figure: an example graph on vertices A–I with edges A–B, A–H, A–G, B–C, B–H, C–D, C–H, C–I, D–E, D–I, E–F, E–I, F–H, F–G]
PowerGraph: Vertex Cut

[Figure: the same edge list assigned across two machines by a vertex cut; vertices whose edges land on both machines are replicated on both]
PowerGraph: Vertex Cut (Greedy)

[Figure: the same edge list placed greedily, assigning each edge to a machine that already holds one of its endpoints to reduce replication; a sketch of the heuristic follows]
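
A hedged sketch of the greedy edge-placement heuristic illustrated above (simplified; variable names are illustrative, not the PowerGraph code):

    # Greedy vertex cut: place each edge on a machine that already hosts
    # one of its endpoints, breaking ties by the least-loaded machine.
    def greedy_vertex_cut(edges, num_machines):
        hosts = {}                        # vertex -> set of machines holding it
        load = [0] * num_machines
        assignment = {}
        for (u, v) in edges:
            hu = hosts.setdefault(u, set())
            hv = hosts.setdefault(v, set())
            # Prefer machines shared by both endpoints, then by either,
            # and fall back to any machine.
            candidates = (hu & hv) or (hu | hv) or set(range(num_machines))
            m = min(candidates, key=lambda i: load[i])
            assignment[(u, v)] = m
            hu.add(m)
            hv.add(m)
            load[m] += 1
        return assignment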
PowerGraph: Experiment

[Figure]
PowerGraph: Experiment

[Figure]
PowerGraph: Discussion
• Isn't it similar to the Pregel model?
  – Partially process a vertex whenever a message exists
• Gather, Apply and Scatter are commutative, associative operations. What if the computation is not commutative?
  – Sum the message values in a fixed order so that every run produces the same floating-point rounding error (see the example below).
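
The floating-point issue is easy to demonstrate: addition of floats is not associative, so summing the same values in a different order can round differently, and only a fixed order gives reproducible results.

    # Same three values, different association, different result:
    print((0.1 + 0.2) + 0.3)   # 0.6000000000000001
    print(0.1 + (0.2 + 0.3))   # 0.6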
PowerGraph and Mizan
• In Mizan we use partial replication:

[Figure: workers W0 and W1; in the compute phase W0 holds vertices b, c, a, d and W1 holds e, f, g; in the communication phase a replica a' of vertex a appears on W1]
GraphChi: Introduction
• Asynchronous, disk-based version of GraphLab
• Utilizes the Parallel Sliding Windows method
  – Very small number of non-sequential disk accesses
• Support for graph updates
  – Based on Kineograph, a distributed system for processing a continuous in-flow of graph updates while simultaneously running advanced graph-mining algorithms
GraphChi: Graph Constraints
• The graph does not fit in memory
• A vertex, its edges, and their values fit in memory
GraphChi: Disk Storage
• Compressed Sparse Row (CSR, sketched below):
  – Compressed adjacency list with indexes to the edges
  – Fast access to a vertex's out-edges
• Compressed Sparse Column (CSC):
  – CSR of the transpose graph
  – Fast access to a vertex's in-edges
• Shard: stores the edges' data
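
A minimal CSR sketch (an illustrative in-memory layout, not GraphChi's on-disk format): an index array bounds each vertex's slice of a concatenated adjacency array, so out-edges are read with one contiguous access.

    # CSR for a 4-vertex graph with edges 0->1, 0->2, 1->2, 2->0, 2->3, 3->0
    index = [0, 2, 3, 5, 6]        # index[v]..index[v+1] bounds v's out-edges
    targets = [1, 2, 2, 0, 3, 0]   # all adjacency lists, concatenated

    def out_neighbors(v):
        return targets[index[v]:index[v + 1]]

    print(out_neighbors(2))        # [0, 3]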
GraphChi: Loading the Graph
• The input graph is split into P disjoint vertex intervals, chosen to balance the number of edges; each interval is associated with a shard
• A shard contains the data of the edges of its interval
• The subgraph of an interval is constructed while reading that interval
GraphChi: Parallel Sliding Windows
• The vertices of an interval are processed in parallel
• P sequential disk accesses are required to process each interval
• The lengths of the intervals vary with the graph's degree distribution
• P × P disk accesses are required for one superstep (see the sketch below)
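
A hedged sketch of one PSW superstep (helper names are illustrative, simplified from the GraphChi paper): interval i's own shard is loaded fully for its in-edges, while a sliding window over each of the other P − 1 shards supplies its out-edges, giving P sequential reads per interval.

    # One Parallel Sliding Windows superstep over P intervals (schematic).
    def superstep(intervals, shards, update):
        P = len(intervals)
        for i in range(P):
            memory_shard = shards[i].load_all()        # in-edges of interval i
            windows = [shards[j].next_window(intervals[i])
                       for j in range(P) if j != i]    # sliding out-edge reads
            subgraph = build_subgraph(intervals[i], memory_shard, windows)
            for v in subgraph.vertices:                # parallel in practice
                update(v)
            shards[i].write_back(memory_shard)         # persist edge values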
GraphChi: Example

Executing interval (1,2):

[Figure: intervals (1,2), (3,4), (5,6)]
GraphChi: Example

Executing interval (3,4):

[Figure: intervals (1,2), (3,4), (5,6)]
GraphChi: Example

[Figure]
GraphChi: Evolving Graphs
• Adding an edge is reflected in the intervals and shards when they are next read
• Deleting an edge causes that edge to be ignored
• Edge additions and deletions are applied after the current interval has been processed (see the sketch below)
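
One way to picture this (a hypothetical sketch, not GraphChi's actual implementation): updates are buffered while an interval executes and merged into the shards only at interval boundaries.

    # Buffer graph updates; apply them between interval executions (schematic).
    pending_adds, pending_dels = [], set()

    def add_edge(u, v):
        pending_adds.append((u, v))

    def delete_edge(u, v):
        pending_dels.add((u, v))       # deleted edges are ignored when read

    def on_interval_finished(shards):
        # `shard_for` and `remove_edges` are illustrative helper names.
        for (u, v) in pending_adds:
            shards.shard_for(v).insert(u, v)
        shards.remove_edges(pending_dels)
        pending_adds.clear()
        pending_dels.clear()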
GraphChi: Preprocessing

[Figure]
Thank you
The Blog wants YOU

thegraphsblog.wordpress.com/
