GraphLab is a framework for parallel machine learning that represents the user's data as a directed graph, provides high-level abstractions, and guarantees sequential consistency. Distributed GraphLab adds fault tolerance via a snapshot algorithm and partitions the graph in two stages. PowerGraph addresses GraphLab's limitations on highly skewed power-law graphs through a new abstraction built from Gather, Apply, and Scatter (GAS) functions. GraphChi is an asynchronous, disk-based version of GraphLab that processes large graphs with the parallel sliding windows method over vertex intervals to minimize non-sequential disk access.
2. GraphLab overview: GraphLab 1.0
● GraphLab: A New Framework For Parallel Machine Learning
– High-level abstractions for machine learning problems
– Shared-memory multiprocessor setting
– Assumes no fault tolerance is needed
– Concurrent processing models with sequential-consistency guarantees
3. GraphLab overview: GraphLab 1.0
● How does GraphLab 1.0 work?
– Represents the user's data as a directed graph
– Each block of data is attached to a vertex or a directed edge
– Shared data table
– User functions (sketched below):
● Update: modifies vertex and edge state; read-only access to the shared table
● Fold: sequential aggregation into a key entry in the shared table; may modify vertex data
● Merge: parallelizes the Fold function
● Apply: finalizes the key entry in the shared table
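A minimal sketch of these four user functions, using PageRank as the example; the dict-based graph and the exact signatures below are illustrative assumptions, not GraphLab's real C++ API:

# Toy in-memory graph: vertex -> {rank, out-neighbors, in-neighbors}
graph = {
    "a": {"rank": 1.0, "out": ["b", "c"], "in": ["c"]},
    "b": {"rank": 1.0, "out": ["c"], "in": ["a"]},
    "c": {"rank": 1.0, "out": ["a"], "in": ["a", "b"]},
}

def update(v):
    # Update: modifies the vertex's state (the shared table is read-only here).
    total = sum(graph[u]["rank"] / len(graph[u]["out"]) for u in graph[v]["in"])
    graph[v]["rank"] = 0.15 + 0.85 * total

def fold(acc, v):
    # Fold: sequentially aggregates vertex data toward one table key.
    return acc + graph[v]["rank"]

def merge(acc_a, acc_b):
    # Merge: combines partial Fold results, letting Fold run in parallel.
    return acc_a + acc_b

def apply(acc):
    # Apply: finalizes the shared-table key entry (here, the average rank).
    return acc / len(graph)

for v in graph:
    update(v)
shared_table = {"avg_rank": apply(merge(fold(0.0, "a"), fold(fold(0.0, "b"), "c")))}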
5. GraphLab overview: Distributed GraphLab 1.0
● Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud
– Fault tolerance using a snapshot algorithm
– Improved distributed parallel processing
– Two-stage partitioning:
● Atoms generated by ParMetis
● Ghosts generated by the intersection of the atoms
– Finalize() function for vertex synchronization
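A sketch of how the two stages might fit together, with ParMetis's output abstracted as a precomputed vertex-to-atom map; the data structures are assumptions for illustration:

from collections import defaultdict

def build_atoms(edges, vertex_to_atom):
    # Stage one (ParMetis) assigned every vertex to an atom; stage two
    # derives ghosts: boundary vertices replicated into neighboring atoms.
    owned = defaultdict(set)   # vertices owned by each atom
    ghosts = defaultdict(set)  # remote vertices cached locally
    for u, v in edges:
        au, av = vertex_to_atom[u], vertex_to_atom[v]
        owned[au].add(u)
        owned[av].add(v)
        if au != av:
            # The edge crosses atoms, so each side keeps a ghost copy of
            # the other endpoint for local reads; Finalize()-style
            # synchronization keeps the copies consistent.
            ghosts[au].add(v)
            ghosts[av].add(u)
    return owned, ghosts

owned, ghosts = build_atoms([("a", "b"), ("b", "c")], {"a": 0, "b": 0, "c": 1})
print(dict(owned), dict(ghosts))  # atom 0 owns a,b; c is a ghost in atom 0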
8. PowerGraph: Introduction
● GraphLab 2.1
● Problems with highly skewed power-law graphs:
– Workload imbalance ==> performance degradation
– Limited scalability
– Hard to partition when the graph is too large
– High storage cost
– No parallelism within a single vertex's computation
9. PowerGraph: New Abstraction
● Original functions:
– Update
– Finalize
– Fold
– Merge
– Apply: the synchronization apply
● Introduces the GAS model (sketched below):
– Gather: over in-, out-, or all neighbors
– Apply: the GAS-model apply
– Scatter
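A minimal sketch of the GAS decomposition using PageRank on a toy in-memory graph; the dict-based layout and function names are illustrative assumptions (PowerGraph's real interface is C++):

out_nbrs = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
in_nbrs = {"a": ["c"], "b": ["a"], "c": ["a", "b"]}
rank = {v: 1.0 for v in out_nbrs}

def gather(u):
    # Gather runs per in-edge, so even a single high-degree vertex's
    # work can be spread across machines.
    return rank[u] / len(out_nbrs[u])

def gather_sum(a, b):
    # Combiner for partial gather results: commutative and associative.
    return a + b

def apply(v, acc):
    # Apply runs once per vertex on the combined accumulator.
    rank[v] = 0.15 + 0.85 * acc

def scatter(v):
    # Scatter runs per out-edge; here it just names the neighbors that
    # would be activated for the next iteration.
    return out_nbrs[v]

for v in out_nbrs:
    acc = 0.0
    for u in in_nbrs[v]:
        acc = gather_sum(acc, gather(u))
    apply(v, acc)
    scatter(v)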
18. PowerGraph: Discussion
● Isn't this similar to the Pregel model?
– It partially processes a vertex whenever a message exists
● Gather, Apply, and Scatter must be commutative and associative operations. What if the computation is not commutative?
– Sum the message values in a fixed order so that every run produces the same floating-point rounding error (see below)
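Why the order matters: floating-point addition is not associative, so combining partial results in different orders yields different rounding, as this snippet shows:

a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0 -- the 1.0 is absorbed before the cancellation
# Fixing the combination order makes runs bit-for-bit reproducible.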
19. PowerGraph and Mizan
● In Mizan we use partial replication:
[Figure: two workers, W0 (vertices a-d) and W1 (vertices e-g), shown in a compute phase and a communication phase; vertex a is partially replicated on W1 as a', and the replica is synchronized during the communication phase]
20. GraphChi: Introduction
● Asynchronous disk-based version of GraphLab
● Uses the parallel sliding windows method
– Very few non-sequential accesses to the disk
● Support for graph updates
– Based on Kineograph, a distributed system for processing a continuous in-flow of graph updates while simultaneously running advanced graph-mining algorithms
21. GraphChi: Graph Constraints
● The graph does not fit in memory
● A vertex, its edges, and their values fit in memory
22. GraphChi: Disk storage
● Compressed Sparse Row (CSR):
– Compressed adjacency list with indexes into the edge array
– Fast access to a vertex's out-edges
● Compressed Sparse Column (CSC):
– CSR of the transposed graph
– Fast access to a vertex's in-edges
● Shard: stores the edges' data (see the CSR sketch below)
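A minimal CSR sketch; this in-memory layout is an illustrative assumption, not GraphChi's exact on-disk format, and CSC is the same structure built over the transposed edge list:

def build_csr(num_vertices, edges):
    # offsets[v]..offsets[v+1] delimit v's out-neighbors in `neighbors`.
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]      # prefix sum -> start positions
    neighbors = [0] * len(edges)
    cursor = offsets[:]                   # next write slot per vertex
    for src, dst in edges:
        neighbors[cursor[src]] = dst
        cursor[src] += 1
    return offsets, neighbors

offsets, neighbors = build_csr(3, [(0, 1), (0, 2), (1, 2)])
print(neighbors[offsets[0]:offsets[1]])   # [1, 2]: vertex 0's out-edges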
23. GraphChi: Loading the graph
● The input graph is split into P disjoint vertex intervals chosen to balance edge counts, each associated with a shard (see the sketch below)
● A shard contains the data of the edges whose destinations lie in its interval
● The subgraph of an interval is constructed while its shard is read
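A sketch of the splitting criterion named above, balancing edge counts across P vertex intervals; the greedy cut rule here is an assumption for illustration:

def split_intervals(in_degrees, P):
    # in_degrees[v] = number of in-edges of vertex v; returns P
    # (first_vertex, last_vertex) pairs with roughly equal edge totals.
    target = sum(in_degrees) / P
    intervals, start, acc = [], 0, 0
    for v, deg in enumerate(in_degrees):
        acc += deg
        if acc >= target and len(intervals) < P - 1:
            intervals.append((start, v))
            start, acc = v + 1, 0
    intervals.append((start, len(in_degrees) - 1))
    return intervals

print(split_intervals([1, 4, 1, 1, 4, 1], 2))  # [(0, 2), (3, 5)], 6 edges each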
24. GraphChi: Parallel Sliding Windows
● The vertices of an interval are processed in parallel
● P sequential disk reads are required to process each interval
● Interval lengths vary with the graph's degree distribution
● P * P disk reads are required for one superstep (sketched below)
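A toy sketch of one pass. Shard j holds the in-edges of interval j sorted by source vertex, so the "window" of any shard for the current interval is one contiguous, sequentially readable block; everything below is an in-memory stand-in for GraphChi's disk layout:

def superstep(intervals, shards, update):
    P = len(intervals)
    for i, (lo, hi) in enumerate(intervals):      # one interval at a time
        block = list(shards[i])                   # memory shard: all in-edges
        for j in range(P):
            if j != i:
                # Sliding window of shard j: out-edges of the current
                # interval (contiguous on disk because sorted by source).
                block += [e for e in shards[j] if lo <= e[0] <= hi]
        for v in range(lo, hi + 1):               # vertices updated in parallel
            update(v, [e for e in block if e[1] == v])

intervals = [(0, 1), (2, 3)]
shards = [[(2, 0), (0, 1)], [(1, 2), (0, 3), (2, 3)]]  # in-edges per interval
superstep(intervals, shards, lambda v, in_edges: print(v, in_edges))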
28. GraphChi: Evolving Graphs
● Adding an edge is reflected in the intervals and shards when they are next read
● Deleting an edge causes that edge to be ignored
● Edge additions and deletions are applied after the current interval has been processed (see below)
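A sketch of that rule, with the buffers and the flush point as illustrative assumptions:

pending_adds, pending_deletes = [], []

def add_edge(e):
    pending_adds.append(e)       # becomes visible in later intervals

def delete_edge(e):
    pending_deletes.append(e)    # the edge is ignored from now on

def finish_interval(shard):
    # Additions and deletions are folded in only after the current
    # interval has been processed.
    shard.extend(pending_adds)
    for e in pending_deletes:
        if e in shard:
            shard.remove(e)
    pending_adds.clear()
    pending_deletes.clear()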