SLIDE 1

Tackling Large Graphs with Secondary Storage

Amitabha Roy EPFL

1

SLIDE 2

Graphs

  • Social networks
  • Document networks
  • Biological networks
  • Humans, phones, bank accounts

2

SLIDE 3

Graphs are Difficult

  • Graph mining is a challenging problem
  • Traversal leads to data-dependent accesses
  • Little predictability
  • Hard to parallelize efficiently

3

SLIDE 4

Tackling Large Graphs

  • Normal approach
  • Throw resources at the problem
  • What does it take to process a trillion edges?

4

SLIDE 5

Big Iron

Graph edges | Hardware
1 trillion  | Tsubame
1 trillion  | Cray
1 trillion  | Blue Gene
1 trillion  | NEC

HPC/Graph500 benchmarks (June 2014)

5

SLIDE 6

Large Clusters

Avery Ching (Facebook), Strata, 2/13/2014: yes, using 3940 machines

6

SLIDE 7

Big Data

  • Data is growing exponentially
  • 40 Zettabytes by 2020
  • Unlikely you can put it all in DRAM
  • Need PM, SSDs, magnetic disks
  • Secondary storage != DRAM
  • Also applicable to graphs

7

SLIDE 8

Motivation

  • 32 machines x 2TB magnetic disk = 64 TB storage
  • 1 trillion edges x 16 bytes per edge = 16 TB storage

If I can store the graph, then why can’t I process it?

8

SLIDE 9

Problem #1

  • Irregular access patterns

[Figure: example graph (vertices 1–6) whose traversal order is data-dependent, versus the sequential storage order 1 2 3 4 5 6]

9

SLIDE 10

Problem #1

  • Random access penalties

Medium | Random-access penalty
RAM    | 1.4X
SSD    | 20X
Disk   | 200X

2 ms seeks on a graph with a trillion edges ~ 1 year!


10

SLIDE 11

Problem #2

  • Partitioning graphs across machines is hard
  • Random partitions are very poor for real-world graphs

Twitter graph: 20X difference with 32 machines!

11

SLIDE 12

Outline

  • X-Stream (addresses problem #1)
  • SlipStream (addresses problem #2)

12

SLIDE 13

X-Stream

  • Single-machine graph processing system [SOSP’13]
  • Turns graph processing into sequential access
  • Changes the computation model
  • Partitions the graph

13

SLIDE 14

Scatter-Gather

[Figure: example graph with vertices 1–6]

Existing computational model

14

SLIDE 15

Scatter-Gather

[Figure: example graph with vertices 1–6]

Activate vertex

15

SLIDE 16

Scatter-Gather

[Figure: example graph with vertices 1–6]

Scatter Updates

16

SLIDE 17

Scatter-Gather

[Figure: example graph with vertices 1–6]

Gather Updates
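
For readers unfamiliar with the model, here is a minimal sketch of a vertex-centric scatter-gather iteration (illustrative only; the function and variable names are mine, not X-Stream's API). Note that the scatter phase needs an index from each active vertex to its out-edges, which is exactly the data-dependent random access pattern from slide 9.

```python
# Minimal vertex-centric scatter-gather sketch (illustrative; not X-Stream's API).
# out_edges is an index vertex -> neighbours, so scatter performs data-dependent lookups.

def vertex_centric_iteration(active, value, out_edges, scatter, gather):
    updates = {}                                # destination -> list of update values
    for v in active:                            # scatter: only active vertices send
        for dst in out_edges[v]:
            updates.setdefault(dst, []).append(scatter(value[v]))
    next_active = set()
    for dst, incoming in updates.items():       # gather: apply updates at destinations
        new_val = value[dst]
        for u in incoming:
            new_val = gather(new_val, u)
        if new_val != value[dst]:
            value[dst] = new_val
            next_active.add(dst)
    return next_active

# Example: BFS levels on the slides' toy graph (vertices 1-6).
out_edges = {1: [5, 6], 2: [3], 3: [5], 4: [], 5: [], 6: [2, 4]}
value = {v: float("inf") for v in out_edges}
value[1] = 0
active = {1}
while active:
    active = vertex_centric_iteration(active, value, out_edges,
                                      scatter=lambda d: d + 1, gather=min)
print(value)   # BFS distance from vertex 1
```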

17

SLIDE 18

Storage

[Figure: the example graph stored as an edge list (1 → 5, 1 → 6, 6 → 2, 6 → 4) and a vertex array (1 2 3 4 5 6)]

18

SLIDE 19

Edge File

[Figure: the edge file (1 → 5, 1 → 6, 6 → 2, 6 → 4) alongside the vertex array 1–6]

19

SLIDE 20

Edge File

[Figure: vertex-centric access jumps around the edge file (1 → 5, 1 → 6, 6 → 2, 6 → 4), causing a SEEK]

20

SLIDE 21

Edge-centric Scatter-Gather

[Figure: the edge file (1 → 5, 1 → 6, 6 → 2, 6 → 4) is read with a sequential SCAN]

Scan the entire edge list

21

SLIDE 22

Edge-centric Scatter-Gather

[Figure: the edge file is SCANned sequentially; only the necessary edges produce work]

Use only the necessary edges (see the sketch below)
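
A hedged sketch of this edge-centric variant follows (my own pseudocode, not X-Stream's actual interface): scatter streams the whole unordered edge list sequentially and emits updates only for edges whose source is active; gather then streams the resulting update list. Nothing ever seeks into the edge file.

```python
# Edge-centric scatter-gather sketch (illustrative; not X-Stream's actual interface).
# The edge list is consumed as a sequential stream: we may touch edges whose source
# is inactive (wasted work), but we never seek.

def edge_centric_iteration(active, value, edge_stream, scatter, gather):
    updates = []                                  # sequentially written update stream
    for src, dst in edge_stream:                  # SCAN the entire edge list
        if src in active:
            updates.append((dst, scatter(value[src])))
    next_active = set()
    for dst, u in updates:                        # SCAN the update stream
        new_val = gather(value[dst], u)
        if new_val != value[dst]:
            value[dst] = new_val
            next_active.add(dst)
    return next_active

# Same toy BFS as before, now driven purely by streaming the unordered edge list.
edges = [(1, 5), (1, 6), (6, 2), (6, 4), (2, 3), (3, 5)]
value = {v: float("inf") for v in range(1, 7)}
value[1] = 0
active = {1}
while active:
    active = edge_centric_iteration(active, value, edges,
                                    scatter=lambda d: d + 1, gather=min)
print(value)
```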

22

SLIDE 23

Tradeoff

✔ Achieve sequential bandwidth
✖ Need to scan the entire edge list

Winning Tradeoff!

23

SLIDE 24

Winning Tradeoff

  • Real-world graphs have small diameter
  • Traversals complete in just a few scatter-gather iterations
  • Large number of active vertices in most iterations

24

SLIDE 25

Benefit

[Figure: the edge file can be SCANned in any order]

Order oblivious

25

SLIDE 26

What about the vertices?

[Figure: the edge file is SCANned sequentially, but updates hit the vertex array (1 2 3 4 5 6) at arbitrary positions, causing SEEKs]

26

SLIDE 27

What about the vertices?

[Figure: edge file SCAN, vertex array SEEKs]

Seeking in RAM is free! How can we fit the vertices in RAM?

27

SLIDE 28

Streaming Partitions

[Figure: the edge list (1 → 5, 1 → 6, 6 → 2, 6 → 4, 2 → 3, 3 → 5) and the vertex array are split into streaming partitions]

Each partition’s vertices fit in RAM

28

SLIDE 29

Streaming Partitions

[Figure: one streaming partition’s vertices are loaded into RAM while its edges are SCANned sequentially]

Load the partition’s vertices into RAM, then SCAN its edges

29

SLIDE 30

Producing Partitions

  • No requirement on partition quality (number of cross edges)
  • Partitions need only fit into RAM
  • Random partitions work great (see the sketch below)
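
To make this concrete, here is a rough sketch of random streaming partitions (the layout and names are mine, not X-Stream's on-disk format; X-Stream also shuffles updates into the destination's partition before gathering, which is omitted here for brevity). Vertices are hashed into partitions small enough for each partition's vertex state to fit in RAM, and each edge is filed under its source's partition.

```python
# Streaming-partition sketch (illustrative layout, not X-Stream's on-disk format).
# Partition quality (cut edges) does not matter; the only requirement is that each
# partition's vertex state fits in RAM, so hashing vertex ids is good enough.

NUM_PARTITIONS = 2                       # in practice: vertex state size / RAM size

def partition_of(vertex):
    return hash(vertex) % NUM_PARTITIONS

def build_partitions(edges):
    edge_files = [[] for _ in range(NUM_PARTITIONS)]
    for src, dst in edges:               # one sequential pass over the input edges
        edge_files[partition_of(src)].append((src, dst))
    return edge_files

def scatter_pass(edge_files, value, active, scatter):
    updates = []
    for p, edge_file in enumerate(edge_files):
        # Load only partition p's vertex state into RAM, then stream its edges.
        in_ram = {v: value[v] for v in value if partition_of(v) == p}
        for src, dst in edge_file:       # sequential scan of this partition's edges
            if src in active:
                updates.append((dst, scatter(in_ram[src])))
    return updates

edge_files = build_partitions([(1, 5), (1, 6), (6, 2), (6, 4), (2, 3), (3, 5)])
print([len(f) for f in edge_files])      # edges per streaming partition
```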

30

SLIDE 31

Algorithms Supported

  • Supports traversal algorithms
  • BFS, WCC, MIS, SCC, K-Cores, SSSP, BC (WCC sketched below)
  • Supports algebraic operations on the graph
  • BP, ALS, SpMV, Pagerank
  • Good testbed for newer streaming algorithms
  • HyperANF, Semi-streaming Triangle Counting
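
As one concrete example from the list above, weakly connected components (WCC) by label propagation fits the edge-centric model directly. This is my own formulation, not code from X-Stream:

```python
# WCC via label propagation in scatter-gather form (illustrative formulation).
# Every vertex starts with its own id as a label; scatter sends the label along each
# edge in both directions (weakly connected); gather keeps the minimum label seen.
# At the fixpoint, every vertex holds the minimum vertex id of its component.

def wcc(num_vertices, edges):
    label = {v: v for v in range(1, num_vertices + 1)}
    active = set(label)
    while active:
        updates = []
        for src, dst in edges:                       # sequential edge-list scan
            if src in active:
                updates.append((dst, label[src]))
            if dst in active:                        # treat the edge as undirected
                updates.append((src, label[dst]))
        active = set()
        for v, lbl in updates:
            if lbl < label[v]:
                label[v] = lbl
                active.add(v)
    return label

print(wcc(6, [(1, 5), (1, 6), (6, 2), (6, 4), (2, 3), (3, 5)]))
# {1: 1, 2: 1, ..., 6: 1} -- the toy graph is a single component
```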

31

SLIDE 32

Competition

  • GraphChi [OSDI’12]
  • Another on-disk graph processing system
  • Special on-disk data structure: shards
  • Makes accesses look sequential
  • Producing shards requires sorting edges

32

SLIDE 33

SSD

[Chart: runtime in seconds (up to 3000 s) on SSD for Netflix/ALS, Twitter/Pagerank and RMAT27/WCC, comparing GraphChi’s sharding time alone against X-Stream’s total time]

33

SLIDE 34

More Competition

  • Applies to any two-level memory hierarchy
  • Includes CPU cache and DRAM
  • Main-memory graph processing?
  • Looked at Ligra (PPoPP 2013)

34

SLIDE 35

35

BFS

[Chart: BFS runtime in seconds (log scale, 0.1 to 100) vs. number of CPUs (1, 2, 4, 8, 16), Ligra vs. X-Stream]

SLIDE 36

36

BFS

[Chart: the same BFS comparison, now including Ligra’s setup time (log scale, 0.1 to 1000 s)]

SLIDE 37

Where we stand

[Chart: scale reached, in edges (10 billion to 1 trillion)]

Powergraph (OSDI’12), Ligra (PPoPP’13)
X-Stream (SOSP’13): 1 machine
Pregel (SIGMOD’10): 300 machines

How do we get further? Scale out

37

SLIDE 38

SlipStream

  • Aggregate bandwidth and storage of a cluster
  • Solves the graph partitioning problem
  • Rethinking storage access
  • Rethinking streaming partition execution
  • We know how to do it right for one machine

38

SLIDE 39

Scaling Out

  • Assign different streaming partitions to machines

Graph partitioning is hard to get right

39

SLIDE 40

Load Imbalance

[Figure: two machines, each running one streaming partition (SP): red and blue]

40

SLIDE 41

Load Imbalance

[Figure: one streaming partition is still running while the other machines sit IDLE]

41

SLIDE 42

Flat Storage

[Figure: streaming partitions (red, blue) over disks striped across all machines]

Stripe data across all disks; allow any machine to access any disk

✔ Balance capacity  ✔ Balance bandwidth

42

SLIDE 43

Flat Storage

[Figure: the striped disks presented to every machine as a single “Flat Storage Box”]

Stripe data across all disks; allow any machine to access any disk (a minimal sketch follows)
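
A minimal sketch of the striping idea (the mapping below is a generic round-robin layout of my own, not the system's actual block placement; the 64 MB block size is also an assumption):

```python
# Round-robin striping sketch: a logical file becomes blocks spread over every disk
# in the rack, so reading it draws on the aggregate bandwidth of all disks.
# (Illustrative mapping only; not SlipStream's actual placement function.)

BLOCK_SIZE = 64 * 1024 * 1024            # assumed block size, not from the slides

def locate(block_index, num_machines):
    """Map a logical block index to (machine, local block offset)."""
    machine = block_index % num_machines
    local_offset = block_index // num_machines
    return machine, local_offset

def blocks_for(start_byte, end_byte):
    first = start_byte // BLOCK_SIZE
    last = (end_byte - 1) // BLOCK_SIZE
    return range(first, last + 1)

# Reading 1 GB of the edge file touches blocks on many machines at once:
for b in blocks_for(0, 1 << 30):
    print(b, locate(b, num_machines=32))
```

Slides 60 to 64 show that even this deterministic placement function is unnecessary here: because streaming is order oblivious, blocks can be placed and fetched randomly instead.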

43

SLIDE 44

Flat Storage

  • Assumes a full-bisection-bandwidth network
  • Can be done at data-center scale
  • Nightingale et al., OSDI 2012, using CLOS switches
  • Already true at rack scale
  • As in our cluster

44

SLIDE 45

Flat Storage

[Figure: four streaming partition workers (red and blue partitions) over the Flat Storage Box]

45

SLIDE 46

Flat Storage

[Figure: only the red partition is left; the other machines are IDLE]

Using only half the available bandwidth

46

SLIDE 47

Extracting Parallelism

  • Edge-centric loop
  • Stream in edges/updates
  • Access vertices
  • What if we kept independent copies of the vertices on the machines? (sketched below)
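
A hedged sketch of the idea (my own formulation): each machine streams a disjoint chunk of the edge list against its own private copy of the vertex state, and the copies are reconciled later by the merge step of slide 52.

```python
# Parallel scatter sketch (illustrative): two workers each stream half of the edge
# list against an independent copy of the vertex state; no coordination is needed
# until the merge step.

def scatter_chunk(edge_chunk, my_vertex_copy, active, scatter):
    updates = []
    for src, dst in edge_chunk:                 # sequential scan of this chunk
        if src in active:
            updates.append((dst, scatter(my_vertex_copy[src])))
    return updates

edges = [(1, 5), (1, 6), (6, 2), (6, 4), (2, 3), (3, 5)]
vertex_state = {v: float("inf") for v in range(1, 7)}
vertex_state[1] = 0
copies = [dict(vertex_state), dict(vertex_state)]   # one private copy per machine
chunks = [edges[:3], edges[3:]]                     # the edge stream, split in two
all_updates = [scatter_chunk(c, copy, {1}, lambda d: d + 1)
               for c, copy in zip(chunks, copies)]
print(all_updates)
```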

47

SLIDE 48

Extracting Parallelism

[Figure: the edge-centric loop: scan the edge/update stream against the vertices, scatter/gather]

48

SLIDE 49

Scatter Step

[Figure: scatter step: scan the edges against the vertices]

49

SLIDE 50

Scatter Step

[Figure: machine 1 and machine 2 scatter in parallel, each scanning edges from the Flat Storage Box against its own copy of the vertices]

50

SLIDE 51

Gather Step

[Figure: machine 1 and machine 2 gather in parallel, each scanning updates from the Flat Storage Box into its own copy of the vertices]

51

SLIDE 52

Merge Step

[Figure: the vertex copies on machine 1 and machine 2 are merged]

Application of updates is commutative

Merge the vertex copies; no need to go to disk

52

SLIDE 53

X-Stream to SlipStream

SlipStream graph algorithms = X-Stream graph algorithms + Merge function

  • Easy to write a merge function (it looks like gather; see the sketch below)
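
A hedged sketch of what such a merge function could look like for the BFS/WCC-style examples above (my formulation, not the system's API). The key point from the slide is that applying updates is commutative, so combining per-machine copies value by value with the same operator used in gather gives the single-machine result.

```python
# Merge step sketch: reconcile per-machine vertex copies after a superstep.
# The combine operator is the same kind of commutative function used in gather
# (min for BFS/WCC, sum for PageRank-style partial contributions, and so on).

def merge(copies, combine):
    merged = dict(copies[0])
    for copy in copies[1:]:
        for v, val in copy.items():
            merged[v] = combine(merged[v], val)
    return merged

machine1 = {1: 0, 2: float("inf"), 5: 1, 6: 1}
machine2 = {1: 0, 2: 2, 5: float("inf"), 6: 1}
print(merge([machine1, machine2], combine=min))   # {1: 0, 2: 2, 5: 1, 6: 1}
```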

53

SLIDE 54

Putting it Together

[Figure: the red streaming partition being processed over the Flat Storage Box]

54

SLIDE 55

Putting it Together

[Figure: the red partition’s vertex state is copied to the idle machine]

55

SLIDE 56

Putting it Together

[Figure: both machines now process the red partition]

✔ Back to full bandwidth

56

SLIDE 57

Automatic Load Balancing

[Figure: a Flat Storage Box feeding a Compute Box]

57

SLIDE 58

Recap

  • Graph partitioning across machines is hard
  • Drop locality using flat storage
  • Make the cluster’s disks look like one disk
  • Run the same streaming partition on multiple nodes
  • Extract full bandwidth from the aggregated disks
  • A systems approach to solving an algorithms problem

58

SLIDE 59

Flat Storage

  • Distributed Storage layer for SlipStream
  • Looked at other designs
  • FDS (OSDI 2012)
  • GFS (SOSP 2003)
  • Implementing distributed storage is hard ☹

59

SLIDE 60

The Hard Bit

Store Block X

60

SLIDE 61

The Hard Bit

Where is block X?

Need a location service f: file, block → machine, offset

61

SLIDE 62

Block Location

Store block of updates

62

SLIDE 63

Block Location is Irrelevant

Give me any block of updates

Streaming is order oblivious!

63

SLIDE 64

Random Schedule

  • Replace the centralized metadata service with randomization
  • Connect to a random machine for each load/store
  • Extremely simple implementation (see the sketch below)
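
A minimal sketch of the randomized schedule (illustrative; the class and method names are mine, not SlipStream's protocol): a writer pushes each block of updates to a uniformly random machine, and a reader asks a random machine for whatever block it still holds. No metadata is needed because the streaming computation does not care which block arrives next.

```python
import random

# Randomized block schedule sketch (illustrative; not SlipStream's actual protocol).
# store(): push a block of updates to a uniformly chosen machine.
# fetch_any(): ask a random machine for *any* block it still holds; the order is
# irrelevant because streaming scatter-gather is order oblivious.

class FlatUpdateStore:
    def __init__(self, num_machines):
        self.machines = [[] for _ in range(num_machines)]

    def store(self, block):
        random.choice(self.machines).append(block)

    def fetch_any(self):
        candidates = [m for m in self.machines if m]
        return random.choice(candidates).pop() if candidates else None

store = FlatUpdateStore(num_machines=4)
for block_id in range(10):
    store.store(("updates", block_id))
while (blk := store.fetch_any()) is not None:
    print(blk)                       # blocks come back in an arbitrary order
```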

64

SLIDE 65

Downside ?

  • Can lead to collisions
  • Collisions reduce utilization

[Figure: two streaming partitions (red, blue) both draw rand() = 1 and target the same machine, colliding]

65

SLIDE 66

No Downside

  • Utilization is lower-bounded by 1 - 1/e ≈ 63% (see the check below)
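
The figure follows from a balls-into-bins argument: if each of n requesters independently picks one of n machines, the expected fraction of machines that receive at least one request is 1 - (1 - 1/n)^n, which tends to 1 - 1/e ≈ 0.632 as n grows. A quick numeric check (mine, not from the slides):

```python
import random

# Expected fraction of machines hit when n requesters each pick one of n machines
# uniformly at random: 1 - (1 - 1/n)**n  ->  1 - 1/e ~ 0.632 as n grows.
for n in (4, 32, 1024):
    analytic = 1 - (1 - 1 / n) ** n
    trials = 2000
    hit = sum(len({random.randrange(n) for _ in range(n)}) for _ in range(trials))
    print(f"n={n:5d}  analytic={analytic:.3f}  simulated={hit / (trials * n):.3f}")
```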

66

SLIDE 67

Recap

  • Building distributed storage is hard
  • An algorithms approach to solving a systems problem
  • Streaming algorithms are order oblivious
  • Randomized schedule

67

SLIDE 68

Evaluation Results

[Figure: evaluation rack of 32 machines (numbered 1 to 32), each with 32 cores, 32 GB RAM, a 200 GB SSD and a 2 TB 5200 RPM magnetic disk, connected by 10 GigE with full bisection bandwidth]
68

SLIDE 69

Scalability

  • Solve larger problems using more machines
  • Used synthetic scale-free graphs
  • Double problem size (vertices and edges)
  • Double machine count
  • Up to 32 machines, 4 billion vertices, 64 billion edges

69

SLIDE 70

Scaling RMAT (SSD)

[Chart: normalized wall time (1X to 4X) vs. number of machines (1, 2, 4, 8, 16, 32) for PR, BFS, SCC, WCC, BP, MCST, Cond., MIS, SPMV and SSSP]

32X problem size at 2.7X cost

70

SLIDE 71

Scaling RMAT (SSD)

[Chart: the same scaling data, annotated with where the extra cost goes: collisions, engineering overheads, and loss of sequentiality (roughly 0.5X, 1X and 0.5X)]

32X problem size at 2.7X cost

71

SLIDE 72

Capacity

  • Largest graph we can fit in our cluster
  • 32 billion vertices, 1 trillion edges
  • Magnetic disks
  • BFS
  • Projected seek time was ~1 year

72

SLIDE 73

Terascale

Metric    | Value
Wall time | 2d 9h
MTEPS     | 5
I/O       | 282 TB
BW        | 1.53 GB/s

Don’t need supercomputers or very large clusters
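
As a rough back-of-the-envelope consistency check of the table (my arithmetic, assuming 1 TB = 10^12 bytes):

```python
# Rough consistency check of the terascale BFS numbers (assumes 1 TB = 10^12 bytes).
wall_seconds = 2 * 86400 + 9 * 3600      # "2d 9h" = 205,200 s
edges = 1e12                             # one trillion edges
io_bytes = 282e12                        # 282 TB of I/O

print(edges / wall_seconds / 1e6)        # ~4.9 MTEPS, matching the reported ~5
print(io_bytes / wall_seconds / 1e9)     # ~1.4 GB/s, the same ballpark as 1.53 GB/s
```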

73

SLIDE 74

Terascale

Metric    | Value
Wall time | 2d 9h
MTEPS     | 5
I/O       | 282 TB
BW        | 1.53 GB/s

Results computed directly from the unordered edge list

74

SLIDE 75

SlipStream vs. Competition

System     | RAM    | Pre-process | Run
Powergraph | 128 GB | 1271 s      | 103 s
SlipStream | 32 GB  | none        | 1854 s

WCC on RMAT (128M vertices, 2B edges), 2 machines

Preprocessing your data for locality can take a lot of time!

75

SLIDE 76

Where we stand

[Chart: scale reached, in edges (10 billion to 1 trillion)]

Powergraph (OSDI’12), Ligra (PPoPP’13)
X-Stream (SOSP’13): 1 machine
Pregel (SIGMOD’10): 300 machines
SlipStream: 32 machines

How do we get further? Buy more disks :)

76

SLIDE 77

Conclusion

  • Process large graphs using secondary storage
  • Match algorithm to systems: streaming
  • Match system to algorithms: order obliviousness
  • If you can store it, you can process it

77