SLIDE 1

Tackling Large Graphs with Secondary Storage

Amitabha Roy EPFL

1

SLIDE 2

Graphs

  • Social networks
  • Document networks
  • Biological networks
  • Humans, phones, bank accounts

2

SLIDE 3

Graphs are Difficult

  • Graph mining is a challenging problem
  • Traversal leads to data-dependent accesses
  • Little predictability
  • Hard to parallelize efficiently

3

SLIDE 4

Tackling Large Graphs

  • Normal approach
  • Throw resources at the problem
  • What does it take to process a trillion edges?

4

SLIDE 5

Big Iron

Graph edges | Hardware
1 trillion  | Tsubame
1 trillion  | Cray
1 trillion  | Blue Gene
1 trillion  | NEC

HPC/Graph500 benchmarks (June 2014)

5

SLIDE 6

Large Clusters

Avery Ching (Facebook), Strata, 2/13/2014: yes, using 3940 machines

6

SLIDE 7

Big Data

  • Data is growing exponentially
  • 40 Zettabytes by 2020
  • Unlikely you can put it all in DRAM
  • Need PM, SSDs, magnetic disks
  • Secondary storage != DRAM
  • Also applicable to graphs

7

SLIDE 8

Motivation

  • 32 machines x 2TB magnetic disk = 64 TB storage
  • 1 trillion edges x 16 bytes per edge = 16 TB storage

If I can store the graph, then why can’t I process it?

8

SLIDE 9

Problem #1

  • Irregular access patterns

[Figure: example graph (vertices 1–6) whose traversal order is data-dependent, versus the sequential storage order 1 2 3 4 5 6]

9

SLIDE 10

Problem #1

  • Random access penalties

Medium | Random-access penalty
RAM    | 1.4X
SSD    | 20X
Disk   | 200X

2 ms seeks on a graph with a trillion edges ~ 1 year!


10

SLIDE 11

Problem #2

  • Partitioning graphs across machines is hard
  • Random partitions are very poor for real-world graphs

Twitter graph: 20X difference with 32 machines!

11

SLIDE 12

Outline

  • X-Stream (addresses problem #1)
  • SlipStream (addresses problem #2)

12

SLIDE 13

X-Stream

  • Single-machine graph processing system [SOSP’13]
  • Turns graph processing into sequential access
  • Changes the computation model
  • Partitions the graph

13

SLIDE 14

Scatter-Gather

[Figure: example graph with vertices 1–6]

Existing computational model

14

SLIDE 15

Scatter-Gather

[Figure: example graph with vertices 1–6]

Activate vertex

15

SLIDE 16

Scatter-Gather

[Figure: example graph with vertices 1–6]

Scatter Updates

16

SLIDE 17

Scatter-Gather

[Figure: example graph with vertices 1–6]

Gather Updates
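
For readers unfamiliar with the model, here is a minimal sketch of a vertex-centric scatter-gather iteration (illustrative only; the function and variable names are mine, not X-Stream's API). Note that the scatter phase needs an index from each active vertex to its out-edges, which is exactly the data-dependent random access pattern from slide 9.

```python
# Minimal vertex-centric scatter-gather sketch (illustrative; not X-Stream's API).
# out_edges is an index vertex -> neighbours, so scatter performs data-dependent lookups.

def vertex_centric_iteration(active, value, out_edges, scatter, gather):
    updates = {}                                # destination -> list of update values
    for v in active:                            # scatter: only active vertices send
        for dst in out_edges[v]:
            updates.setdefault(dst, []).append(scatter(value[v]))
    next_active = set()
    for dst, incoming in updates.items():       # gather: apply updates at destinations
        new_val = value[dst]
        for u in incoming:
            new_val = gather(new_val, u)
        if new_val != value[dst]:
            value[dst] = new_val
            next_active.add(dst)
    return next_active

# Example: BFS levels on the slides' toy graph (vertices 1-6).
out_edges = {1: [5, 6], 2: [3], 3: [5], 4: [], 5: [], 6: [2, 4]}
value = {v: float("inf") for v in out_edges}
value[1] = 0
active = {1}
while active:
    active = vertex_centric_iteration(active, value, out_edges,
                                      scatter=lambda d: d + 1, gather=min)
print(value)   # BFS distance from vertex 1
```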

17

SLIDE 18

Storage

[Figure: the example graph stored as an edge list (1 → 5, 1 → 6, 6 → 2, 6 → 4) and a vertex array (1 2 3 4 5 6)]

18

SLIDE 19

Edge File

[Figure: the edge file (1 → 5, 1 → 6, 6 → 2, 6 → 4) alongside the vertex array 1–6]

19

SLIDE 20

Edge File

[Figure: vertex-centric access jumps around the edge file (1 → 5, 1 → 6, 6 → 2, 6 → 4), causing a SEEK]

20

SLIDE 21

Edge-centric Scatter-Gather

[Figure: the edge file (1 → 5, 1 → 6, 6 → 2, 6 → 4) is read with a sequential SCAN]

Scan the entire edge list

21

SLIDE 22

Edge-centric Scatter-Gather

[Figure: the edge file is SCANned sequentially; only the necessary edges produce work]

Use only the necessary edges (see the sketch below)
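
A hedged sketch of this edge-centric variant follows (my own pseudocode, not X-Stream's actual interface): scatter streams the whole unordered edge list sequentially and emits updates only for edges whose source is active; gather then streams the resulting update list. Nothing ever seeks into the edge file.

```python
# Edge-centric scatter-gather sketch (illustrative; not X-Stream's actual interface).
# The edge list is consumed as a sequential stream: we may touch edges whose source
# is inactive (wasted work), but we never seek.

def edge_centric_iteration(active, value, edge_stream, scatter, gather):
    updates = []                                  # sequentially written update stream
    for src, dst in edge_stream:                  # SCAN the entire edge list
        if src in active:
            updates.append((dst, scatter(value[src])))
    next_active = set()
    for dst, u in updates:                        # SCAN the update stream
        new_val = gather(value[dst], u)
        if new_val != value[dst]:
            value[dst] = new_val
            next_active.add(dst)
    return next_active

# Same toy BFS as before, now driven purely by streaming the unordered edge list.
edges = [(1, 5), (1, 6), (6, 2), (6, 4), (2, 3), (3, 5)]
value = {v: float("inf") for v in range(1, 7)}
value[1] = 0
active = {1}
while active:
    active = edge_centric_iteration(active, value, edges,
                                    scatter=lambda d: d + 1, gather=min)
print(value)
```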

22

SLIDE 23

Tradeoff

✔ Achieve sequential bandwidth
✖ Need to scan the entire edge list

Winning Tradeoff!

23

SLIDE 24

Winning Tradeoff

  • Real-world graphs have small diameter
  • Traversals complete in just a few scatter-gather iterations
  • Large number of active vertices in most iterations

24

SLIDE 25

Benefit

[Figure: the edge file can be SCANned in any order]

Order oblivious

25

SLIDE 26

What about the vertices?

[Figure: the edge file is SCANned sequentially, but updates hit the vertex array (1 2 3 4 5 6) at arbitrary positions, causing SEEKs]

26

SLIDE 27

What about the vertices?

[Figure: edge file SCAN, vertex array SEEKs]

Seeking in RAM is free! How can we fit the vertices in RAM?

27

SLIDE 28

Streaming Partitions

[Figure: the edge list (1 → 5, 1 → 6, 6 → 2, 6 → 4, 2 → 3, 3 → 5) and the vertex array are split into streaming partitions]

Each partition’s vertices fit in RAM

28

SLIDE 29

Streaming Partitions

[Figure: one streaming partition’s vertices are loaded into RAM while its edges are SCANned sequentially]

Load the partition’s vertices into RAM, then SCAN its edges

29

SLIDE 30

Producing Partitions

  • No requirement on partition quality (number of cross edges)
  • Partitions need only fit into RAM
  • Random partitions work great (see the sketch below)
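
To make this concrete, here is a rough sketch of random streaming partitions (the layout and names are mine, not X-Stream's on-disk format; X-Stream also shuffles updates into the destination's partition before gathering, which is omitted here for brevity). Vertices are hashed into partitions small enough for each partition's vertex state to fit in RAM, and each edge is filed under its source's partition.

```python
# Streaming-partition sketch (illustrative layout, not X-Stream's on-disk format).
# Partition quality (cut edges) does not matter; the only requirement is that each
# partition's vertex state fits in RAM, so hashing vertex ids is good enough.

NUM_PARTITIONS = 2                       # in practice: vertex state size / RAM size

def partition_of(vertex):
    return hash(vertex) % NUM_PARTITIONS

def build_partitions(edges):
    edge_files = [[] for _ in range(NUM_PARTITIONS)]
    for src, dst in edges:               # one sequential pass over the input edges
        edge_files[partition_of(src)].append((src, dst))
    return edge_files

def scatter_pass(edge_files, value, active, scatter):
    updates = []
    for p, edge_file in enumerate(edge_files):
        # Load only partition p's vertex state into RAM, then stream its edges.
        in_ram = {v: value[v] for v in value if partition_of(v) == p}
        for src, dst in edge_file:       # sequential scan of this partition's edges
            if src in active:
                updates.append((dst, scatter(in_ram[src])))
    return updates

edge_files = build_partitions([(1, 5), (1, 6), (6, 2), (6, 4), (2, 3), (3, 5)])
print([len(f) for f in edge_files])      # edges per streaming partition
```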

30

SLIDE 31

Algorithms Supported

  • Supports traversal algorithms
  • BFS, WCC, MIS, SCC, K-Cores, SSSP, BC (WCC sketched below)
  • Supports algebraic operations on the graph
  • BP, ALS, SpMV, Pagerank
  • Good testbed for newer streaming algorithms
  • HyperANF, Semi-streaming Triangle Counting
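
As one concrete example from the list above, weakly connected components (WCC) by label propagation fits the edge-centric model directly. This is my own formulation, not code from X-Stream:

```python
# WCC via label propagation in scatter-gather form (illustrative formulation).
# Every vertex starts with its own id as a label; scatter sends the label along each
# edge in both directions (weakly connected); gather keeps the minimum label seen.
# At the fixpoint, every vertex holds the minimum vertex id of its component.

def wcc(num_vertices, edges):
    label = {v: v for v in range(1, num_vertices + 1)}
    active = set(label)
    while active:
        updates = []
        for src, dst in edges:                       # sequential edge-list scan
            if src in active:
                updates.append((dst, label[src]))
            if dst in active:                        # treat the edge as undirected
                updates.append((src, label[dst]))
        active = set()
        for v, lbl in updates:
            if lbl < label[v]:
                label[v] = lbl
                active.add(v)
    return label

print(wcc(6, [(1, 5), (1, 6), (6, 2), (6, 4), (2, 3), (3, 5)]))
# {1: 1, 2: 1, ..., 6: 1} -- the toy graph is a single component
```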

31

SLIDE 32

Competition

  • GraphChi [OSDI’12]
  • Another on-disk graph processing system
  • Special on-disk data structure: shards
  • Makes accesses look sequential
  • Producing shards requires sorting edges

32

SLIDE 33

SSD

[Chart: runtime in seconds (up to 3000 s) on SSD for Netflix/ALS, Twitter/Pagerank and RMAT27/WCC, comparing GraphChi’s sharding time alone against X-Stream’s total time]

33

SLIDE 34

More Competition

  • Applies to any two-level memory hierarchy
  • Includes CPU cache and DRAM
  • Main-memory graph processing?
  • Looked at Ligra (PPoPP 2013)

34

SLIDE 35

35

BFS

[Chart: BFS runtime in seconds (log scale, 0.1 to 100) vs. number of CPUs (1, 2, 4, 8, 16), Ligra vs. X-Stream]

SLIDE 36

36

BFS

[Chart: the same BFS comparison, now including Ligra’s setup time (log scale, 0.1 to 1000 s)]

SLIDE 37

Where we stand

[Chart: scale reached, in edges (10 billion to 1 trillion)]

Powergraph (OSDI’12), Ligra (PPoPP’13)
X-Stream (SOSP’13): 1 machine
Pregel (SIGMOD’10): 300 machines

How do we get further? Scale out

37

SLIDE 38

SlipStream

  • Aggregate bandwidth and storage of a cluster
  • Solves the graph partitioning problem
  • Rethinking storage access
  • Rethinking streaming partition execution
  • We know how to do it right for one machine

38

SLIDE 39

Scaling Out

  • Assign different streaming partitions to machines

Graph partitioning is hard to get right

39

SLIDE 40

Load Imbalance

[Figure: two machines, each running one streaming partition (SP): red and blue]

40

SLIDE 41

Load Imbalance

[Figure: one streaming partition is still running while the other machines sit IDLE]

41

SLIDE 42

Flat Storage

[Figure: streaming partitions (red, blue) over disks striped across all machines]

Stripe data across all disks; allow any machine to access any disk

✔ Balance capacity  ✔ Balance bandwidth

42

SLIDE 43

Flat Storage

[Figure: the striped disks presented to every machine as a single “Flat Storage Box”]

Stripe data across all disks; allow any machine to access any disk (a minimal sketch follows)
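
A minimal sketch of the striping idea (the mapping below is a generic round-robin layout of my own, not the system's actual block placement; the 64 MB block size is also an assumption):

```python
# Round-robin striping sketch: a logical file becomes blocks spread over every disk
# in the rack, so reading it draws on the aggregate bandwidth of all disks.
# (Illustrative mapping only; not SlipStream's actual placement function.)

BLOCK_SIZE = 64 * 1024 * 1024            # assumed block size, not from the slides

def locate(block_index, num_machines):
    """Map a logical block index to (machine, local block offset)."""
    machine = block_index % num_machines
    local_offset = block_index // num_machines
    return machine, local_offset

def blocks_for(start_byte, end_byte):
    first = start_byte // BLOCK_SIZE
    last = (end_byte - 1) // BLOCK_SIZE
    return range(first, last + 1)

# Reading 1 GB of the edge file touches blocks on many machines at once:
for b in blocks_for(0, 1 << 30):
    print(b, locate(b, num_machines=32))
```

Slides 60 to 64 show that even this deterministic placement function is unnecessary here: because streaming is order oblivious, blocks can be placed and fetched randomly instead.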

43

SLIDE 44

Flat Storage

  • Assumes a full-bisection-bandwidth network
  • Can be done at data-center scale
  • Nightingale et al., OSDI 2012, using CLOS switches
  • Already true at rack scale
  • As in our cluster

44

SLIDE 45

Flat Storage

[Figure: four streaming partition workers (red and blue partitions) over the Flat Storage Box]

45

SLIDE 46

Flat Storage

[Figure: only the red partition is left; the other machines are IDLE]

Using only half the available bandwidth

46

SLIDE 47

Extracting Parallelism

  • Edge-centric loop
  • Stream in edges/updates
  • Access vertices
  • What if we kept independent copies of the vertices on the machines? (sketched below)
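
A hedged sketch of the idea (my own formulation): each machine streams a disjoint chunk of the edge list against its own private copy of the vertex state, and the copies are reconciled later by the merge step of slide 52.

```python
# Parallel scatter sketch (illustrative): two workers each stream half of the edge
# list against an independent copy of the vertex state; no coordination is needed
# until the merge step.

def scatter_chunk(edge_chunk, my_vertex_copy, active, scatter):
    updates = []
    for src, dst in edge_chunk:                 # sequential scan of this chunk
        if src in active:
            updates.append((dst, scatter(my_vertex_copy[src])))
    return updates

edges = [(1, 5), (1, 6), (6, 2), (6, 4), (2, 3), (3, 5)]
vertex_state = {v: float("inf") for v in range(1, 7)}
vertex_state[1] = 0
copies = [dict(vertex_state), dict(vertex_state)]   # one private copy per machine
chunks = [edges[:3], edges[3:]]                     # the edge stream, split in two
all_updates = [scatter_chunk(c, copy, {1}, lambda d: d + 1)
               for c, copy in zip(chunks, copies)]
print(all_updates)
```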

47

SLIDE 48

Extracting Parallelism

[Figure: the edge-centric loop: scan the edge/update stream against the vertices, scatter/gather]

48

SLIDE 49

Scatter Step

[Figure: scatter step: scan the edges against the vertices]

49

SLIDE 50

Scatter Step

[Figure: machine 1 and machine 2 scatter in parallel, each scanning edges from the Flat Storage Box against its own copy of the vertices]

50

SLIDE 51

Gather Step

[Figure: machine 1 and machine 2 gather in parallel, each scanning updates from the Flat Storage Box into its own copy of the vertices]

51

SLIDE 52

Merge Step

[Figure: the vertex copies on machine 1 and machine 2 are merged]

Application of updates is commutative

Merge the vertex copies; no need to go to disk

52

SLIDE 53

X-Stream to SlipStream

SlipStream graph algorithms = X-Stream graph algorithms + Merge function

  • Easy to write a merge function (it looks like gather; see the sketch below)
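
A hedged sketch of what such a merge function could look like for the BFS/WCC-style examples above (my formulation, not the system's API). The key point from the slide is that applying updates is commutative, so combining per-machine copies value by value with the same operator used in gather gives the single-machine result.

```python
# Merge step sketch: reconcile per-machine vertex copies after a superstep.
# The combine operator is the same kind of commutative function used in gather
# (min for BFS/WCC, sum for PageRank-style partial contributions, and so on).

def merge(copies, combine):
    merged = dict(copies[0])
    for copy in copies[1:]:
        for v, val in copy.items():
            merged[v] = combine(merged[v], val)
    return merged

machine1 = {1: 0, 2: float("inf"), 5: 1, 6: 1}
machine2 = {1: 0, 2: 2, 5: float("inf"), 6: 1}
print(merge([machine1, machine2], combine=min))   # {1: 0, 2: 2, 5: 1, 6: 1}
```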

53

SLIDE 54

Putting it Together

[Figure: the red streaming partition being processed over the Flat Storage Box]

54

SLIDE 55

Putting it Together

[Figure: the red partition’s vertex state is copied to the idle machine]

55

SLIDE 56

Putting it Together

[Figure: both machines now process the red partition]

✔ Back to full bandwidth

56

SLIDE 57

Automatic Load Balancing

[Figure: a Flat Storage Box feeding a Compute Box]

57

SLIDE 58

Recap

  • Graph partitioning across machines is hard
  • Drop locality using flat storage
  • Make the cluster’s disks look like one disk
  • Run the same streaming partition on multiple nodes
  • Extract full bandwidth from the aggregated disks
  • A systems approach to solving an algorithms problem

58

SLIDE 59

Flat Storage

  • Distributed Storage layer for SlipStream
  • Looked at other designs
  • FDS (OSDI 2012)
  • GFS (SOSP 2003)
  • Implementing distributed storage is hard ☹

59

SLIDE 60

The Hard Bit

Store Block X

60

SLIDE 61

The Hard Bit

Where is block X?

Need a location service f: file, block → machine, offset

61

SLIDE 62

Block Location

Store block of updates

62

SLIDE 63

Block Location is Irrelevant

Give me any block of updates

Streaming is order oblivious!

63

SLIDE 64

Random Schedule

  • Replace the centralized metadata service with randomization
  • Connect to a random machine for each load/store
  • Extremely simple implementation (see the sketch below)
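
A minimal sketch of the randomized schedule (illustrative; the class and method names are mine, not SlipStream's protocol): a writer pushes each block of updates to a uniformly random machine, and a reader asks a random machine for whatever block it still holds. No metadata is needed because the streaming computation does not care which block arrives next.

```python
import random

# Randomized block schedule sketch (illustrative; not SlipStream's actual protocol).
# store(): push a block of updates to a uniformly chosen machine.
# fetch_any(): ask a random machine for *any* block it still holds; the order is
# irrelevant because streaming scatter-gather is order oblivious.

class FlatUpdateStore:
    def __init__(self, num_machines):
        self.machines = [[] for _ in range(num_machines)]

    def store(self, block):
        random.choice(self.machines).append(block)

    def fetch_any(self):
        candidates = [m for m in self.machines if m]
        return random.choice(candidates).pop() if candidates else None

store = FlatUpdateStore(num_machines=4)
for block_id in range(10):
    store.store(("updates", block_id))
while (blk := store.fetch_any()) is not None:
    print(blk)                       # blocks come back in an arbitrary order
```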

64

SLIDE 65

Downside ?

  • Can lead to collisions
  • Collisions reduce utilization

[Figure: two streaming partitions (red, blue) both draw rand() = 1 and target the same machine, colliding]

65

SLIDE 66

No Downside

  • Utilization is lower-bounded by 1 - 1/e ≈ 63% (see the check below)
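
The figure follows from a balls-into-bins argument: if each of n requesters independently picks one of n machines, the expected fraction of machines that receive at least one request is 1 - (1 - 1/n)^n, which tends to 1 - 1/e ≈ 0.632 as n grows. A quick numeric check (mine, not from the slides):

```python
import random

# Expected fraction of machines hit when n requesters each pick one of n machines
# uniformly at random: 1 - (1 - 1/n)**n  ->  1 - 1/e ~ 0.632 as n grows.
for n in (4, 32, 1024):
    analytic = 1 - (1 - 1 / n) ** n
    trials = 2000
    hit = sum(len({random.randrange(n) for _ in range(n)}) for _ in range(trials))
    print(f"n={n:5d}  analytic={analytic:.3f}  simulated={hit / (trials * n):.3f}")
```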

66

SLIDE 67

Recap

  • Building distributed storage is hard
  • An algorithms approach to solving a systems problem
  • Streaming algorithms are order oblivious
  • Randomized schedule

67

SLIDE 68

Evaluation Results

[Figure: evaluation rack of 32 machines (numbered 1 to 32), each with 32 cores, 32 GB RAM, a 200 GB SSD and a 2 TB 5200 RPM magnetic disk, connected by 10 GigE with full bisection bandwidth]
68

SLIDE 69

Scalability

  • Solve larger problems using more machines
  • Used synthetic scale-free graphs
  • Double problem size (vertices and edges)
  • Double machine count
  • Up to 32 machines, 4 billion vertices, 64 billion edges

69

SLIDE 70

Scaling RMAT (SSD)

[Chart: normalized wall time (1X to 4X) vs. number of machines (1, 2, 4, 8, 16, 32) for PR, BFS, SCC, WCC, BP, MCST, Cond., MIS, SPMV and SSSP]

32X problem size at 2.7X cost

70

SLIDE 71

Scaling RMAT (SSD)

[Chart: the same scaling data, annotated with where the extra cost goes: collisions, engineering overheads, and loss of sequentiality (roughly 0.5X, 1X and 0.5X)]

32X problem size at 2.7X cost

71

SLIDE 72

Capacity

  • Largest graph we can fit in our cluster
  • 32 billion vertices, 1 trillion edges
  • Magnetic disks
  • BFS
  • Projected seek time was ~1 year

72

SLIDE 73

Terascale

Metric    | Value
Wall time | 2d 9h
MTEPS     | 5
I/O       | 282 TB
BW        | 1.53 GB/s

Don’t need supercomputers or very large clusters
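
As a rough back-of-the-envelope consistency check of the table (my arithmetic, assuming 1 TB = 10^12 bytes):

```python
# Rough consistency check of the terascale BFS numbers (assumes 1 TB = 10^12 bytes).
wall_seconds = 2 * 86400 + 9 * 3600      # "2d 9h" = 205,200 s
edges = 1e12                             # one trillion edges
io_bytes = 282e12                        # 282 TB of I/O

print(edges / wall_seconds / 1e6)        # ~4.9 MTEPS, matching the reported ~5
print(io_bytes / wall_seconds / 1e9)     # ~1.4 GB/s, the same ballpark as 1.53 GB/s
```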

73

SLIDE 74

Terascale

Metric    | Value
Wall time | 2d 9h
MTEPS     | 5
I/O       | 282 TB
BW        | 1.53 GB/s

Results computed directly from the unordered edge list

74

SLIDE 75

SlipStream vs. Competition

System     | RAM    | Pre-process | Run
Powergraph | 128 GB | 1271 s      | 103 s
SlipStream | 32 GB  | none        | 1854 s

WCC on RMAT (128M vertices, 2B edges), 2 machines

Preprocessing your data for locality can take a lot of time!

75

SLIDE 76

Where we stand

[Chart: scale reached, in edges (10 billion to 1 trillion)]

Powergraph (OSDI’12), Ligra (PPoPP’13)
X-Stream (SOSP’13): 1 machine
Pregel (SIGMOD’10): 300 machines
SlipStream: 32 machines

How do we get further? Buy more disks :)

76

SLIDE 77

Conclusion

  • Process large graphs using secondary storage
  • Match algorithm to systems: streaming
  • Match system to algorithms: order obliviousness
  • If you can store it, you can process it

77