
SLIDE 1

CACHE-GUIDED SCHEDULING

EXPLOITING CACHES TO MAXIMIZE LOCALITY IN GRAPH PROCESSING

Anurag Mukkara, Nathan Beckmann, Daniel Sanchez

1st AGP – Toronto, Ontario – 24 June 2017

SLIDE 2

Graph processing is memory-bound

¨ Irregular structure causes seemingly random memory references
¨ On-chip caches are too small to fit most real-world graphs

50% of system energy is due to main memory

Memory bottleneck becomes more critical

[Figure: PageRank energy in nJ per edge (axis 5 to 35), split into Main Memory and Compute + Caches, for graph inputs hollywood, wikipedia, liveJournal, indochina, webbase, and uk. Panels: General-purpose system, Specialized accelerator.]

SLIDE 3

Exploiting graph structure through caches

¨ Real-world graphs have strong community structure
  ¤ Significant potential locality
  ¤ Difficult to predict ahead of time
¨ Idea: Let the cache guide scheduling!
  ¤ Cache has information about the right vertices to process next – those which cause fewest misses
¨ This work: A limit study on the benefits of cache-guided scheduling (CGS)
  ¤ CGS reduces misses by up to 6x

SLIDE 4

Impact of Scheduling on Locality

SLIDE 5

Many graph algorithms allow flexibility in schedule

¨ Schedule: Order in which vertices of the graph are processed
¨ Many important algorithms are unordered – schedule does not affect correctness
  ¤ Ex. PageRank, Collaborative Filtering, Label Propagation, Triangle Counting
¨ Schedule impacts locality significantly

SLIDE 6

Vertex-ordered schedule follows layout order

¨ Vertices are processed in the order of their id
¨ All edges of a vertex are processed consecutively
¨ Used by state-of-the-art graph processing frameworks
  ¤ Ligra, GraphMat, etc.
¨ Simplifies scheduling and parallelism
¨ Poor locality

[Figure: EdgeList stored as Offsets and Destinations arrays]
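A vertex-ordered schedule over an Offsets/Destinations edge list can be sketched as follows. This is a minimal illustration, not code from any of the frameworks named above; the function name and the tiny example graph are hypothetical.

```python
def vertex_ordered_schedule(offsets, destinations):
    """Yield (source, destination) pairs in vertex-id order.

    offsets[v] .. offsets[v + 1] delimits the edges of vertex v inside
    destinations, so all edges of a vertex are visited consecutively.
    """
    for v in range(len(offsets) - 1):
        for e in range(offsets[v], offsets[v + 1]):
            yield v, destinations[e]


# Hypothetical 4-vertex graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0, 3}, 3 -> {}
offsets = [0, 2, 3, 5, 5]
destinations = [1, 2, 2, 0, 3]

schedule = list(vertex_ordered_schedule(offsets, destinations))
print(schedule)
```

The edge list is read front to back, while the vertex data touched for each destination follows whatever ids appear in `destinations`.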

SLIDE 7

Layout order might not match community structure

Consecutive vertices in layout are spread out across the graph

[Figure: In-memory vertex layout]

SLIDE 8

Access pattern of vertex-ordered schedule

              Access pattern   Cache misses
Edge list     Streaming        Low
Vertex data   Random           High

SLIDE 9

Preprocessing changes layout for better order

Vertices in each cluster map to consecutive vertex ids

[Figure: Preprocessing]
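The relabeling step can be sketched as below, assuming a cluster assignment is already available. How the clusters are found is not specified on the slide; `cluster_of`, `relabel_by_cluster`, and the example data are hypothetical names and values.

```python
def relabel_by_cluster(cluster_of, edges):
    """Renumber vertices so that each cluster occupies consecutive ids.

    cluster_of[v] is the (assumed, precomputed) cluster of old vertex v.
    Returns the old-to-new id map and the edge list rewritten with new ids.
    """
    # Sort old ids by cluster; ties keep the original id order.
    order = sorted(range(len(cluster_of)), key=lambda v: (cluster_of[v], v))
    new_id = [0] * len(cluster_of)
    for new, old in enumerate(order):
        new_id[old] = new
    return new_id, [(new_id[u], new_id[v]) for u, v in edges]


# Hypothetical input: vertices 0, 2, 4 form cluster 0; vertices 1, 3 form cluster 1.
cluster_of = [0, 1, 0, 1, 0]
edges = [(0, 2), (2, 4), (1, 3), (0, 1)]

new_id, new_edges = relabel_by_cluster(cluster_of, edges)
print(new_id, new_edges)
```

After relabeling, edges inside a cluster connect nearby ids, so a vertex-ordered schedule touches neighboring vertex data.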

SLIDE 10

Access pattern with preprocessed graph

              Access pattern   Cache misses
Edge list     Streaming        Low
Vertex data   Good locality    Low

SLIDE 11

Preprocessing is often impractical

[Figure: Normalized runtime (axis 1 to 5) of PageRank and Preprocessing for graph inputs hol, wik, liv, ind, web, and uk, with value labels 36, 20, and 9]

Wei et al. Speedup Graph Processing by Graph Ordering, SIGMOD’16

Preprocessing is more expensive than algorithm itself

Impractical for many important use cases

SLIDE 12

Cache-guided scheduling finds good order at runtime

              Access pattern       Cache misses
Edge list     Slightly irregular   Moderate
Vertex data   Good locality        Low

SLIDE 13

Cache-Guided Scheduling Design

SLIDE 14

High-level design

[Diagram: Core 1, Core 2, Core 3, Core 4, Shared Last Level Cache, Cache Engine, and Main Memory. Connection labels: Loads, Stores, Event Notifications, Tasks, Probes.]

Query cache contents

Maintains a list of tasks ranked based on a locality metric
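One software analogue of a list of tasks ranked by a locality metric is a priority queue whose scores can be refreshed. The sketch below is an assumption about how such a list could behave, not the hardware design; `RankedWorklist` and its methods are hypothetical.

```python
import heapq


class RankedWorklist:
    """Tasks ranked by a score; higher scores are handed out first."""

    def __init__(self):
        self._heap = []   # entries are (-score, task); may contain stale entries
        self._score = {}  # latest score of every task still in the worklist

    def update(self, task, score):
        """Insert a task or refresh its score (e.g. after a cache event)."""
        self._score[task] = score
        heapq.heappush(self._heap, (-score, task))

    def pop_best(self):
        """Return the task with the highest current score, or None if empty."""
        while self._heap:
            neg_score, task = heapq.heappop(self._heap)
            # Skip entries whose score was superseded or whose task already ran.
            if self._score.get(task) == -neg_score:
                del self._score[task]
                return task
        return None


worklist = RankedWorklist()
worklist.update("a", 1)
worklist.update("b", 2)
worklist.update("a", 3)  # score of "a" rises, e.g. more of its data got cached
popped = [worklist.pop_best(), worklist.pop_best(), worklist.pop_best()]
print(popped)
```

Stale heap entries are discarded lazily at pop time, which keeps score updates cheap at the cost of extra heap space.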

SLIDE 15

Costs, benefits, and idealizations

¨ Extra memory accesses to edge list
  ¤ Filling worklist with tasks
  ¤ Keeping task scores up to date
¨ Space overheads of worklist and auxiliary metadata
  ¤ Takes away some of the available cache capacity
¨ Large reduction in memory accesses
  ¤ Better energy efficiency and performance

For this limit study we ignore these costs

SLIDE 16

Cache-Guided Scheduling of Vertices (CGS-V)

¨ Ranks and schedules each vertex of the graph
¨ Vertices ranked by fraction of neighbors that are cached

[Figure legend: Uncached Vertices, Cached Vertices]
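The ranking rule can be sketched as below. The cache is modeled as a plain set of cached vertex ids, ties go to the lowest id, and a vertex without neighbors scores 0.0; all three choices are assumptions for illustration, and `cgs_v_pick` is a hypothetical name.

```python
def cgs_v_pick(offsets, destinations, cached, pending):
    """Return the pending vertex with the largest fraction of cached neighbors."""

    def score(v):
        neighbors = destinations[offsets[v]:offsets[v + 1]]
        if not neighbors:
            return 0.0  # assumption: isolated vertices get the lowest score
        return sum(1 for n in neighbors if n in cached) / len(neighbors)

    # sorted() makes ties deterministic: max() keeps the first maximum it sees.
    return max(sorted(pending), key=score)


# Hypothetical graph: 0 -> {1, 2}, 1 -> {2, 3}, 2 -> {0, 3}, 3 -> {}
offsets = [0, 2, 4, 6, 6]
destinations = [1, 2, 2, 3, 0, 3]
cached = {2, 3}  # vertex data currently assumed to be in the cache

pick = cgs_v_pick(offsets, destinations, cached, pending={0, 1, 2})
print(pick)
```

Vertex 1 is chosen because both of its neighbors are cached, while vertices 0 and 2 each have only one of two neighbors cached.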

SLIDE 17

Cache-Guided Scheduling of Vertices (CGS-V)

¨ Large locality benefits
¨ Track vertices only (not edges)
¨ Pitfall: Real-world graphs have skewed degree distributions
  ¤ Many high-degree vertices that are connected to most of the graph
¨ Processing high-degree vertices
  ¤ Flushes the cache and kills locality
  ¤ Misses opportunities to process other beneficial regions

SLIDE 18

Cache-Guided Scheduling of Edges (CGS-E)

¨ Ranks and schedules edges instead of vertices
¨ Better locality due to finer-grained scheduling
¨ Each edge causes exactly two cache accesses
  ¤ Simpler ranking algorithm - Number of endpoints that are cached
¨ #Edges >> #Vertices → Higher tracking overheads
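A toy version of edge-granularity scheduling is sketched below: at every step the pending edge with the most cached endpoints is processed. The cache is modeled as a small fully associative LRU holding vertex ids, and the quadratic scan over pending edges ignores tracking cost, as a limit study would. The function names, the LRU policy, and the example edges are assumptions, not the simulator used in the study.

```python
from collections import OrderedDict


def touch(cache, capacity, v):
    """Access vertex v in an LRU cache; return 1 on a miss, 0 on a hit."""
    if v in cache:
        cache.move_to_end(v)
        return 0
    cache[v] = True
    if len(cache) > capacity:
        cache.popitem(last=False)  # evict the least recently used vertex
    return 1


def count_misses(order, capacity):
    """Misses when edges are processed in the given order."""
    cache = OrderedDict()
    return sum(touch(cache, capacity, u) + touch(cache, capacity, v) for u, v in order)


def cgs_e_schedule(edges, capacity):
    """Greedy order: always process the edge with the most cached endpoints."""
    cache, pending, order, misses = OrderedDict(), list(edges), [], 0
    while pending:
        # Score is 0, 1, or 2 cached endpoints; ties go to the earliest edge.
        best = max(
            range(len(pending)),
            key=lambda i: (pending[i][0] in cache) + (pending[i][1] in cache),
        )
        u, v = pending.pop(best)
        misses += touch(cache, capacity, u) + touch(cache, capacity, v)
        order.append((u, v))
    return order, misses


# Hypothetical edge list that alternates between two unrelated vertex pairs.
edges = [(0, 1), (2, 3), (1, 0), (3, 2), (0, 1), (2, 3)]
baseline_misses = count_misses(edges, capacity=2)
cgs_order, cgs_misses = cgs_e_schedule(edges, capacity=2)
print(baseline_misses, cgs_misses, cgs_order)
```

In this contrived example the greedy order finishes all edges between vertices 0 and 1 before moving on, so only the first touch of each vertex misses.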

SLIDE 19

Limit Study on Benefits of CGS

SLIDE 20

Methodology

¨ Large real-world graphs with up to 100 million vertices, 1 billion edges
¨ Graph algorithms
  ¤ PageRank – 16-byte vertex objects
  ¤ Collaborative Filtering – 256-byte vertex objects
¨ Custom cache simulator to compute main-memory accesses
  ¤ Single core system
  ¤ 2-level cache hierarchy with 32KB L1, 8MB L2
¨ See paper for details

Graph                 hol   wik   liv   ind   uk    web    nfl   yms
Vertices (Millions)   1.1   3.5   4.8   7.4   19    118    0.5   0.5
Edges (Millions)      113   45    69    194   298   1020   100   61
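A quick back-of-the-envelope check compares each graph's vertex data footprint with the L2 capacity. The sketch assumes that hol through web are the PageRank inputs and nfl and yms the Collaborative Filtering inputs (as the result slides suggest), reads "8MB" as 8 MiB, and ignores the edge list; the helper names are hypothetical.

```python
L2_BYTES = 8 * 2**20  # assumption: "8MB" read as 8 MiB

# Vertex counts in millions, copied from the methodology table.
pagerank_graphs = {"hol": 1.1, "wik": 3.5, "liv": 4.8, "ind": 7.4, "uk": 19, "web": 118}
cf_graphs = {"nfl": 0.5, "yms": 0.5}


def footprint_ratios(graphs, bytes_per_vertex):
    """Vertex data size divided by L2 capacity, per graph."""
    return {
        name: millions * 1e6 * bytes_per_vertex / L2_BYTES
        for name, millions in graphs.items()
    }


pagerank_ratios = footprint_ratios(pagerank_graphs, 16)   # 16-byte vertex objects
cf_ratios = footprint_ratios(cf_graphs, 256)              # 256-byte vertex objects

for name, ratio in {**pagerank_ratios, **cf_ratios}.items():
    print(f"{name}: vertex data is {ratio:.1f}x the L2 capacity")
```

Under these assumptions even hol, the smallest PageRank input, has vertex data about twice the size of the L2, so vertex data cannot stay resident without a schedule that exploits locality.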

SLIDE 21

Large reduction in memory accesses for PageRank

[Figure: Main memory accesses (axis 0.0 to 1.0) for Vertex-Ordered, CGS-V, and CGS-E on graph inputs hol, wik, liv, ind, web, uk, and gmean]

Memory Access Reduction
¨ CGS-V: 2.4x gmean
¨ CGS-E: 4.6x gmean
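For readers unfamiliar with the gmean column, a geometric-mean reduction can be computed from per-graph normalized access counts as sketched below. The input values are made up purely to show the arithmetic and are not results from this study; `gmean_reduction` is a hypothetical helper.

```python
import math
from statistics import geometric_mean


def gmean_reduction(normalized_accesses):
    """Geometric mean of per-graph reduction factors (1 / normalized accesses)."""
    return geometric_mean([1.0 / x for x in normalized_accesses])


# Made-up normalized access counts for two graphs (NOT measured data).
hypothetical = [0.5, 0.125]  # reductions of 2x and 8x

reduction = gmean_reduction(hypothetical)
print(f"{reduction:.1f}x gmean reduction")
```

The geometric mean of the reductions equals the reciprocal of the geometric mean of the normalized accesses, so either quantity can be read off a normalized plot.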

SLIDE 22

Much larger benefits with Collaborative Filtering

[Figure: Main memory accesses (axis 0.0 to 1.0) for Vertex-Ordered, CGS-V, and CGS-E on graph inputs nfl, yms, and gmean]

Memory Access Reduction
¨ CGS-V: 1.5x gmean
¨ CGS-E: 12x gmean

Larger vertex data – 256 bytes per vertex

Edge list accesses are negligible (3% only)

Finer-granularity scheduling of CGS-E becomes more important

SLIDE 23

CGS benefits from better graph layout

[Figure: Main memory accesses (axis 0.0 to 0.8) for Preproc., CGS-V, CGS-V + Preproc., CGS-E, and CGS-E + Preproc. on graph inputs hol, wik, liv, ind, web, uk, and gmean]

SLIDE 24

Ongoing Work: CGS Hardware Implementation

SLIDE 25

Reducing storage overheads

¨ Maintaining all vertices in the worklist is prohibitively expensive
¨ Can a small worklist capture most of the benefits?
  ¤ Order in which the worklist is filled is crucial
¨ Adding vertices in order of their id is bad
  ¤ Explores multiple disjoint regions of the graph simultaneously
¨ Insight: Explore the graph in depth-first fashion to fill the worklist
  ¤ 100 element worklist gives 50% of the benefits of CGS-E
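The difference between the two fill orders can be sketched as below: a depth-first traversal supplies candidates, and a bounded worklist takes the next few. The traversal details, the function name, and the two-path example graph are assumptions for illustration only.

```python
from itertools import islice


def dfs_fill_order(offsets, destinations):
    """Yield every vertex once, in depth-first preorder over the edge list."""
    n = len(offsets) - 1
    seen = [False] * n
    for root in range(n):
        stack = [root]
        while stack:
            v = stack.pop()
            if seen[v]:
                continue
            seen[v] = True
            yield v
            # Push neighbors in reverse so the first neighbor is visited first.
            for e in reversed(range(offsets[v], offsets[v + 1])):
                if not seen[destinations[e]]:
                    stack.append(destinations[e])


# Hypothetical graph with two disjoint regions whose ids interleave:
# region A is the path 0 - 2 - 4, region B is the path 1 - 3 - 5.
offsets = [0, 1, 2, 4, 6, 7, 8]
destinations = [2, 3, 0, 4, 1, 5, 2, 3]

WORKLIST_SIZE = 3  # hypothetical capacity
id_order_fill = list(range(WORKLIST_SIZE))
dfs_fill = list(islice(dfs_fill_order(offsets, destinations), WORKLIST_SIZE))
print(id_order_fill, dfs_fill)
```

Filling by id puts vertices from both regions into the worklist, while the depth-first fill keeps all three slots inside region A.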

SLIDE 26

Reducing processing overheads

¨ Processing each edge takes only a few instructions
  ¤ Ex. PageRank: One floating point addition per edge
  ¤ Task scheduling logic must be cheap
¨ CGS-E gives much better locality than CGS-V, but has higher overheads
¨ Practical middle ground: Each task processes a cache line of edges
  ¤ Minimizes loss of spatial locality in edge list accesses
  ¤ Sidesteps the issue of high-degree nodes
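Splitting the edge list into cache-line-sized tasks can be sketched as below. The 64-byte line and 4-byte destination ids (16 edges per task) are assumptions, since the slides do not give these sizes; `cache_line_tasks` and `source_of` are hypothetical helpers.

```python
from bisect import bisect_right

CACHE_LINE_BYTES = 64  # assumption: typical cache line size
EDGE_BYTES = 4         # assumption: 32-bit destination ids
EDGES_PER_LINE = CACHE_LINE_BYTES // EDGE_BYTES


def cache_line_tasks(num_edges, edges_per_line=EDGES_PER_LINE):
    """Split edge indices [0, num_edges) into one task per cache line of edges."""
    return [
        (start, min(start + edges_per_line, num_edges))
        for start in range(0, num_edges, edges_per_line)
    ]


def source_of(offsets, edge_index):
    """Recover the source vertex of an edge, since a task may span several vertices."""
    return bisect_right(offsets, edge_index) - 1


tasks = cache_line_tasks(40)
print(tasks)

offsets = [0, 2, 3, 5, 5]  # hypothetical graph with 5 edges
sources = [source_of(offsets, e) for e in range(5)]
print(sources)
```

A vertex with 40 edges becomes three tasks instead of one long one, and each task still reads a contiguous run of the edge list.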

SLIDE 27

Conclusion

¨ Real-world graphs have abundant locality, but it is hard to predict
¨ Cache has rich information about which regions are best to process
¨ Cache-Guided Scheduling gives large reduction in memory accesses
