EXPLOITING LOCALITY IN GRAPH ANALYTICS THROUGH HARDWARE-ACCELERATED TRAVERSAL SCHEDULING
Anurag Mukkara, Nathan Beckmann, Maleen Abeydeera, Xiaosong Ma, Daniel Sanchez
MICRO 2018
The locality problem of graph processing
2
¨ Irregular structure of graphs causes seemingly random memory references
¨ On-chip caches are too small to fit most real-world graphs
¨ Software frameworks improve locality through offline preprocessing
¨ Preprocessing is expensive and often impractical
[Bar charts: GOrder vs. Baseline in main memory accesses and execution time for one PageRank iteration on the UK web graph; GOrder's preprocessing overhead alone is 3616x the cost of one PageRank iteration]
Improving locality in an online fashion
3
¨ The traversal schedule decides the order in which graph edges are processed
¨ Many real-world graphs have strong community structure
¨ Traversals that follow community structure have good locality
¨ Performing this in software without preprocessing is impractical due to scheduling overheads
[Bar charts: speedup over Baseline — software BDFS alone gives no speedup, while Baseline-HATS and BDFS-HATS reach 1.8x and 2.7x]
Contributions
4
¨ BDFS: Bounded Depth-First Scheduling
¤ Performs a series of bounded depth-first explorations
¤ Improves locality for graphs with good community structure
¨ HATS: Hardware Accelerated Traversal Scheduling
¤ A simple unit specialized for traversal scheduling
¤ Cheap and implementable in reconfigurable logic
[Bar chart: PageRank Delta on UK web graph — BDFS reduces main memory accesses by 1.8x over the baseline]
Agenda
5
¨ Background
¨ BDFS
¨ HATS
¨ Evaluation
Graph data structures
6
[Diagram: Compressed Sparse Row (CSR) format — the graph representation consists of an offset array indexing into a neighbor array; an algorithm-specific vertex data array holds per-vertex values]
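To make the CSR layout concrete, here is a small Python sketch; the graph and the vertex values are illustrative, not the slide's exact example:

```python
# Sketch of the Compressed Sparse Row (CSR) format: the offset array
# gives where each vertex's adjacency list starts in the neighbor
# array; per-vertex values live in a separate, algorithm-specific
# vertex data array. This 4-vertex graph is made up for illustration.
offsets = [0, 2, 4, 5, 6]           # vertex v's edges: neighbors[offsets[v]:offsets[v+1]]
neighbors = [1, 2, 0, 3, 0, 1]      # concatenated adjacency lists
vertex_data = [0.8, 7.9, 3.6, 1.2]  # e.g. PageRank scores

def edges_of(v):
    """Return the out-neighbors of vertex v."""
    return neighbors[offsets[v]:offsets[v + 1]]

print(edges_of(0))  # → [1, 2]
print(edges_of(3))  # → [1]
```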
Vertex-ordered (VO) schedule follows layout order
7
[Diagram: under the VO schedule, the neighbor array is streamed sequentially over time (full spatial locality, no temporal locality), while vertex data accesses have low spatial and low temporal locality — PageRank on UK web graph]
¨ Simplifies scheduling and parallelism
¨ Poor locality for vertex data accesses
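The vertex-ordered schedule amounts to a pair of nested loops over the CSR arrays. A minimal Python sketch, where `process_edge` stands in for the algorithm-specific logic:

```python
def vertex_ordered_schedule(offsets, neighbors, process_edge):
    # Vertex-ordered (VO) schedule: process edges in layout order.
    # The offset and neighbor arrays are streamed sequentially (full
    # spatial locality), but the destination vertices touched by
    # process_edge jump around (poor locality on vertex data).
    for v in range(len(offsets) - 1):
        for dst in neighbors[offsets[v]:offsets[v + 1]]:
            process_edge(v, dst)

# Example: collect the edges of a tiny illustrative graph
offsets = [0, 2, 4, 5, 6]
neighbors = [1, 2, 0, 3, 0, 1]
edges = []
vertex_ordered_schedule(offsets, neighbors, lambda s, d: edges.append((s, d)))
print(edges)  # → [(0, 1), (0, 2), (1, 0), (1, 3), (2, 0), (3, 1)]
```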
Agenda
8
¨ Background
¨ BDFS
¨ HATS
¨ Evaluation
BDFS: Bounded Depth-First Scheduling
9
¨ Vertex data accesses have high potential temporal locality
¨ Following community structure helps harness this locality
¨ BDFS performs a series of bounded depth-first explorations
¨ Traversal starts at the vertex with id 0
¨ Processes all edges of the first community before moving to the second
¨ Divide-and-conquer nature of BDFS
¤ Small depth bounds capture most locality
¤ Good locality at all cache levels
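A rough software sketch of bounded depth-first scheduling as described above; the depth bound, stack discipline, and visited-set bookkeeping here are simplifications for illustration, not the paper's exact design:

```python
def bdfs_schedule(offsets, neighbors, process_edge, depth_bound=8):
    # Bounded depth-first scheduling: starting from vertex 0, explore
    # depth-first but never deeper than depth_bound, then restart from
    # the next unexplored vertex. Every edge of an explored vertex is
    # processed, so each edge is handled exactly once.
    n = len(offsets) - 1
    explored = [False] * n
    for root in range(n):
        if explored[root]:
            continue
        explored[root] = True
        stack = [(root, 0)]
        while stack:
            v, depth = stack.pop()
            for dst in neighbors[offsets[v]:offsets[v + 1]]:
                process_edge(v, dst)
                if depth < depth_bound and not explored[dst]:
                    explored[dst] = True
                    stack.append((dst, depth + 1))

# Example on a tiny illustrative graph
offsets = [0, 2, 4, 5, 6]
neighbors = [1, 2, 0, 3, 0, 1]
seen = []
bdfs_schedule(offsets, neighbors, lambda s, d: seen.append((s, d)), depth_bound=2)
print(len(seen))  # → 6 (every edge processed exactly once)
```

Because an exploration stays inside a community before the traversal moves on, destination vertices are revisited while their data is still cached, which is why small depth bounds already capture most of the locality.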
BDFS reduces total main memory accesses
[Bar chart: PageRank on UK web graph — BDFS reduces total main memory accesses vs. VO. BDFS trades some spatial locality on the neighbor array for high temporal locality on vertex data]
BDFS in software does not improve performance
11
¨ Scheduling overheads negate the benefits of better locality
¨ Higher instruction count
¨ Limited ILP and MLP
¤ Interleaved execution of traversal scheduling and edge processing
¤ Unpredictable data-dependent branches
[Bar charts: VO vs. BDFS — BDFS reduces main memory accesses but executes far more instructions, so execution time does not improve]
Agenda
12
¨ Background
¨ BDFS
¨ HATS
¨ Evaluation
HATS: Hardware Accelerated Traversal Scheduling
13
¨ Decouples traversal scheduling from edge processing logic
¨ Small hardware unit near each core performs traversal scheduling
¨ The general-purpose core runs algorithm-specific edge processing logic
¨ HATS is decoupled from the core and runs ahead of it
[Diagram: a HATS unit sits alongside each core's L1/L2; cores share an L3 and main memory]
HATS operation and design
14
[Diagram: HATS sits next to the core and L1, issuing prefetches and its own accesses through the cache hierarchy. The core configures HATS, which fills an edge FIFO buffer. VO-HATS is a simple Scan → Fetch Offsets → Fetch Neighbors → Prefetch pipeline; BDFS-HATS adds an exploration FSM and a small stack]
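Functionally, the decoupling can be pictured as a producer/consumer pair: the HATS unit walks the CSR arrays and fills an edge FIFO, while the core pops edges and runs the algorithm-specific logic. A Python model of this split, with a generator standing in for the hardware FIFO (names and the tiny graph are illustrative):

```python
def hats_unit(offsets, neighbors):
    # Model of the HATS side: walk the graph (here in vertex order,
    # as VO-HATS would) and emit edges into the FIFO. A real unit
    # would also prefetch the destination's vertex data.
    for v in range(len(offsets) - 1):
        for dst in neighbors[offsets[v]:offsets[v + 1]]:
            yield (v, dst)

def core_loop(edge_stream, process_edge):
    # Model of the core side: a single "fetch edge" operation pops
    # the next (src, dst) pair; the core never runs traversal code.
    for src, dst in edge_stream:
        process_edge(src, dst)

# Example: compute out-degrees on a tiny illustrative graph
offsets = [0, 2, 4, 5, 6]
neighbors = [1, 2, 0, 3, 0, 1]
degree = [0, 0, 0, 0]

def count(src, dst):
    degree[src] += 1

core_loop(hats_unit(offsets, neighbors), count)
print(degree)  # → [2, 2, 1, 1]
```

Swapping the traversal schedule (e.g. to a BDFS-style producer) changes only the producer side; the core-side edge-processing loop is unchanged, which mirrors why HATS needs changes only in the graph framework, not in algorithm code.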
HATS costs
15
¨ Adds only one new instruction
¤ Fetches an edge from the FIFO buffer into core registers
¨ Very cheap and energy-efficient compared to a general-purpose core
¤ RTL synthesis with a 65nm process at a 1GHz target frequency

ASIC: area 0.4% of core, TDP 0.2% of core
FPGA: 3200 LUTs
HATS benefits
16
¨ Reduces work for the general-purpose core even under VO
¨ Enables sophisticated scheduling like BDFS
¨ Performs accurate indirect prefetching of vertex data
¨ Accelerates a wide range of algorithms
¨ Requires changes to the graph framework only, not algorithm code
Agenda
17
¨ Background
¨ BDFS
¨ HATS
¨ Evaluation
Evaluation methodology
18
¨ Event-driven simulation using zsim
¨ 16-core processor
¤ Haswell-like OOO cores
¤ 32 MB L3 cache
¤ 4 memory controllers
¨ IMP [Yu, MICRO'15]
¤ Indirect Memory Prefetcher
¤ Configured with graph data structure information for accurate prefetching
¨ 5 applications from the Ligra framework
¤ PageRank (PR)
¤ PageRank Delta (PRD)
¤ Connected Components (CC)
¤ Radii Estimation (RE)
¤ Maximal Independent Set (MIS)
¨ 5 large real-world graph inputs
¤ Millions of vertices
¤ Billions of edges
HATS improves performance significantly
19
¨ IMP improves performance by hiding latency
¨ VO-HATS outperforms IMP by offloading traversal scheduling from the general-purpose core
¨ BDFS-HATS gives further gains by reducing memory accesses

[Bar chart: speedup over VO (%) on PR, PRD, CC, RE, and MIS for IMP, VO-HATS, and BDFS-HATS]
HATS reduces both on-chip and off-chip energy
20
¨ IMP reduces static energy due to shorter execution time
¨ VO-HATS reduces core energy due to lower instruction count
¨ BDFS-HATS reduces memory energy due to better locality

[Bar chart: energy normalized to VO on PR, PRD, CC, RE, and MIS, split into off-chip (memory) and on-chip (core + cache) energy, for VO, IMP, VO-HATS, and BDFS-HATS]
See paper for more results
21
¨ HATS on an on-chip reconfigurable fabric
¤ Parallelism enhancements to maintain throughput at lower clock frequencies
¨ Sensitivity to the on-chip placement of HATS (L1, L2, LLC)
¨ Adaptive-HATS
¤ Avoids performance loss on graphs with no community structure
¨ HATS versus other locality optimizations
Conclusion
22
¨ Graph processing is bottlenecked by main memory accesses
¨ BDFS exploits community structure to improve cache locality
¨ HATS accelerates traversal scheduling to make BDFS practical