EXPLOITING LOCALITY IN GRAPH ANALYTICS THROUGH HARDWARE-ACCELERATED TRAVERSAL SCHEDULING
Anurag Mukkara, Nathan Beckmann, Maleen Abeydeera, Xiaosong Ma, Daniel Sanchez
MICRO 2018
The locality problem of graph processing
2
¨ Irregular structure of graphs causes seemingly random memory references
¨ On-chip caches are too small to fit most real-world graphs
¨ Software frameworks improve locality through offline preprocessing
¨ Preprocessing is expensive and often impractical
[Bar charts: GOrder vs. Baseline in main memory accesses and execution time for one PageRank iteration on the UK web graph; GOrder's preprocessing overhead alone is 3616x the cost of one PageRank iteration]
Improving locality in an online fashion
3
¨ The traversal schedule decides the order in which graph edges are processed
¨ Many real-world graphs have strong community structure
¨ Traversals that follow community structure have good locality
¨ Performing this in software without preprocessing is impractical due to scheduling overheads
[Bar charts: speedup over Baseline — software BDFS alone gives no speedup, while Baseline-HATS and BDFS-HATS reach 1.8x and 2.7x]
Contributions
4
¨ BDFS: Bounded Depth-First Scheduling
¤ Performs a series of bounded depth-first explorations
¤ Improves locality for graphs with good community structure
¨ HATS: Hardware Accelerated Traversal Scheduling
¤ A simple unit specialized for traversal scheduling
¤ Cheap and implementable in reconfigurable logic
[Bar chart: PageRank Delta on UK web graph — BDFS reduces main memory accesses by 1.8x over the baseline]
Agenda
5
¨ Background
¨ BDFS
¨ HATS
¨ Evaluation
Graph data structures
6
[Diagram: Compressed Sparse Row (CSR) format — the graph representation consists of an offset array indexing into a neighbor array; an algorithm-specific vertex data array holds per-vertex values]
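To make the CSR layout concrete, here is a small Python sketch; the graph and the vertex values are illustrative, not the slide's exact example:

```python
# Sketch of the Compressed Sparse Row (CSR) format: the offset array
# gives where each vertex's adjacency list starts in the neighbor
# array; per-vertex values live in a separate, algorithm-specific
# vertex data array. This 4-vertex graph is made up for illustration.
offsets = [0, 2, 4, 5, 6]           # vertex v's edges: neighbors[offsets[v]:offsets[v+1]]
neighbors = [1, 2, 0, 3, 0, 1]      # concatenated adjacency lists
vertex_data = [0.8, 7.9, 3.6, 1.2]  # e.g. PageRank scores

def edges_of(v):
    """Return the out-neighbors of vertex v."""
    return neighbors[offsets[v]:offsets[v + 1]]

print(edges_of(0))  # → [1, 2]
print(edges_of(3))  # → [1]
```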
Vertex-ordered (VO) schedule follows layout order
7
[Diagram: under the VO schedule, the neighbor array is streamed sequentially over time (full spatial locality, no temporal locality), while vertex data accesses have low spatial and low temporal locality — PageRank on UK web graph]
¨ Simplifies scheduling and parallelism
¨ Poor locality for vertex data accesses
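The vertex-ordered schedule amounts to a pair of nested loops over the CSR arrays. A minimal Python sketch, where `process_edge` stands in for the algorithm-specific logic:

```python
def vertex_ordered_schedule(offsets, neighbors, process_edge):
    # Vertex-ordered (VO) schedule: process edges in layout order.
    # The offset and neighbor arrays are streamed sequentially (full
    # spatial locality), but the destination vertices touched by
    # process_edge jump around (poor locality on vertex data).
    for v in range(len(offsets) - 1):
        for dst in neighbors[offsets[v]:offsets[v + 1]]:
            process_edge(v, dst)

# Example: collect the edges of a tiny illustrative graph
offsets = [0, 2, 4, 5, 6]
neighbors = [1, 2, 0, 3, 0, 1]
edges = []
vertex_ordered_schedule(offsets, neighbors, lambda s, d: edges.append((s, d)))
print(edges)  # → [(0, 1), (0, 2), (1, 0), (1, 3), (2, 0), (3, 1)]
```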
Agenda
8
¨ Background
¨ BDFS
¨ HATS
¨ Evaluation
BDFS: Bounded Depth-First Scheduling
9
¨ Vertex data accesses have high potential temporal locality
¨ Following community structure helps harness this locality
¨ BDFS performs a series of bounded depth-first explorations
¨ Traversal starts at the vertex with id 0
¨ Processes all edges of the first community before moving to the second
¨ Divide-and-conquer nature of BDFS
¤ Small depth bounds capture most locality
¤ Good locality at all cache levels
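A rough software sketch of bounded depth-first scheduling as described above; the depth bound, stack discipline, and visited-set bookkeeping here are simplifications for illustration, not the paper's exact design:

```python
def bdfs_schedule(offsets, neighbors, process_edge, depth_bound=8):
    # Bounded depth-first scheduling: starting from vertex 0, explore
    # depth-first but never deeper than depth_bound, then restart from
    # the next unexplored vertex. Every edge of an explored vertex is
    # processed, so each edge is handled exactly once.
    n = len(offsets) - 1
    explored = [False] * n
    for root in range(n):
        if explored[root]:
            continue
        explored[root] = True
        stack = [(root, 0)]
        while stack:
            v, depth = stack.pop()
            for dst in neighbors[offsets[v]:offsets[v + 1]]:
                process_edge(v, dst)
                if depth < depth_bound and not explored[dst]:
                    explored[dst] = True
                    stack.append((dst, depth + 1))

# Example on a tiny illustrative graph
offsets = [0, 2, 4, 5, 6]
neighbors = [1, 2, 0, 3, 0, 1]
seen = []
bdfs_schedule(offsets, neighbors, lambda s, d: seen.append((s, d)), depth_bound=2)
print(len(seen))  # → 6 (every edge processed exactly once)
```

Because an exploration stays inside a community before the traversal moves on, destination vertices are revisited while their data is still cached, which is why small depth bounds already capture most of the locality.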
BDFS reduces total main memory accesses
[Bar chart: PageRank on UK web graph — BDFS reduces total main memory accesses vs. VO. BDFS trades some spatial locality on the neighbor array for high temporal locality on vertex data]
BDFS in software does not improve performance
11
¨ Scheduling overheads negate the benefits of better locality
¨ Higher instruction count
¨ Limited ILP and MLP
¤ Interleaved execution of traversal scheduling and edge processing
¤ Unpredictable data-dependent branches
[Bar charts: VO vs. BDFS — BDFS reduces main memory accesses but executes far more instructions, so execution time does not improve]
Agenda
12
¨ Background
¨ BDFS
¨ HATS
¨ Evaluation
HATS: Hardware Accelerated Traversal Scheduling
13
¨ Decouples traversal scheduling from edge processing logic
¨ Small hardware unit near each core performs traversal scheduling
¨ The general-purpose core runs algorithm-specific edge processing logic
¨ HATS is decoupled from the core and runs ahead of it
[Diagram: a HATS unit sits alongside each core's L1/L2; cores share an L3 and main memory]
HATS operation and design
14
[Diagram: HATS sits next to the core and L1, issuing prefetches and its own accesses through the cache hierarchy. The core configures HATS, which fills an edge FIFO buffer. VO-HATS is a simple Scan → Fetch Offsets → Fetch Neighbors → Prefetch pipeline; BDFS-HATS adds an exploration FSM and a small stack]
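Functionally, the decoupling can be pictured as a producer/consumer pair: the HATS unit walks the CSR arrays and fills an edge FIFO, while the core pops edges and runs the algorithm-specific logic. A Python model of this split, with a generator standing in for the hardware FIFO (names and the tiny graph are illustrative):

```python
def hats_unit(offsets, neighbors):
    # Model of the HATS side: walk the graph (here in vertex order,
    # as VO-HATS would) and emit edges into the FIFO. A real unit
    # would also prefetch the destination's vertex data.
    for v in range(len(offsets) - 1):
        for dst in neighbors[offsets[v]:offsets[v + 1]]:
            yield (v, dst)

def core_loop(edge_stream, process_edge):
    # Model of the core side: a single "fetch edge" operation pops
    # the next (src, dst) pair; the core never runs traversal code.
    for src, dst in edge_stream:
        process_edge(src, dst)

# Example: compute out-degrees on a tiny illustrative graph
offsets = [0, 2, 4, 5, 6]
neighbors = [1, 2, 0, 3, 0, 1]
degree = [0, 0, 0, 0]

def count(src, dst):
    degree[src] += 1

core_loop(hats_unit(offsets, neighbors), count)
print(degree)  # → [2, 2, 1, 1]
```

Swapping the traversal schedule (e.g. to a BDFS-style producer) changes only the producer side; the core-side edge-processing loop is unchanged, which mirrors why HATS needs changes only in the graph framework, not in algorithm code.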
HATS costs
15
¨ Adds only one new instruction
¤ Fetches an edge from the FIFO buffer into core registers
¨ Very cheap and energy-efficient compared to a general-purpose core
¤ RTL synthesis with a 65nm process at a 1GHz target frequency

ASIC: area 0.4% of core, TDP 0.2% of core
FPGA: 3200 LUTs
HATS benefits
16
¨ Reduces work for the general-purpose core even under VO
¨ Enables sophisticated scheduling like BDFS
¨ Performs accurate indirect prefetching of vertex data
¨ Accelerates a wide range of algorithms
¨ Requires changes to the graph framework only, not algorithm code
Agenda
17
¨ Background
¨ BDFS
¨ HATS
¨ Evaluation
Evaluation methodology
18
¨ Event-driven simulation using zsim
¨ 16-core processor
¤ Haswell-like OOO cores
¤ 32 MB L3 cache
¤ 4 memory controllers
¨ IMP [Yu, MICRO'15]
¤ Indirect Memory Prefetcher
¤ Configured with graph data structure information for accurate prefetching
¨ 5 applications from the Ligra framework
¤ PageRank (PR)
¤ PageRank Delta (PRD)
¤ Connected Components (CC)
¤ Radii Estimation (RE)
¤ Maximal Independent Set (MIS)
¨ 5 large real-world graph inputs
¤ Millions of vertices
¤ Billions of edges
HATS improves performance significantly
19
¨ IMP improves performance by hiding latency
¨ VO-HATS outperforms IMP by offloading traversal scheduling from the general-purpose core
¨ BDFS-HATS gives further gains by reducing memory accesses

[Bar chart: speedup over VO (%) on PR, PRD, CC, RE, and MIS for IMP, VO-HATS, and BDFS-HATS]
HATS reduces both on-chip and off-chip energy
20
¨ IMP reduces static energy due to shorter execution time
¨ VO-HATS reduces core energy due to lower instruction count
¨ BDFS-HATS reduces memory energy due to better locality

[Bar chart: energy normalized to VO on PR, PRD, CC, RE, and MIS, split into off-chip (memory) and on-chip (core + cache) energy, for VO, IMP, VO-HATS, and BDFS-HATS]
See paper for more results
21
¨ HATS on an on-chip reconfigurable fabric
¤ Parallelism enhancements to maintain throughput at lower clock frequencies
¨ Sensitivity to the on-chip placement of HATS (L1, L2, LLC)
¨ Adaptive-HATS
¤ Avoids performance loss on graphs with no community structure
¨ HATS versus other locality optimizations
Conclusion
22
¨ Graph processing is bottlenecked by main memory accesses
¨ BDFS exploits community structure to improve cache locality
¨ HATS accelerates traversal scheduling to make BDFS practical