Understanding and Tackling the Hidden Memory Latency for Edge-based Heterogeneous Platform - PowerPoint PPT Presentation



SLIDE 1

DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING Pervasive and Emerging Architecture Research Lab (PEARL)

Understanding and Tackling the Hidden Memory Latency for Edge-based Heterogeneous Platform

Zhendong Wang, Zhen Wang, Cong Liu, and Yang Hu

Pervasive and Emerging Architecture Research Lab, UT Dallas

Presented by Zhendong Wang at HotEdge 2020

  • Jun. 25, 2020
SLIDE 2


  • 1. Background
  • 2. Motivation and Challenges
  • 3. Proposed design
  • 4. Evaluation and Conclusion

[Slide graphic: edge intelligence on the integrated GPU; latency; data flow of data allocation (CPU), data initialization, and the GPU computation kernel]

SLIDE 3


  • 1. Background

ML/DNN enables a series of edge applications, backed by the integrated GPU.

ML/DNN models are widely deployed on the integrated CPU/GPU (iGPU) platform, but their deployment on iGPUs is stymied by:
(1) Limited memory space, e.g., TX2: 8 GB, AGX: 16 GB
(2) Stringent application latency requirements, e.g., driving automation is safety-critical and latency-sensitive

Weight, power, and size constraints impose rigorous requirements on memory footprint and processing latency for the iGPU platform.

SLIDE 4


  • 1. Background

The Unified Memory (UM) management model has relieved the situation: (1) it eases memory management and (2) it saves memory footprint. CUDA API: cudaMallocManaged(). The two models are contrasted in the sketch below.
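A minimal sketch of the two models (a toy kernel and hypothetical sizes, not the paper's benchmark code), showing how Def pays explicit copy costs while UM replaces them with a single managed allocation whose pages are mapped to the GPU when the kernel touches them:

// Hypothetical sketch: toy kernel and sizes, not the paper's benchmark code.
#include <cuda_runtime.h>
#include <cstdlib>

// Toy kernel standing in for the real computation.
__global__ void addOne(float *a, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1.0f;
}

int main() {
    const size_t n = 1 << 20, bytes = n * sizeof(float);

    // Def (copy-then-execute): separate host/device buffers and explicit copies.
    float *h = (float *)malloc(bytes);
    for (size_t i = 0; i < n; ++i) h[i] = 0.0f;          // CPU-side data initialization
    float *d;
    cudaMalloc(&d, bytes);                               // device allocation
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);     // H2D copy
    addOne<<<(n + 255) / 256, 256>>>(d, n);              // computation kernel
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);     // D2H copy
    cudaFree(d); free(h);

    // UM (unified memory): one managed allocation, no explicit copies, but pages
    // touched on the CPU must later be mapped/migrated when the GPU accesses them.
    float *m;
    cudaMallocManaged(&m, bytes);
    for (size_t i = 0; i < n; ++i) m[i] = 0.0f;          // CPU-side data initialization
    addOne<<<(n + 255) / 256, 256>>>(m, n);              // kernel pays the mapping cost
    cudaDeviceSynchronize();
    cudaFree(m);
    return 0;
}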

Is the current Unified Memory (UM) model good enough?

SLIDE 5


Limits of the current Unified Memory (UM) model – hidden latency

  • 2. Motivation

[Figure: data processing flow under the Def. and UM memory models; per-model timeline of the Alloc, Init (CPU), and Execution (GPU) phases]

Def: copy-then-execute memory model; UM: unified memory model

Autonomous driving workloads have a large matrix operation scale (M.O.S.), which motivates matrix addition and matrix multiplication as the benchmark kernels:

DNN       YOLO2   YOLO3   SSD    DAVE-2
M.O.S.    49K     81K     10K    250K

SLIDE 6


Limits of the current Unified Memory (UM) model – hidden latency

[Charts: latency breakdown for matrix addition and matrix multiplication] UM still spends excessive time on data initialization: under Def, initialization is ~50% of total latency; under UM, it is ~90%.

  • 2. Motivation
  • 1. Def: copy-then-execute memory model. Init. = data initialization; Others = H2D copy + D2H copy + kernel time
  • 2. UM: unified memory model. Init. = data initialization; Others = kernel time (no copy)

[Figure repeated from Slide 5: Def vs. UM data processing flow timeline]
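A minimal sketch of how such a per-phase breakdown could be collected under the UM model (hypothetical instrumentation; the vecAdd kernel, data size, and timer placement are assumptions, not the paper's measurement harness):

// Hypothetical instrumentation sketch, not the paper's measurement harness.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

using clk = std::chrono::high_resolution_clock;
static double ms_since(clk::time_point t0) {
    return std::chrono::duration<double, std::milli>(clk::now() - t0).count();
}

__global__ void vecAdd(const float *a, const float *b, float *c, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const size_t n = 1 << 24, bytes = n * sizeof(float);
    float *a, *b, *c;

    auto t = clk::now();                                       // Alloc phase
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    printf("Alloc : %.2f ms\n", ms_since(t));

    t = clk::now();                                            // Init phase (CPU side under UM)
    for (size_t i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }
    printf("Init  : %.2f ms\n", ms_since(t));

    t = clk::now();                                            // Others under UM = kernel time only
    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();
    printf("Others: %.2f ms\n", ms_since(t));

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}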

SLIDE 7


Limits of the current Unified Memory (UM) model – hidden latency

[Charts: kernel time for matrix addition and matrix multiplication]

  • 2. Motivation

UM also slows down the computation kernel

Observations: it is unnecessary to initialize data on the CPU side; moving initialization to the GPU can (1) save initialization latency and (2) benefit kernel and overall application response performance.

[Figure repeated from Slide 5: Def vs. UM data processing flow timeline]

Kernel time:
  • 1. Def: kernel time = kernel execution
  • 2. UM: kernel time = kernel execution + mapping latency

SLIDE 8


Enhanced Unified Memory Management (eUMM)

(1) Initializing data on the GPU side

Existing mechanism of the legacy Unified Memory management model


  • 3. Proposed design

GPU-side data initialization in eUMM
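A minimal sketch of the GPU-side initialization idea on a managed allocation (the gpuInit kernel, data size, and launch shape are illustrative assumptions, not the eUMM implementation itself):

// Illustrative sketch of GPU-side initialization; not the eUMM implementation.
#include <cuda_runtime.h>

// Initialize the managed buffer from the GPU, so its pages are first touched
// (and therefore mapped) on the GPU rather than on the CPU.
__global__ void gpuInit(float *a, size_t n, float val) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) a[i] = val;
}

int main() {
    const size_t n = 1 << 24, bytes = n * sizeof(float);
    float *m;
    cudaMallocManaged(&m, bytes);                    // unified (managed) allocation
    gpuInit<<<(n + 255) / 256, 256>>>(m, n, 0.0f);   // init on the GPU, not the CPU
    cudaDeviceSynchronize();
    // The computation kernel can now run on m with the data already resident on
    // the GPU: no CPU-side initialization latency and no mapping cost in the kernel.
    cudaFree(m);
    return 0;
}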

SLIDE 9


(2) Prefetch-enhanced GPU-Init performance


Enhanced Unified Memory Management (eUMM)

  • 3. Proposed design
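One plausible way to overlap page mapping with data initialization, sketched with cudaMemPrefetchAsync, two streams, and events; the chunking scheme and all names below are assumptions for illustration, not the eUMM implementation:

// Illustrative sketch of overlapping prefetch with GPU-side init; not the eUMM implementation.
#include <cuda_runtime.h>

__global__ void gpuInit(float *a, size_t n, float val) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) a[i] = val;
}

int main() {
    const size_t n = 1 << 24, bytes = n * sizeof(float);
    const int chunks = 4;
    const size_t cn = n / chunks, cbytes = bytes / chunks;

    int dev;
    cudaGetDevice(&dev);
    float *m;
    cudaMallocManaged(&m, bytes);

    cudaStream_t prefetchS, initS;
    cudaStreamCreate(&prefetchS);
    cudaStreamCreate(&initS);
    cudaEvent_t mapped[chunks];

    for (int c = 0; c < chunks; ++c) {
        float *p = m + c * cn;
        cudaEventCreate(&mapped[c]);
        // Asynchronously map/migrate this chunk's pages to the GPU ...
        cudaMemPrefetchAsync(p, cbytes, dev, prefetchS);
        cudaEventRecord(mapped[c], prefetchS);
        // ... and initialize the chunk once its pages are mapped; while this chunk
        // initializes on initS, the next chunk's prefetch proceeds on prefetchS.
        cudaStreamWaitEvent(initS, mapped[c], 0);
        gpuInit<<<(cn + 255) / 256, 256, 0, initS>>>(p, cn, 0.0f);
    }
    cudaStreamSynchronize(initS);
    // The computation kernel can now run on m without paying page-mapping latency.

    for (int c = 0; c < chunks; ++c) cudaEventDestroy(mapped[c]);
    cudaStreamDestroy(prefetchS);
    cudaStreamDestroy(initS);
    cudaFree(m);
    return 0;
}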
SLIDE 10


  • 4. Evaluation

Platforms: Jetson TX2, Xavier AGX. Benchmarks: matrix addition, matrix multiplication, Needleman-Wunsch (NW), random access (RA).

Results: faster data initialization, and the computation kernel is no longer slowed down.

SLIDE 11


  • 5. Conclusion

Characterization of the legacy unified memory management model:
◆Initialization latency
◆Kernel launch latency
An enhanced data initialization model based on Unified Memory management (eUMM):
◆Initializing data on the GPU side
◆Overlapping page mapping with data initialization to further reduce latency


SLIDE 12


Prospects & Future Work

Extend eUMM to a broad spectrum of workloads

◆Autonomous driving workloads (object detection, object tracking)

Reduce the inherent overhead of GPU-side data initialization

◆GPU-side data initialization does not outperform CPU-side initialization when the data size is small

GPUDirect

◆Bypass CPU to accelerate the communication between GPU and peripheral storage


SLIDE 13


Thank You


If you have any questions, please contact zhendong.wang@utdallas.edu