
SLIDE 1

AutoTM: Automatic Tensor Movement in Heterogeneous Memory Systems using Integer Linear Programming

Mark Hildebrand¹, Jawad Khan², Sanjeev Trika², Jason Lowe-Power¹, Venkatesh Akella¹

¹University of California, Davis   ²Intel Corporation

https://github.com/darchr/AutoTM
March 12, 2020

1/29

SLIDE 2

Executive Summary

Problem

Automatic two-level memory management for Deep Neural Networks

Idea

  • Profile Guided Optimization
  • Model as an Integer Linear Program (ILP)

Results

  • Replace 50-80% of DRAM with NVDIMMs with a geometric-mean performance loss of 27.1%.
  • 3x better performance than a real hardware cache.

2/29

SLIDE 3

Outline

  • Background
  • AutoTM
  • Profiling
  • ILP Modeling
  • Results
  • Wrap Up

3/29

SLIDE 4

Why Deep Neural Networks?

Can we use multiple levels of memory to train large models on a single machine?

Image: https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/

4/29

SLIDE 7

Heterogeneous Memory Systems

  • Two types of memory.
  • Same memory controller.
  • Both are byte addressable.
  • NVDIMMs for high capacity and low cost.

Figure: NVDIMM-style heterogeneous memory system.

Challenges

  • Keeping all tensors in NVDIMM memory is too slow.
  • Using DRAM as a cache for NVDIMMs is also too slow.
  • Intelligent memory management is required.

5/29

SLIDE 9

Outline

  • Background
  • AutoTM
  • Profiling
  • ILP Modeling
  • Results
  • Wrap Up

6/29

SLIDE 10

AutoTM

Goal

Minimize execution time

  • Arbitrary computation graph
  • Size constraint on fast memory

How

  • Place tensors in fast or slow memory.
  • Optimal tensor movement.

Strategy

  • Profile kernel performance.
  • Model tensor assignment as an ILP.

7/29

SLIDE 13

Kernel Profiling

Profile the performance of each kernel for all of its IO tensor locations (Data In, Weights, Data Out), measured relative to all IO in DRAM.

Table: Profile space for kernel K2, with IO tensors T1, T2, T3.

    T1     T2     T3
    DRAM   DRAM   DRAM
    DRAM   DRAM   PMM
    DRAM   PMM    DRAM
    DRAM   PMM    PMM
    PMM    DRAM   DRAM
    PMM    DRAM   PMM
    PMM    PMM    DRAM
    PMM    PMM    PMM

8/29
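The profile space grows as 2ⁿ in a kernel's number of IO tensors. A minimal Python sketch (illustrative only, not the AutoTM profiler; the function name is hypothetical) enumerates it:

```python
from itertools import product

def profile_space(tensors):
    """Enumerate every DRAM/PMM placement for the given IO tensors.

    A kernel with n IO tensors has 2**n placement configurations,
    each of which must be profiled once.
    """
    return [dict(zip(tensors, locs))
            for locs in product(["DRAM", "PMM"], repeat=len(tensors))]

# Kernel K2 from the table has IO tensors T1, T2, T3 -> 8 configurations.
configs = profile_space(["T1", "T2", "T3"])
print(len(configs))   # 8
print(configs[0])     # {'T1': 'DRAM', 'T2': 'DRAM', 'T3': 'DRAM'}
```

The enumeration order matches the table: the first row is all-DRAM and the last is all-PMM.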

SLIDE 15

Tensor Lifetime Flow Network

The path of flow through the graph describes a tensor's memory location throughout its lifetime.

9/29
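The intuition behind the flow network can be sketched with a tiny dynamic program: a single tensor's lifetime is a path of per-kernel location choices, and the cheapest path trades movement cost against kernel slowdown. This is an illustrative sketch only, not AutoTM's actual flow formulation; all costs below are invented.

```python
# Hypothetical transfer times between memories (seconds).
MOVE_COST = {("DRAM", "PMM"): 2.0, ("PMM", "DRAM"): 3.0}

def cheapest_lifetime(use_cost):
    """use_cost[k][loc] = modeled cost of kernel k when the tensor is in loc.

    Returns (total cost, chosen location at each kernel) by a shortest-path
    dynamic program over the tensor's lifetime.
    """
    locs = ("DRAM", "PMM")
    # best[loc] = (cost so far, path) for lifetimes ending with the tensor in loc
    best = {loc: (use_cost[0][loc], [loc]) for loc in locs}
    for k in range(1, len(use_cost)):
        nxt = {}
        for loc in locs:
            nxt[loc] = min(
                (cost + MOVE_COST.get((prev, loc), 0.0) + use_cost[k][loc],
                 path + [loc])
                for prev, (cost, path) in best.items()
            )
        best = nxt
    return min(best.values())

# Invented per-kernel costs: two kernels are PMM-sensitive, one is not.
costs = [{"DRAM": 1.0, "PMM": 5.0},
         {"DRAM": 1.0, "PMM": 1.0},
         {"DRAM": 1.0, "PMM": 6.0}]
total, path = cheapest_lifetime(costs)
print(total, path)
```

The real problem couples all tensors through the shared DRAM capacity, which is why AutoTM solves a joint ILP rather than independent shortest paths.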

SLIDE 22

ILP Modeling

Objective Function

Minimize total computation time:

    min  Σ_{k∈K} ρ_k  +  Σ_{t∈T} M_t

Kernel Execution Time: K is the set of kernels and ρ_k is the run time of kernel k (e.g., ρ_{k2} is the run time of kernel k2).

Tensor Movement Time: T is the set of tensors and M_t is the time spent moving tensor t (e.g., M_{t1} is the time spent moving tensor t1).

Constraints

Limit DRAM use at each kernel:

    Σ_{t∈L(k)} |t| · I^DRAM_{t,k}  ≤  Limit   ∀k

where L(k) is the set of tensors live during kernel k, |t| is the size of tensor t, and I^DRAM_{t,k} indicates that tensor t is in DRAM during kernel k.

16/29
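The objective and DRAM-limit constraint can be illustrated with a toy exhaustive search over placements. This is a sketch only: all sizes and times are invented, movement cost is omitted, and the real AutoTM model is solved as an ILP with Gurobi rather than by enumeration.

```python
from itertools import product

sizes = {"t1": 4, "t2": 3, "t3": 2}    # hypothetical tensor sizes (GB)
rho   = {"DRAM": 1.0, "PMM": 4.0}      # hypothetical kernel time by IO location
LIMIT = 6                              # DRAM capacity (GB)

best = None
for placement in product(["DRAM", "PMM"], repeat=len(sizes)):
    assign = dict(zip(sizes, placement))
    # Capacity constraint: Σ |t| · I_DRAM ≤ Limit
    dram_used = sum(sizes[t] for t, loc in assign.items() if loc == "DRAM")
    if dram_used > LIMIT:
        continue
    # Objective: Σ ρ_k (tensor movement terms omitted in this toy)
    time = sum(rho[loc] for loc in assign.values())
    if best is None or time < best[0]:
        best = (time, assign)

print(best)
```

The search puts the tensors that fit under the 6 GB limit into DRAM and leaves the rest in PMM, exactly the trade-off the ILP formalizes.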

SLIDE 29

Variations of AutoTM

Name           Description                                    PMM System   GPU System
Static         Tensors can't move                             ✓            ✗
Synchronous    Tensors move but block computation             ✓            ✓
Asynchronous   Tensor movement concurrent with computation

20/29
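The three formulations can be contrasted with a toy timing model (all numbers invented): static placement never moves tensors but runs some kernels from slow memory, synchronous movement serializes copies with compute, and asynchronous movement hides copies under compute.

```python
# Hypothetical per-kernel run times (seconds) when tensors can be staged
# into fast memory, vs. when some IO is pinned in PMM (static placement).
compute        = [5.0, 5.0, 5.0]   # kernels with IO staged into DRAM
static_compute = [5.0, 8.0, 6.0]   # same kernels with some IO stuck in PMM
moves          = [0.0, 2.0, 1.0]   # movement time scheduled around each kernel

static_time = sum(static_compute)                  # no movement at all
sync_time   = sum(compute) + sum(moves)            # copies block computation
async_time  = sum(max(c, m)                        # copies overlap computation
                  for c, m in zip(compute, moves))

print(static_time, sync_time, async_time)
```

Under these assumed numbers the ordering is static > synchronous > asynchronous, matching the motivation for the asynchronous formulation where the hardware supports it.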

SLIDE 30

Outline

  • Background
  • AutoTM
  • Profiling
  • ILP Modeling
  • Results
  • Wrap Up

21/29

SLIDE 31

Experiments!

Software

  • Modified the ngraph¹ compiler.
  • Julia's JuMP² package for ILP modeling.
  • Gurobi³ as the ILP solver.

Hardware

  • 1.5 TB Optane™ DC PMM
  • 384 GiB DRAM

Workloads

Conventional     Batchsize   Memory (GB)
Inception v4     1024        111
Vgg 19           2048        143
Resnet 200       512         132
DenseNet 264     512         115

Large            Batchsize   Memory (GB)
Inception v4     6144        659
Vgg 416          128         658
Resnet 200       2560        651
DenseNet 264     3072        688

¹ https://github.com/NervanaSystems/ngraph   ² https://github.com/JuliaOpt/JuMP.jl   ³ gurobi.com

22/29

SLIDE 34

Scaling Performance - Inception V4

Figure: Performance of Inception v4, batchsize 1024. X-axis: DRAM limit (GB), 20-120; Y-axis: slowdown (lower is better); series: synchronous AutoTM.

  • Just using PMM is too slow.
  • Best performance when the working set fits in memory.

23/29

SLIDE 39

Comparison Against 2LM

2LM: DRAM acts as a hardware-managed cache.

Figure: Speedup over 2LM (higher is better) for Vgg416 (320), Inception v4 (6144), Resnet200 (2560), and DenseNet 264 (3072); series: static-AutoTM and sync-AutoTM.

  • Avoid dirty writebacks.
  • Lower memory contention.

Software management outperforms hardware management by up to 3x.

26/29

SLIDE 43

Outline

  • Background
  • AutoTM
  • Profiling
  • ILP Modeling
  • Results
  • Wrap Up

27/29

SLIDE 44

Limitations

  • Static computation graphs.
  • Kernel profiling overhead.
  • ILP solution times.
  • ILP solutions may be hard to interpret.

Figure: ILP solution time (seconds, log scale 10¹-10³) for Vgg19, Inception v4, Resnet200, and DenseNet 264; series: static-AutoTM and sync-AutoTM.

28/29

SLIDE 46

Conclusion

AutoTM: A technique for managing tensors in heterogeneous memory systems.

  • Profiling for Kernel Performance.
  • Use ILP to optimally assign tensor location and movement.
  • Three formulations: Static, Synchronous, Asynchronous.

We show

  • Reduce DRAM requirement.
  • Significant performance improvement over hardware solutions.

Code Available: https://github.com/darchr/AutoTM

29/29

SLIDE 47

Common Questions

30/29

SLIDE 48

Asynchronous Movement on PMMs

  • Interference between DRAM and PMM traffic.
  • Low bandwidth and difficulty of DMA.
  • Kernel performance is greatly impacted by concurrent copy kernels.

31/29

SLIDE 49

GPU

Figure: Speedup over CudaMallocManaged (1-10x, higher is better) for Inception v4 (batchsize 64/128/256), Resnet200 (32/64/128), DenseNet 264 (32/64/128), and Vgg19 (64/128); series: synchronous and asynchronous.

  • Oracle

32/29

SLIDE 50

RNNs - more complex models

  • AutoTM is limited to static computation graphs.
  • RNNs have dynamic behavior (e.g., unrolling based on sequence length).
  • RNNs can be implemented statically.
  • Key ideas from AutoTM can be used for dynamic workloads.

Figure: Percent of kernel IO in DRAM vs. DRAM limit (GB, 20-120); series: input and output tensors, static and synchronous.

33/29

SLIDE 51

Concluding Conclusion

AutoTM: A technique for managing tensors in heterogeneous memory systems.

  • Profiling for Kernel Performance.
  • Use ILP to optimally assign tensor location and movement.
  • Three formulations: Static, Synchronous, Asynchronous.

We show

  • Reduce DRAM requirement.
  • Significant performance improvement over hardware solutions.

Code Available: https://github.com/darchr/AutoTM

34/29