SLIDE 1

EcoRNN: Efficient Computing of LSTM RNN on GPUs

Bojian Zheng (Graduate Student), Gennady Pekhimenko (Advisor)
{bojian,pekhimenko}@cs.toronto.edu

EcoSystem Research Group, Department of Computer Science, University of Toronto
www.cs.toronto.edu/ecosystem

The 51st Annual IEEE/ACM International Symposium on Microarchitecture, 2018, Fukuoka, Japan

  • B. Zheng, G. Pekhimenko (EcoSystem)

EcoRNN MICRO 51 SRC 1 / 12

SLIDE 2

Background: Sequence Learning

Machine Translation, Speech Recognition

SLIDE 5

Background: Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN)
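For readers unfamiliar with the cell the talk targets, a minimal NumPy sketch of one LSTM time step is below. This is only an illustration of the math (the dimension names and random weights are made up, and it is not EcoRNN's GPU implementation): the four gates share one stacked matrix multiply, followed by cheap elementwise operations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W_x, W_h, b):
    """One LSTM time step. W_x: (4H, X), W_h: (4H, H), b: (4H,).
    The four gate pre-activations (input, forget, cell, output)
    are stacked along the first axis."""
    pre = W_x @ x + W_h @ h_prev + b       # (4H,) gate pre-activations
    H = h_prev.shape[0]
    i = sigmoid(pre[0 * H:1 * H])          # input gate
    f = sigmoid(pre[1 * H:2 * H])          # forget gate
    g = np.tanh(pre[2 * H:3 * H])          # candidate cell state
    o = sigmoid(pre[3 * H:4 * H])          # output gate
    c = f * c_prev + i * g                 # new cell state
    h = o * np.tanh(c)                     # new hidden state
    return h, c

rng = np.random.default_rng(0)
X, H = 8, 4                                # hypothetical toy dimensions
h, c = lstm_step(rng.standard_normal(X), np.zeros(H), np.zeros(H),
                 rng.standard_normal((4 * H, X)),
                 rng.standard_normal((4 * H, H)),
                 np.zeros(4 * H))
```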

SLIDE 7

Problem Statement: (1) Performance

✖ Default has cudaLaunch overhead. ⇐ Kernel Fusion
✖ cuDNN is closed-source, which limits innovation.

Reference: cuDNN LSTM RNN. Appleyard et al.
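The idea behind kernel fusion can be seen even in NumPy (a sketch with made-up sizes, not EcoRNN's CUDA kernels): instead of issuing one small matrix product per gate, where each would be a separate kernel launch paying cudaLaunch overhead on a GPU, stack the gate weights once and issue a single larger product.

```python
import numpy as np

rng = np.random.default_rng(1)
X, H = 64, 32
x = rng.standard_normal(X)
Ws = [rng.standard_normal((H, X)) for _ in range(4)]  # one weight per gate

# Unfused: four separate matrix-vector products; on a GPU, four kernel
# launches, each paying per-launch overhead.
unfused = [W @ x for W in Ws]

# Fused: stack the gate weights once and issue a single, larger product,
# amortizing the launch cost.
W_fused = np.vstack(Ws)                               # (4H, X)
fused = W_fused @ x                                   # one launch

assert np.allclose(np.concatenate(unfused), fused)
```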

SLIDE 9

Problem Statement: (2) Memory Capacity

[Figure: training throughput (samples/s) vs. mini-batch size for ResNet-50 (TF, MXNet, CNTK) and for NMT (TF) / Sockeye (MXNet)]

Reference: TBD: DNN Training Benchmark Suite. Zhu et al.

Training throughput of ResNet-50 saturates at large batch sizes, while that of the machine translation models increases almost linearly.
✖ RNN training is memory-capacity-bound.
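A back-of-envelope estimate shows why memory capacity caps the RNN batch size: backpropagation through time must keep feature maps alive for every time step. The sketch below uses hypothetical NMT-like dimensions (not measured numbers from the talk) and counts only the hidden and cell states, so it is a lower bound.

```python
def rnn_feature_map_bytes(batch, hidden, seq_len, layers, bytes_per_elem=4):
    """Rough lower bound on the activations an LSTM must stash for
    backpropagation through time: one hidden and one cell state per
    layer per time step (gate activations, attention maps, etc. add more)."""
    per_step = 2 * batch * hidden * bytes_per_elem   # h_t and c_t
    return per_step * seq_len * layers

# Hypothetical configuration: footprint grows linearly with both the
# mini-batch size and the sequence length.
gib = rnn_feature_map_bytes(batch=128, hidden=1024, seq_len=50, layers=4) / 2**30
print(f"{gib:.2f} GiB")
```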

SLIDE 12

EcoRNN Full Vision

EcoRNN is a new open-source implementation whose performance is comparable with, or even better than, cuDNN's. It has a smaller memory footprint and supports auto-tuning. All changes are transparent to programmers.

SLIDE 15

Preliminary Results: (1) Performance

The runtime bottleneck is the FC layers. ⇒ Data layout optimization improves the cache hit rate.

Training Throughput Comparison on the MXNet Language Modeling Benchmark
✓ Up to 2× faster than Default, and
✓ Up to 1.3× faster than cuDNN.
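Why data layout affects the cache hit rate can be illustrated without any GPU code (a NumPy sketch with made-up sizes, not EcoRNN's actual layout): the same logical matrix can be stored row-major or column-major, and only the layout whose contiguous axis matches the access pattern keeps cache lines fully utilized.

```python
import numpy as np

# Same logical matrix, two physical layouts.
H = 512
W = np.random.default_rng(2).standard_normal((4 * H, H))

row_major = np.ascontiguousarray(W)   # rows contiguous: good when a thread
                                      # walks along a row
col_major = np.asfortranarray(W)      # columns contiguous: good when a thread
                                      # walks down a column

# The values are identical; only the stride pattern differs, and the stride
# pattern is what determines cache-line utilization (and, on GPUs, memory
# coalescing across threads).
assert np.array_equal(row_major, col_major)
assert row_major.strides != col_major.strides

x = np.random.default_rng(3).standard_normal(H)
assert np.allclose(row_major @ x, col_major @ x)  # results do not change
```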

SLIDE 18

Preliminary Results: (2) Memory Capacity

Memory Consumption Profile of the Machine Translation Model: the memory bottleneck is the feature maps of the attention and RNN layers.

SLIDE 19

Future Work

Weight Parameter Reuse
Same observation made by Baidu's Persistent RNN.
✖ Inflexibility: difficult to port to new cell types and architectures. ⇐ Machine Learning Compilers (e.g., TVM, XLA)

Memory Compression ⇐ Gist (Jain et al., ISCA '18)
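As a stand-in for the memory-compression direction (this is not Gist's actual encoding, which exploits properties such as ReLU sparsity; it is just the simplest reduced-precision variant of the same idea): stash the feature maps kept for the backward pass at lower precision, halving their footprint at a small accuracy cost.

```python
import numpy as np

# Feature maps stashed for the backward pass dominate RNN training memory.
# Keep the forward pass in fp32 but stash activations in fp16.
acts = np.random.default_rng(4).standard_normal((128, 1024)).astype(np.float32)

stashed = acts.astype(np.float16)        # 2 bytes/element instead of 4
restored = stashed.astype(np.float32)    # decompress when backprop needs it

assert stashed.nbytes * 2 == acts.nbytes             # 2x footprint reduction
assert np.max(np.abs(restored - acts)) < 1e-2        # small precision loss
```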

SLIDE 23

Summary

Problem Statement
✖ Performance, ✖ Memory Capacity

Key Observations
Default suffers from cudaLaunch overhead ⇐ Kernel Fusion.
cuDNN has low cache utilization ⇐ Data Layout Optimization.

Future Work
Weight Parameter Reuse ⇐ Machine Learning Compilers.
The memory bottleneck in the machine translation model is the feature maps of the attention and RNN layers ⇐ Gist.

SLIDE 26

Backup Slide

Experimental Settings
CUDA Toolkit 8, cuDNN 6, MXNet v0.11.0.

DeepSpeech2 Training Throughput
[Figure: training throughput vs. mini-batch size for Deep Speech 2 (MXNet)]