SLIDE 1

EcoRNN: Efficient Computing of LSTM RNN on GPUs

Bojian Zheng (Graduate Student), Gennady Pekhimenko (Advisor)
{bojian,pekhimenko}@cs.toronto.edu

EcoSystem Research Group, Department of Computer Science, University of Toronto
www.cs.toronto.edu/ecosystem

The 51st Annual IEEE/ACM International Symposium on Microarchitecture, 2018, Fukuoka, Japan

  • B. Zheng, G. Pekhimenko (EcoSystem)

EcoRNN MICRO 51 SRC 1 / 12

SLIDE 2

Background: Sequence Learning

Machine Translation, Speech Recognition

SLIDE 5

Background: Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN)
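For readers unfamiliar with the cell the talk targets, a minimal NumPy sketch of one LSTM time step is below. This is only an illustration of the math (the dimension names and random weights are made up, and it is not EcoRNN's GPU implementation): the four gates share one stacked matrix multiply, followed by cheap elementwise operations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W_x, W_h, b):
    """One LSTM time step. W_x: (4H, X), W_h: (4H, H), b: (4H,).
    The four gate pre-activations (input, forget, cell, output)
    are stacked along the first axis."""
    pre = W_x @ x + W_h @ h_prev + b       # (4H,) gate pre-activations
    H = h_prev.shape[0]
    i = sigmoid(pre[0 * H:1 * H])          # input gate
    f = sigmoid(pre[1 * H:2 * H])          # forget gate
    g = np.tanh(pre[2 * H:3 * H])          # candidate cell state
    o = sigmoid(pre[3 * H:4 * H])          # output gate
    c = f * c_prev + i * g                 # new cell state
    h = o * np.tanh(c)                     # new hidden state
    return h, c

rng = np.random.default_rng(0)
X, H = 8, 4                                # hypothetical toy dimensions
h, c = lstm_step(rng.standard_normal(X), np.zeros(H), np.zeros(H),
                 rng.standard_normal((4 * H, X)),
                 rng.standard_normal((4 * H, H)),
                 np.zeros(4 * H))
```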

SLIDE 7

Problem Statement: (1) Performance

✖ Default has cudaLaunch overhead. ⇐ Kernel Fusion
✖ cuDNN is closed-source, which limits innovation.

Reference: cuDNN LSTM RNN. Appleyard et al.
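The idea behind kernel fusion can be seen even in NumPy (a sketch with made-up sizes, not EcoRNN's CUDA kernels): instead of issuing one small matrix product per gate, where each would be a separate kernel launch paying cudaLaunch overhead on a GPU, stack the gate weights once and issue a single larger product.

```python
import numpy as np

rng = np.random.default_rng(1)
X, H = 64, 32
x = rng.standard_normal(X)
Ws = [rng.standard_normal((H, X)) for _ in range(4)]  # one weight per gate

# Unfused: four separate matrix-vector products; on a GPU, four kernel
# launches, each paying per-launch overhead.
unfused = [W @ x for W in Ws]

# Fused: stack the gate weights once and issue a single, larger product,
# amortizing the launch cost.
W_fused = np.vstack(Ws)                               # (4H, X)
fused = W_fused @ x                                   # one launch

assert np.allclose(np.concatenate(unfused), fused)
```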

SLIDE 9

Problem Statement: (2) Memory Capacity

[Figure: training throughput (samples/s) vs. mini-batch size for ResNet-50 (TF, MXNet, CNTK) and for NMT (TF) / Sockeye (MXNet)]

Reference: TBD: DNN Training Benchmark Suite. Zhu et al.

Training throughput of ResNet-50 saturates at large batch sizes, while that of the machine translation models increases almost linearly.
✖ RNN training is memory-capacity-bound.
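A back-of-envelope estimate shows why memory capacity caps the RNN batch size: backpropagation through time must keep feature maps alive for every time step. The sketch below uses hypothetical NMT-like dimensions (not measured numbers from the talk) and counts only the hidden and cell states, so it is a lower bound.

```python
def rnn_feature_map_bytes(batch, hidden, seq_len, layers, bytes_per_elem=4):
    """Rough lower bound on the activations an LSTM must stash for
    backpropagation through time: one hidden and one cell state per
    layer per time step (gate activations, attention maps, etc. add more)."""
    per_step = 2 * batch * hidden * bytes_per_elem   # h_t and c_t
    return per_step * seq_len * layers

# Hypothetical configuration: footprint grows linearly with both the
# mini-batch size and the sequence length.
gib = rnn_feature_map_bytes(batch=128, hidden=1024, seq_len=50, layers=4) / 2**30
print(f"{gib:.2f} GiB")
```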

SLIDE 12

EcoRNN Full Vision

EcoRNN is a new open-source implementation whose performance is comparable with, or even better than, cuDNN's. It has a smaller memory footprint and supports auto-tuning. All changes are transparent to programmers.

SLIDE 15

Preliminary Results: (1) Performance

The runtime bottleneck is the FC layers. ⇒ Data layout optimization improves the cache hit rate.

Training Throughput Comparison on the MXNet Language Modeling Benchmark
✓ Up to 2× faster than Default, and
✓ Up to 1.3× faster than cuDNN.
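Why data layout affects the cache hit rate can be illustrated without any GPU code (a NumPy sketch with made-up sizes, not EcoRNN's actual layout): the same logical matrix can be stored row-major or column-major, and only the layout whose contiguous axis matches the access pattern keeps cache lines fully utilized.

```python
import numpy as np

# Same logical matrix, two physical layouts.
H = 512
W = np.random.default_rng(2).standard_normal((4 * H, H))

row_major = np.ascontiguousarray(W)   # rows contiguous: good when a thread
                                      # walks along a row
col_major = np.asfortranarray(W)      # columns contiguous: good when a thread
                                      # walks down a column

# The values are identical; only the stride pattern differs, and the stride
# pattern is what determines cache-line utilization (and, on GPUs, memory
# coalescing across threads).
assert np.array_equal(row_major, col_major)
assert row_major.strides != col_major.strides

x = np.random.default_rng(3).standard_normal(H)
assert np.allclose(row_major @ x, col_major @ x)  # results do not change
```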

SLIDE 18

Preliminary Results: (2) Memory Capacity

Memory Consumption Profile of the Machine Translation Model: the memory bottleneck is the feature maps of the attention and RNN layers.

SLIDE 19

Future Work

Weight Parameter Reuse
Same observation made by Baidu's Persistent RNN.
✖ Inflexibility: difficult to port to new cell types and architectures. ⇐ Machine Learning Compilers (e.g., TVM, XLA)

Memory Compression ⇐ Gist (Jain et al., ISCA '18)
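As a stand-in for the memory-compression direction (this is not Gist's actual encoding, which exploits properties such as ReLU sparsity; it is just the simplest reduced-precision variant of the same idea): stash the feature maps kept for the backward pass at lower precision, halving their footprint at a small accuracy cost.

```python
import numpy as np

# Feature maps stashed for the backward pass dominate RNN training memory.
# Keep the forward pass in fp32 but stash activations in fp16.
acts = np.random.default_rng(4).standard_normal((128, 1024)).astype(np.float32)

stashed = acts.astype(np.float16)        # 2 bytes/element instead of 4
restored = stashed.astype(np.float32)    # decompress when backprop needs it

assert stashed.nbytes * 2 == acts.nbytes             # 2x footprint reduction
assert np.max(np.abs(restored - acts)) < 1e-2        # small precision loss
```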

SLIDE 23

Summary

Problem Statement
✖ Performance, ✖ Memory Capacity

Key Observations
Default suffers from cudaLaunch overhead ⇐ Kernel Fusion.
cuDNN has low cache utilization ⇐ Data Layout Optimization.

Future Work
Weight Parameter Reuse ⇐ Machine Learning Compilers.
The memory bottleneck in the machine translation model is the feature maps of the attention and RNN layers ⇐ Gist.

SLIDE 26

Backup Slide

Experimental Settings
CUDA Toolkit 8, cuDNN 6, MXNet v0.11.0.

DeepSpeech2 Training Throughput
[Figure: training throughput vs. mini-batch size for Deep Speech 2 (MXNet)]