SLIDE 1

Performance Analysis of CNN Frameworks for GPUs

Heehoon Kim†, Hyoungwook Nam†, Wookeun Jung, and Jaejin Lee
Department of Computer Science and Engineering, Seoul National University, Korea
http://aces.snu.ac.kr

†The two authors contributed equally to this work as the first authors

1

SLIDE 2

[Diagram] Software stack: Convolutional Neural Network → Deep Learning Framework → GPU Library

2

SLIDE 3

Motivation

  • Convolutional Neural Networks (CNN) have been successful in machine learning tasks such as visual recognition
  • Previous studies reveal performance differences among deep learning frameworks
  • However, those studies do not identify reasons for the differences

3

SLIDE 4

4

[Chart] AlexNet training time for Caffe, CNTK, TensorFlow, Theano, and Torch (time in ms)

SLIDE 5

Goals

  • Analyze differences in the performance characteristics of the five deep learning frameworks in a single GPU context
  • Analyze scalability of the frameworks in the multiple GPU context
  • Analyze performance characteristics of different convolution algorithms for each layer

5

SLIDE 6

Outline

  • Convolutional Neural Network
  • Deep Learning Frameworks
  • Framework Comparison
  • Multi-GPU Comparison
  • Layer-wise Analysis of Convolution Algorithms
  • Conclusions

6

SLIDE 7

Convolutional Neural Network

7

[Diagram] Inputs → conv 1 → conv 2 → … → conv n (Convolutional Feature Extractor) → fc 1 → fc 2 → … → fc n → softmax → Outputs (Fully-connected Classifier)

SLIDE 8

Computational Complexity of Convolution

8

  • C × HW × RS × K × N × 2 (multiply and add)
  • Ex) 96 × 27 × 27 × 5 × 5 × 256 × 256 × 2 = 229 Gops

Conv2 layer: C = 96 (input channel), [H, W] = [27, 27] (input dimension), [R, S] = [5, 5] (kernel dimension), K = 256 (output channel), N = 256 (batch size)
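To make the arithmetic concrete, here is a small Python sketch that recomputes the operation count for the conv2 layer from the parameters above (a multiply and an add are counted as two operations):

```python
# Operation count for one convolution layer: C*H*W*R*S*K*N multiply-adds,
# counted as 2 operations each (multiply and add).
C, H, W = 96, 27, 27   # input channels and input dimensions
R, S    = 5, 5         # kernel dimensions
K       = 256          # output channels
N       = 256          # batch size

ops = C * H * W * R * S * K * N * 2
print(f"conv2: {ops / 1e9:.0f} Gops")   # ~229 Gops
```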

SLIDE 9

Convolution Algorithms for GPU

  • Direct Convolution
  • Straightforward, but hard to optimize
  • GEMM Convolution
  • Converts convolutions into matrix multiplications
  • Easier to optimize
  • FFT Convolution
  • Reduced computational complexity
  • O(KN) (Direct convolution) → O(N log N) (FFT convolution)
  • Winograd Convolution
  • Reduces the complexity of convolution like Strassen’s algorithm
  • Specific filtering algorithm is required for each kernel dimension

9

SLIDE 10

AlexNet Model

10

  • Winner of ILSVRC 2012 (ImageNet Challenge)
  • Commonly used CNN model for benchmarking
  • Includes various kinds of layers
  • 3x3 convolution, 5x5 convolution, fully connected layers, etc.
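For reference, a minimal sketch of the commonly used AlexNet layer configuration (kernel size, stride, output channels); the exact variant benchmarked here may differ slightly:

```python
# Standard AlexNet layers (kernel, stride, output channels); a reference sketch,
# not necessarily the exact variant used in this benchmark.
alexnet_conv = [
    ("conv1", 11, 4, 96),
    ("conv2",  5, 1, 256),
    ("conv3",  3, 1, 384),
    ("conv4",  3, 1, 384),
    ("conv5",  3, 1, 256),
]
alexnet_fc = [("fc1", 4096), ("fc2", 4096), ("fc3", 1000)]
```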
SLIDE 11

Training a CNN

11

[Diagram] Training one layer: Forward (input → output), Backward Data (output gradient → input gradient), Backward Gradient (output gradient + input → weight gradient), Update Parameters (apply weight gradients)

  • 1 forward computation and 2 backward computations
  • Forward and backward computations are symmetric and have the same computational cost
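A minimal NumPy sketch of these three computations for a single fully-connected layer, followed by the parameter update; the shapes and learning rate are illustrative only, not taken from the deck:

```python
import numpy as np

N, I, O = 256, 1024, 512           # batch size, input dim, output dim (illustrative)
x  = np.random.randn(N, I)         # layer input
w  = np.random.randn(I, O) * 0.01  # layer weights
dy = np.random.randn(N, O)         # gradient w.r.t. the layer output (from the next layer)

y   = x @ w        # Forward:           input -> output
dx  = dy @ w.T     # Backward Data:     output gradient -> input gradient
dw  = x.T @ dy     # Backward Gradient: output gradient + input -> weight gradient
w  -= 0.01 * dw    # Update Parameters

# Each of the three products performs N*I*O multiply-adds,
# hence the symmetric computational cost of forward and backward.
```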

SLIDE 12

Outline

  • Convolutional Neural Network
  • Deep Learning Frameworks
  • Framework Comparison
  • Multi-GPU Comparison
  • Layer-wise Analysis of Convolution Algorithms
  • Conclusions

12

SLIDE 13

Five Deep Learning Frameworks

13

Framework    User Interface          Data Parallelism   Model Parallelism
Caffe        protobuf, C++, Python   Yes                Limited
CNTK         BrainScript, C++, C#    Yes                No
TensorFlow   Python, C++             Yes                Yes
Theano       Python                  No                 No
Torch        LuaJIT                  Yes                Yes

  • Popular frameworks chosen by GitHub stars
  • All five frameworks use cuDNN as the backend
  • Theano only supports a single GPU
SLIDE 14

cuDNN

  • Deep Neural Network library with NVIDIA CUDA
  • Provides DNN primitives
  • Convolution, pooling, normalization, activation, …
  • State-of-the-art performance
  • All five frameworks support use of cuDNN as a backend
  • Unfortunately, not open source (distributed only as binaries)

14

SLIDE 15

System Setup

CPU                2 x Intel Xeon E5-2650 @ 2.0 GHz
GPU                4 x NVIDIA Titan X (Maxwell)
Main memory        128 GB DDR3
GPU memory         4 x 12 GB GDDR5
Operating system   CentOS 7.2.1511 (Linux 3.10.0-327)

15

SLIDE 16

Outline

  • Convolutional Neural Network
  • Deep Learning Frameworks
  • Framework Comparison
  • Multi-GPU Comparison
  • Layer-wise Analysis of Convolution Algorithms
  • Conclusions

16

SLIDE 17

Execution Time Comparison (default setting)

17

  • Convolution layers take up more than 70% of training time
  • f: forward computation, b: backward computation

[Chart] Per-layer training time breakdown (conv1f ... fc3b) for Caffe, CNTK, TensorFlow, Theano, and Torch (time in ms)

SLIDE 18

Options for Convolution Algorithms

18

Framework    User Selectable   Heuristic-based   Profile-based   Default
Caffe        No                Yes               No              Heuristic-based
CNTK         No                No                Yes             Profile-based
TensorFlow   No                No                No              Heuristic-based†
Theano       Yes               Yes               Yes             GEMM
Torch        Yes               Yes               Yes             GEMM

  • The cuDNN Get API is a heuristic-based approach to choosing an algorithm
  • The cuDNN Find API is a profile-based approach to choosing an algorithm (sketched below)
  • By default, Torch and Theano use GEMM convolution

†TensorFlow uses its own heuristic algorithm
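The profile-based (Find-style) selection can be illustrated with a small sketch that times each candidate algorithm once on the actual problem size and picks the fastest. This is only a conceptual NumPy illustration, not the cuDNN API; the candidate implementations and shapes are stand-ins.

```python
import time
import numpy as np

def conv_direct(x, k):
    # Naive 1D convolution (valid mode) via an explicit loop.
    out = np.empty(len(x) - len(k) + 1)
    for i in range(len(out)):
        out[i] = np.dot(x[i:i + len(k)], k[::-1])
    return out

def conv_fft(x, k):
    # FFT-based 1D convolution, cropped to the same 'valid' length.
    n = len(x) + len(k) - 1
    full = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)
    return full[len(k) - 1:len(x)]

def find_fastest(candidates, x, k):
    # Profile every candidate on the real shapes and return the fastest one.
    timings = {}
    for name, fn in candidates.items():
        t0 = time.perf_counter()
        fn(x, k)
        timings[name] = time.perf_counter() - t0
    return min(timings, key=timings.get), timings

x, k = np.random.randn(1 << 16), np.random.randn(25)
best, timings = find_fastest({"direct": conv_direct, "fft": conv_fft}, x, k)
print(best, timings)
```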

SLIDE 19

Options for Convolution Algorithms

19

  • Up to 2x speedup by providing algorithm options

[Chart] Training time with different algorithm options: Theano, Theano(FFT), Theano(Heuristic), Theano(Profile), Torch, Torch(Profile); split into Conv Forward, FC Forward, Conv Backward, FC Backward (time in ms)

SLIDE 20

Data Layout

20

[Diagram] cuDNN kernels expect the NCHW layout; an NHWC tensor is transposed to NCHW before the cuDNN call and transposed back afterward

  • For example, cuDNN’s FFT convolution only supports NCHW
  • If the user uses another layout, TensorFlow implicitly transposes the tensors
  • Changing the layout leads to a 15% speedup in TensorFlow (see the sketch below)

[Chart] Training time of TensorFlow (default, NHWC) vs. TensorFlow (NCHW) (time in ms)
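As an illustration, the TensorFlow 1.x-era tf.nn.conv2d call takes a data_format argument, so a user can keep tensors in NCHW and avoid the implicit transposes. A minimal sketch (NCHW convolutions generally require a GPU build); the shapes are illustrative:

```python
import tensorflow as tf  # TensorFlow 1.x-style API assumed

# NHWC (TensorFlow's default): batch, height, width, channels
x_nhwc = tf.random_normal([256, 27, 27, 96])
w      = tf.random_normal([5, 5, 96, 256])          # R, S, C_in, C_out
y_nhwc = tf.nn.conv2d(x_nhwc, w, strides=[1, 1, 1, 1],
                      padding="SAME", data_format="NHWC")

# NCHW (what cuDNN kernels prefer): batch, channels, height, width
x_nchw = tf.transpose(x_nhwc, [0, 3, 1, 2])
y_nchw = tf.nn.conv2d(x_nchw, w, strides=[1, 1, 1, 1],
                      padding="SAME", data_format="NCHW")
```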

SLIDE 21

Unnecessary Backpropagation

21

[Diagram] Layers 0-3 with Forward, Backward Data, and Backward Gradient arrows; the Backward Data computation toward Layer 0's input is marked unnecessary

  • ‘Backward Data’ is unnecessary in the first layer.
  • Caffe, CNTK, Theano
  • Automatically omitted.
  • Torch
  • User option (layer0.gradInput = nil)
  • TensorFlow
  • No options to users
SLIDE 22

Unnecessary Backpropagation

22

[Chart] Training time of Torch vs. Torch (w/o first), i.e. without the first layer's backward-data computation (time in ms)

  • Speedup in the backward computation of the first layer
SLIDE 23

Optimized Results

23

  • Framework differences are not significant if carefully optimized
  • Remaining differences come from other operations, such as bias addition and ReLU activation

[Chart] Optimized training times: Caffe, CNTK, TensorFlow, TensorFlow (NCHW), Theano, Theano(Profile), Torch, Torch(Profile); split into Conv Forward, FC Forward, Conv Backward, FC Backward (time in ms)

SLIDE 24

Outline

  • Convolutional Neural Network
  • Deep Learning Frameworks
  • Framework Comparison
  • Multi-GPU Comparison
  • Layer-wise Analysis of Convolution Algorithms
  • Conclusions

24

SLIDE 25

Data-parallel SGD

25

[Diagram] Each of GPU0-GPU3 holds a replica of the CNN and processes its own batch (Batch 0-3); gradients are exchanged before each Update. Critical path: 2 log N transfers.
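A minimal NumPy sketch of the data-parallel SGD scheme above, with a linear model standing in for the CNN and the GPUs simulated in-process; the model and learning rate are illustrative:

```python
import numpy as np

def grad(w, x, y):
    # Gradient of a least-squares loss for a linear model (stand-in for CNN backprop).
    return 2 * x.T @ (x @ w - y) / len(x)

num_gpus, N, D = 4, 256, 64
w = np.zeros(D)
x, y = np.random.randn(N, D), np.random.randn(N)

# Split the global batch across the GPUs (data parallelism).
shards = list(zip(np.array_split(x, num_gpus), np.array_split(y, num_gpus)))

for step in range(10):
    local_grads = [grad(w, xs, ys) for xs, ys in shards]  # computed in parallel on each GPU
    g = np.mean(local_grads, axis=0)                      # all-reduce (the 2*log N transfer step)
    w -= 0.1 * g                                          # identical update on every replica
```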

SLIDE 26

Multi-GPU Scalability

  • With small batches, multi-GPU is worse than a single GPU
  • Even with large batches, the 4-GPU speedup is only around 1.5x

26

[Chart] Speedup vs. batch size (128, 256, 512) with 1, 2, and 4 GPUs, for Caffe, Torch, TensorFlow, and CNTK

SLIDE 27

Communication-Compute Overlapping

27

[Diagram] Without overlapping: Forward, then Backward, then all gradient transfers

  • Transfer overhead is not negligible
  • Start transferring each layer’s gradients as soon as they become available
  • TensorFlow partly does this

[Diagram] With overlapping: transfers begin once the last layer's gradients are computed and overlap with the rest of the backward pass. Forward & backward take ~200 ms with a batch size of 256; transferring the ~250 MB of gradients at ~5 GB/s takes ~45 ms.
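A rough back-of-the-envelope model of why overlapping helps; the per-layer times and gradient sizes below are hypothetical placeholders, not measurements from this work:

```python
# Hypothetical per-layer backward times (ms) and gradient sizes (MB), last layer first.
backward_ms = [10, 15, 20, 25, 30]
grad_mb     = [150, 60, 20, 10, 10]   # fc layers dominate the gradient volume
bandwidth   = 5e3 / 1e3               # ~5 GB/s -> 5 MB/ms

transfer_ms = [mb / bandwidth for mb in grad_mb]

# No overlap: all transfers wait until the whole backward pass is done.
no_overlap = sum(backward_ms) + sum(transfer_ms)

# Overlap: each layer's transfer starts once its backward computation finishes,
# and transfers are serialized on the link.
t_compute, t_link = 0.0, 0.0
for b, t in zip(backward_ms, transfer_ms):
    t_compute += b                       # backward finishes for this layer
    t_link = max(t_link, t_compute) + t  # its transfer starts as soon as possible
overlap = t_link

print(f"no overlap: {no_overlap:.0f} ms, overlapped: {overlap:.0f} ms")
```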

SLIDE 28

Reducing Amount of Data Transfer

30

[Diagram] Timeline: Forward & Backward, then gradient transfers, then the next Forward & Backward

  • Quantization methods
  • CNTK’s 1-bit SGD (1/32 of the data transferred; sketched below)
  • Avoid fully connected layers
  • 90% of parameters reside in fully-connected layers
  • Use 1x1 convolution layers instead of fully-connected layers (e.g. GoogLeNet)

[Chart] CNTK 1-bit SGD: speedup vs. batch size (128, 256, 512) with 1, 2, and 4 GPUs, reaching 2.62x
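A minimal sketch of the general idea behind 1-bit gradient quantization with error feedback (the general technique, not CNTK's actual implementation): only the sign of each gradient value plus one scale per tensor is transferred, and the quantization error is carried into the next step.

```python
import numpy as np

def one_bit_quantize(g, residual):
    # Add the error carried over from the previous step, then keep only the sign
    # (1 bit per value) and a single per-tensor scale.
    g = g + residual
    scale = np.mean(np.abs(g))
    q = np.where(g >= 0, scale, -scale)  # what actually gets transferred: signs + one scale
    residual = g - q                     # error feedback for the next iteration
    return q, residual

residual = np.zeros(1000)
for step in range(100):
    grad = np.random.randn(1000)         # stand-in for a real gradient
    q, residual = one_bit_quantize(grad, residual)
    # q (1 bit per element + one float) is sent instead of the 32-bit gradient: ~1/32 of the data.
```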

SLIDE 29

Outline

  • Convolutional Neural Network
  • Deep Learning Frameworks
  • Framework Comparison
  • Multi-GPU Comparison
  • Layer-wise Analysis of Convolution Algorithms
  • Conclusions

31

SLIDE 30

Direct Convolution Algorithm

  • Straightforward convolution algorithm (a naive sketch follows below)
  • Not supported by cuDNN, thus we use cuda-convnet3 for testing
  • Easy to implement but hard to optimize
  • cuda-convnet requires the CHWN tensor layout instead of NCHW
  • Computation times for forward and backward computations are not symmetric
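A naive direct convolution in NumPy for a single channel and a single image; loops like these are easy to write but leave open all the tiling, layout, and data-reuse decisions that make GPU kernels hard to optimize:

```python
import numpy as np

def direct_conv2d(x, k):
    # Valid-mode 2D cross-correlation (the 'convolution' used in CNNs), one channel.
    H, W = x.shape
    R, S = k.shape
    out = np.zeros((H - R + 1, W - S + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + R, j:j + S] * k)
    return out

x = np.random.randn(27, 27)
k = np.random.randn(5, 5)
y = direct_conv2d(x, k)   # shape (23, 23)
```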

32

SLIDE 31

GEMM Convolution Algorithm

33

  • Treats convolution as vector dot products within a matrix multiplication (see the im2col sketch below)
  • Forward and backward computations are symmetric
  • Efficiently optimized, but tiling inserts unnecessary computations
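A minimal im2col sketch of GEMM convolution: each receptive field is unrolled into a row, so the whole convolution becomes one matrix multiplication (with some duplicated data from overlapping tiles). Single channel, single image for brevity:

```python
import numpy as np

def im2col_conv2d(x, k):
    H, W = x.shape
    R, S = k.shape
    OH, OW = H - R + 1, W - S + 1
    # Unroll each RxS receptive field into one row of the 'column' matrix.
    cols = np.empty((OH * OW, R * S))
    for i in range(OH):
        for j in range(OW):
            cols[i * OW + j] = x[i:i + R, j:j + S].ravel()
    # One GEMM: (OH*OW, R*S) x (R*S, 1) -> all outputs at once.
    return (cols @ k.ravel()).reshape(OH, OW)

x = np.random.randn(27, 27)
k = np.random.randn(5, 5)
# Spot-check one output element against the direct definition.
assert np.isclose(im2col_conv2d(x, k)[0, 0], np.sum(x[:5, :5] * k))
```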
SLIDE 32

FFT Convolution Algorithm

  • FFT → CGEMM → inverse FFT == Convolution (sketched below)
  • In 2D convolution, the computational complexity reduces from O(HWRS) to O(HW log HW)
  • Computational cost does not depend on kernel dimension
  • cuDNN FFT convolution does not support strides
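A single-channel FFT convolution sketch in NumPy: transform, multiply pointwise (the CGEMM step in the batched, multi-channel case), transform back, and crop. The pointwise product does not grow with the kernel size:

```python
import numpy as np

def fft_conv2d(x, k):
    # Flip the kernel so the FFT product matches the cross-correlation used in CNNs.
    k = k[::-1, ::-1]
    H, W = x.shape
    R, S = k.shape
    n0, n1 = H + R - 1, W + S - 1                # linear-convolution size
    X = np.fft.rfft2(x, s=(n0, n1))
    K = np.fft.rfft2(k, s=(n0, n1))
    full = np.fft.irfft2(X * K, s=(n0, n1))      # full 2D convolution
    return full[R - 1:H, S - 1:W]                # crop to the 'valid' region

x = np.random.randn(27, 27)
k = np.random.randn(5, 5)
y = fft_conv2d(x, k)                             # shape (23, 23)
```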

34

[Chart] Kernel operation counts (Giga operations) for each convolution layer (conv1-conv5): Direct, GEMM, FFT, Winograd, Theoretical

SLIDE 33

Winograd Convolution Algorithm

  • Based on the GEMM convolution method
  • The minimal filtering algorithm for a 3x3 kernel and 4x4 tiling reduces 144 multiplications to 36 (a 4x difference); a 1D example is sketched below
  • Each kernel dimension requires its own minimal filtering algorithm
  • cuDNN 5.1 supports the Winograd algorithm for 3x3 and 5x5 convolutions with no strides
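A 1D Winograd minimal-filtering sketch, F(2,3): two outputs of a 3-tap filter computed with 4 multiplications instead of 6, following the well-known construction; the 2D 3x3 algorithms nest this idea in both dimensions.

```python
import numpy as np

def winograd_f23(d, g):
    # F(2,3): d = 4 input values, g = 3 filter taps -> 2 outputs, using 4 multiplications.
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.random.randn(4)
g = np.random.randn(3)
reference = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                      d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), reference)
```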

35

[Chart] Kernel operation counts (Giga operations) for each convolution layer (conv1-conv5): Direct, GEMM, FFT, Winograd, Theoretical

SLIDE 34

Computation Time Comparison

36

  • Direct algorithm shows poor performance on backward computations
  • FFT is the fastest algorithm most of the time

[Charts] Forward computation time, backward computation time, and conv3/4/5 forward computation time (ms), plus VRAM usage (MB), vs. batch size (32, 64, 128, 256): Direct, GEMM, FFT, Winograd

SLIDE 35

Computation Time Comparison

37

  • Direct algorithm shows poor performance on backward computations
  • FFT is the fastest algorithm most of the time
  • Winograd performs better in smaller batches and 3x3 convolutions

[Charts] Forward computation time, backward computation time, and conv3/4/5 forward computation time (ms), plus VRAM usage (MB), vs. batch size (32, 64, 128, 256): Direct, GEMM, FFT, Winograd

SLIDE 36

Computation Time Comparison

38

  • Direct algorithm shows poor performance on backward computation
  • FFT is the fastest algorithm most of the time
  • Winograd performs better in smaller batches and 3x3 convolutions
  • Memory usage differences are not significant

[Charts] Forward computation time, backward computation time, and conv3/4/5 forward computation time (ms), plus VRAM usage (MB), vs. batch size (32, 64, 128, 256): Direct, GEMM, FFT, Winograd

SLIDE 37

Layer-wise Analysis of Convolution Layers

39

  • Operation count is the primary factor for the execution time
  • Conv2 layer requires the most computations
  • Thus, FFT and Winograd are faster than Direct or GEMM

[Charts] Per-layer (conv1-conv5) kernel operation counts (Giga operations), forward computation time (ms), and backward computation time (ms): Direct, GEMM, FFT, Winograd (plus Theoretical for operation counts)

SLIDE 38

Layer-wise Analysis of Convolution Layers

40

[Chart] Kernel operation counts (Giga operations) for each convolution layer (conv1-conv5): Direct, GEMM, FFT, Winograd, Theoretical

  • Operation count is the primary factor for execution time
  • Conv2 layer requires the most computations
  • Thus, FFT and Winograd are faster than Direct or GEMM
  • Direct convolution is slow because its backward computation in the first layer is inefficient

[Charts] Backward and forward computation time (ms) for each convolution layer (conv1-conv5): Direct, GEMM, FFT, Winograd

SLIDE 39

Conclusions

  • Convolution layers take up most of the computation time while training CNN models
  • Performance differences among the frameworks are mainly due to the convolution algorithms used
  • Choosing optimal options can double the training speed of the AlexNet model
  • Tensor layout and unnecessary backpropagation might result in minor performance differences

41

SLIDE 40

Conclusions

  • The FFT convolution algorithm is the fastest most of the time because of its reduced computational complexity
  • Winograd convolution can be faster than FFT in 3x3 convolution layers with small batch sizes
  • Data parallelism is inefficient in most frameworks because of the communication cost, but some techniques might improve multi-GPU scalability

42