SLIDE 1

An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures

Ammar Ahmad Awan, Hari Subramoni, and Dhabaleswar K. (DK) Panda

  • Dept. of Computer Science and Engineering

The Ohio State University

Credits: pdfs.semanticscholar.org (Ammar Ahmad Awan)

SLIDE 2
  • Introduction

– CPU-based Deep Learning
– Deep Learning Frameworks

  • Research Challenges
  • Design Discussion
  • Performance Characterization
  • Conclusion


CPU-based Deep Learning is not as bad as you think!

SLIDE 3
  • NVIDIA GPUs have been the main driving force for faster training of Deep Neural Networks (DNNs)
  • GPUs: a natural fit for DL due to their throughput-oriented nature
  • GPUs are also growing in the HPC arena!


GPUs are great for Deep Learning

SLIDE 4

But what about CPUs?

  • Intel CPUs are everywhere, and many-core CPUs are emerging according to Top500.org
  • Host CPUs exist even on GPU nodes

– Many-core Xeon Phis are on the rise

  • Xeon Phi 1st generation: a many-core co-processor
  • Xeon Phi 2nd generation (KNL): a self-hosted many-core processor!
  • We usually hear that CPUs are 10x-100x slower than GPUs

– But can we do better?

SLIDE 5
  • There are several Deep Learning (DL) or DNN training frameworks

– Caffe, Cognitive Toolkit, TensorFlow, MXNet, and counting....

  • Every (almost every) framework has been optimized for NVIDIA GPUs
  • But every framework is able to execute on a CPU as well

– So why are we not using them?
– Performance has been “terrible”, and several studies have reported significant degradation when using CPUs

  • But there is hope :-)

– Coupled with the Intel Xeon Phi (Knights Landing, or KNL) and MC-DRAM, the landscape for CPU-based DL looks promising


Deep Learning Frameworks – CPUs or GPUs?

SLIDE 6
  • Caffe is a popular and widely used framework
  • NVIDIA-Caffe and BVLC-Caffe (the official Caffe) are very similar
  • Intel-Caffe is optimized for CPU-based Deep Learning
  • OSU-Caffe is a multi-node multi-GPU variant that we have worked on at OSU


The DL Framework(s) in discussion: Caffe

Caffe Variant   Multi-GPU Support   Multi-node Support   Multi-node Communication
BVLC-Caffe      Yes                 No                   N/A
NVIDIA-Caffe    Yes                 No                   N/A
Intel-Caffe     N/A                 Yes                  Intel MLSL 2017.1.016 (with Intel MPI 2017)
OSU-Caffe       Yes                 Yes                  MVAPICH2-GDR 2.2

SLIDE 7

Agenda

  • Introduction
  • Research Challenges
  • Design Discussion
  • Performance Characterization
  • Conclusion

SLIDE 8

Can we provide a holistic yet comprehensive view of DNN training performance for a diverse set of hardware architectures including Intel Xeon Phi (KNL) processors and NVIDIA Pascal GPUs?


The Key Question!

SLIDE 9
  • Introduction
  • Research Challenges
  • Design Discussion

– Caffe Architecture
– Understanding the Impact of Execution Environments

  • Performance Characterization
  • Conclusion


Agenda

SLIDE 10

[Diagram: multi-GPU training loop. Packed parameters for layers L1..Ln are broadcast (Bcast) from GPU 0 to all GPUs via packed_comm_buff; each GPU runs the forward (F) and backward (B) passes; per-GPU gradients in packed_reduce_buff are reduced (Reduce) to GPU 0, which applies the updates, and the loop repeats.]

  • 1. Data Propagation (broadcast of parameters from GPU 0)
  • 2. Forward/Backward Pass (on each GPU)
  • 3. Gradient Aggregation (reduce to GPU 0, then apply updates)

Caffe Architecture
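To make the three-step loop above concrete, here is a runnable toy sketch using mpi4py and a least-squares problem as stand-ins for MVAPICH2-GDR's CUDA-aware primitives and a real DNN; all names are illustrative and this is not OSU-Caffe's actual code.

```python
# Toy version of the broadcast / forward-backward / reduce loop sketched above.
# Run with, e.g.: mpirun -np 4 python caffe_loop_sketch.py (file name illustrative)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
ROOT = 0                                   # "GPU 0" owns the canonical params
rng = np.random.default_rng(comm.rank)
params = np.zeros(8) if comm.rank == ROOT else None

for step in range(100):
    # 1. Data propagation: broadcast packed parameters from rank 0
    params = comm.bcast(params, root=ROOT)
    # 2. Forward/backward pass on this worker's local mini-batch
    x = rng.standard_normal((32, 8))
    y = x @ np.ones(8)                     # synthetic targets
    grads = 2.0 * x.T @ (x @ params - y) / len(y)
    # 3. Gradient aggregation: reduce packed gradients to rank 0,
    #    which applies the update before the next iteration
    total = comm.reduce(grads, op=MPI.SUM, root=ROOT)
    if comm.rank == ROOT:
        params -= 0.01 * (total / comm.size)
```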

SLIDE 11

Performance is dependent on:

  • 1. Hardware Architectures

– GPUs
– Multi-/Many-core CPUs

  • 2. Software Libraries

– cuDNN (for GPUs)
– MKL-DNN/MKL 2017 (for CPUs)

  • 3. Hardware/Software co-design

– Software libraries optimized for one platform will not help the other!

Understanding the Impact of Execution Environments

[Diagram: the DL execution stack. DL applications (image recognition, speech processing, etc.) run on DL frameworks (Caffe, TensorFlow, etc.); a framework's convolution layer comes in generic, MKL-optimized, and cuDNN-optimized variants; these sit on BLAS libraries (MKL 2017, cuDNN/cuBLAS, OpenBLAS, ATLAS, and others), which in turn target the hardware (many-core GPUs such as the Pascal P100, multi-/many-core Xeon and Xeon Phi, and other processors).]

SLIDE 12
  • Introduction
  • Research Challenges
  • Design Discussion
  • Performance Characterization

– Single-node Performance
– Multi-node Performance

  • Conclusion


Agenda

SLIDE 13
  • Several GPU generations and CPU architectures
  • Single-node Results for AlexNet and ResNet-50

– Impact of the MKL engine
– Impact of MC-DRAM
– Layer-wise breakdown
– P100 vs. KNL

  • Multi-node results using Intel-Caffe and OSU-Caffe

– Weak scaling
– ResNet-50 and AlexNet


Performance Characterization

SLIDE 14

Name (Label)   Processor Architecture (Description)        No. of Cores       No. of Sockets
Haswell1       Intel Xeon CPU E5-2660 v3 @ 2.60 GHz        20 (2*10)          2
Haswell2       Intel Xeon CPU E5-2687W v3 @ 3.10 GHz       20 (2*10)          2
Broadwell      Intel Xeon CPU E5-2680 v4 @ 2.40 GHz        28 (2*14)          2
KNL            Intel Xeon Phi CPU 7250 @ 1.40 GHz          68 (1*68)          1
K40            NVIDIA Tesla K40, 11.8 GB @ 0.75 GHz        2880 CUDA cores    N/A
K80            NVIDIA Tesla K80, 11.8 GB @ 0.82 GHz        2496 CUDA cores    N/A
P100           NVIDIA Tesla P100-PCIE, 16 GB @ 1.33 GHz    3584 CUDA cores    N/A


Performance Characterization: Various Architectures

SLIDE 15
  • Comparison of the optimized MKL engine and the default Caffe engine
  • The MKL engine is up to 3X faster than the default Caffe engine
  • The biggest gains are on the many-core Intel Xeon Phi (KNL) architecture
  • Both the Haswell and Broadwell architectures also see significant speedups (up to 1.5X)

Single-node: Impact of MKL engine in Intel-Caffe

[Chart: training time (ms) of the default vs. MKL engine across CPU architectures]

SLIDE 16

Single-node: Impact of Utilizing MCDRAM

  • “MCDRAM as Cache” and “MCDRAM-All” offer very similar performance
  • “MCDRAM as Cache” was chosen for all subsequent results
  • On average, DDR-All is up to 1.5X slower than MCDRAM

SLIDE 17

Diving Deeper: Layer-wise Breakdown

[Chart: AlexNet layer-wise time (ms) for conv1-conv5 (forward pass)]

  • The full landscape for AlexNet: forward and backward pass
  • Faster convolutions → faster training
  • Most of the performance gains come from conv2 and conv3 for AlexNet

[Chart: AlexNet layer-wise time (ms) for conv1-conv5 (backward pass)]
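A layer-wise breakdown like this can be collected on any framework; below is a hedged sketch that uses PyTorch forward hooks (rather than Caffe's built-in `caffe time` tool) to time each convolution of AlexNet. All names are illustrative.

```python
# Sketch: per-layer forward timing via hooks; a rough analogue of the
# layer-wise breakdown shown above (Caffe users would run `caffe time`).
import time
import torch
import torchvision.models as models

model = models.alexnet().eval()
starts, elapsed = {}, {}

def pre_hook(name):
    return lambda module, inputs: starts.__setitem__(name, time.perf_counter())

def post_hook(name):
    return lambda module, inputs, output: elapsed.__setitem__(
        name, time.perf_counter() - starts[name])

# Attach timers to every convolution layer (conv1..conv5 in AlexNet)
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Conv2d):
        module.register_forward_pre_hook(pre_hook(name))
        module.register_forward_hook(post_hook(name))

with torch.no_grad():
    model(torch.randn(64, 3, 224, 224))   # one AlexNet-sized batch

for name, t in elapsed.items():
    print(f"{name}: {t * 1000:.1f} ms")
```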

SLIDE 18
  • Fully connected layers are much slower on KNL compared to P100
  • conv1 and conv3 also contribute to the degradation on KNL
  • conv2 is faster on KNL compared to P100

Diving Deeper: P100 vs. KNL (AlexNet)

[Chart: AlexNet layer-wise time (ms) on P100 vs. KNL-Opt; layers conv1-conv5, fc6, fc7]

SLIDE 19

Multi-node Results: ResNet-50

  • All results are weak scaling
  • Images/second is a derived metric, but it is more meaningful for understanding scalability

[Chart: ResNet-50 with Intel-Caffe, weak scaling on 2-32 nodes; training time (seconds) and images/second vs. number of nodes]
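As a minimal sketch of how the derived metric works under weak scaling, where every node processes its own batch each iteration so the global work grows with node count; the numbers below are illustrative, not the paper's measurements:

```python
def images_per_second(batch_per_node, num_nodes, iter_time_s):
    """Derived throughput under weak scaling: each node contributes a
    full batch per iteration, so the global batch grows with node count."""
    return batch_per_node * num_nodes / iter_time_s

# Illustrative only: batch size 32 per node, 16 nodes, 0.9 s per iteration
print(images_per_second(32, 16, 0.9))   # ~569 images/second
```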

SLIDE 20

Multi-node Results: AlexNet Comparison

  • OSU-Caffe vs. Intel-Caffe

– Different frameworks, so not directly comparable
– A rough comparison can still help in understanding scalability trends
– The design of a framework can affect performance for distributed training

  • MPI (or the communication runtime) can cause a marked difference

[Charts: AlexNet weak scaling for OSU-Caffe (GPU) vs. Intel-Caffe (CPU); images per second and training time (seconds) vs. number of nodes (1-32)]

SLIDE 21

Agenda

  • Introduction
  • Research Challenges
  • Design Discussion
  • Performance Characterization
  • Conclusion

SLIDE 22

Conclusion


  • CPUs are very comparable to GPUs for DNN training workloads if appropriate optimizations are exploited
  • GPUs are still faster than CPUs in general
  • KNL beats P100 in one case, but P100 beats KNL in most cases
  • Evaluating the performance of a DL framework:

– The hardware architecture matters
– But the software stack has a more significant impact than the hardware
– The full execution environment and communication runtime need to be evaluated to ensure fair comparisons

SLIDE 23

Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters

Arpan Jain, Ammar Ahmad Awan, Quentin Anthony, Hari Subramoni, and Dhabaleswar K. (DK) Panda

  • Dept. of Computer Science and Engineering

The Ohio State University

Credits: http://nowlab.cse.ohio-state.edu/static/media/talks (Arpan Jain)

SLIDE 24
  • Introduction
  • Background
  • Research Challenges
  • Characterization Strategy

– Evaluation Platforms and Software Libraries
– Experimental Setup

  • Performance Evaluation
  • Conclusion


Agenda

SLIDE 25
  • Easily implement and experiment with Deep Neural Networks

– Several Deep Learning (DL) frameworks have emerged

  • Caffe, PyTorch, TensorFlow, MXNet, and counting....

– Focus on TensorFlow and PyTorch

  • Most frameworks are optimized for NVIDIA GPUs

– but CPU-optimized implementations are also emerging, as we saw in the previous paper


Deep Learning Frameworks

SLIDE 26
  • The most widely used framework open-sourced by Google
  • Replaced Google’s DistBelief framework
  • Runs on almost all available execution platforms (CPU, GPU, TPU, mobile, etc.)

  • https://github.com/tensorflow/tensorflow

Deep Learning and TensorFlow

SLIDE 27
  • Introduction
  • Background
  • Research Challenges
  • Characterization Strategy

– Evaluation Platforms and Software Libraries
– Experimental Setup

  • Performance Evaluation
  • Conclusion


Agenda

SLIDE 28
  • Deep Neural Network training consists of two phases

– Forward pass
– Backward pass

  • Two approaches to distributing DNN training

– Data Parallelism (focus of this paper)
– Model Parallelism

[Diagrams: Data Parallelism and Model Parallelism]

Distributed DNN Training

SLIDE 29
  • Most ML/DL frameworks started with a single-node/single-GPU design

– Various multi-node design schemes have emerged since then!

  • Distributed training needs communication libraries to synchronize across nodes
  • DL Frameworks

– Caffe
– TensorFlow and PyTorch with Horovod (focus of this paper)

  • Communication Libraries for DL

– MPI libraries: MVAPICH2, Intel MPI, Open MPI
– NVIDIA NCCL (GPU only)


DL Frameworks and Communication Libraries
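For context, this is roughly what Horovod-based data-parallel training looks like from the framework side; a minimal PyTorch sketch with a toy model, assuming a working Horovod install (the real experiments use the benchmark scripts named later):

```python
# Minimal Horovod + PyTorch data-parallel loop (toy model, synthetic data).
# Launch one process per CPU/GPU, e.g.: mpirun -np 4 python train_sketch.py
import torch
import horovod.torch as hvd

hvd.init()

model = torch.nn.Linear(8, 1)              # stand-in for a real DNN
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Wrap the optimizer so gradients are averaged across workers via Allreduce
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())
# Make every worker start from identical parameters
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

for step in range(100):
    x = torch.randn(32, 8)                 # each rank sees its own data shard
    y = torch.randn(32, 1)
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()                        # triggers the gradient Allreduce
    optimizer.step()
```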

SLIDE 30

What is Allreduce, and how do DL frameworks use it?

  • A generic group communication pattern: an element-wise vector sum made available to all participants in the group
  • In the MPI world, we call it MPI_Allreduce
  • Needed in DNN training during gradient aggregation from different workers

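A minimal sketch of the pattern with mpi4py, standing in for what a DL framework does with its gradient buffers (the array contents are illustrative):

```python
# Each rank contributes its local "gradients"; after MPI_Allreduce every
# rank holds the element-wise sum, which is then averaged for the update.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
local_grads = np.full(4, float(comm.rank))   # stand-in gradient vector
summed = np.empty_like(local_grads)
comm.Allreduce(local_grads, summed, op=MPI.SUM)
avg_grads = summed / comm.size               # identical on every rank
print(comm.rank, avg_grads)
```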

SLIDE 31
  • Introduction
  • Background
  • Research Challenges
  • Characterization Strategy

– Evaluation Platforms and Software Libraries
– Experimental Setup

  • Performance Evaluation
  • Conclusion


Agenda

SLIDE 32

How do we systematically characterize CPU-based DNN training using TensorFlow and PyTorch at scale? And how do we achieve the best possible performance on different HPC systems?


Broad Challenge

SLIDE 33

Key Contributions


  • Describe single-process (SP), multi-process (MP), and multi-node (MN) approaches
  • Highlight up to 1.47X better performance for the MP approach over the SP approach
  • Evaluate five DNN architectures at scale (128 Xeon Skylake nodes)
  • Report a 125X speedup on 128 nodes for ResNet-152 with MVAPICH2
  • Summarize key insights gained from the systematic characterization

SLIDE 34
  • Introduction
  • Background
  • Research Challenges
  • Characterization Strategy

– Evaluation Platforms and Software Libraries
– Experimental Setup

  • Performance Evaluation
  • Conclusion


Agenda

SLIDE 35

Architecture   Cluster       Speed (GHz)   Cores   Threads per Core   Label
Skylake        RI2           2.6           28      1                  Skylake-1
Skylake        Pitzer        2.4           40      1                  Skylake-2
Skylake        Stampede2     2.1           48      2                  Skylake-3
Broadwell      RI2           2.4           28      1                  Broadwell
EPYC           AMD-Cluster   2.0           32      4                  EPYC


Evaluation Platforms

GPU    Cluster   Cores
K80    RI2       4992 CUDA cores (dual socket)
P100   Owens     3584 CUDA cores
V100   Pitzer    5120 CUDA cores, 640 Tensor cores
SLIDE 36
  • Deep Learning Frameworks

– Intel-optimized TensorFlow (v1.12) -- details on the next slide
– TensorFlow v1.12 (for GPUs and AMD processors)
– PyTorch (v1.1)

  • Horovod distributed-training middleware
  • MPI library: MVAPICH2
  • Scripts: tf_cnn_benchmarks and Horovod’s pytorch_synthetic_benchmarks


Software Libraries

SLIDE 37
  • Optimized by Intel for Intel Xeon CPUs
  • Uses Math Kernel Library for Deep Neural Networks (MKL-DNN) primitives
  • Can be installed easily using conda and pip
  • https://github.com/Intel-tensorflow


Intel Optimized TensorFlow
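A quick, hedged way to confirm that an MKL-enabled build is actually in use; this relies on a TF 1.x-era internal commonly referenced at the time, which is an assumption here rather than a public API:

```python
# Verify the installed TensorFlow build and whether MKL-DNN is enabled.
import tensorflow as tf
from tensorflow.python import pywrap_tensorflow  # TF 1.x internal (assumed)

print(tf.__version__)                    # e.g., 1.12.0
print(pywrap_tensorflow.IsMklEnabled())  # True on Intel-optimized builds
```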

SLIDE 38
  • Introduction
  • Background
  • Research Challenges
  • Characterization Strategy

– Evaluation Platforms and Software Libraries
– Experimental Setup

  • Performance Evaluation
  • Conclusion


Agenda

SLIDE 39

Four different types of experiments were performed

  • 1. Single Node Single Process (SP) Experiments
  • 2. Single Node Multi-Process (MP) Experiments
  • 3. Multi-Node Multi-Process (MN) Experiments
  • 4. GPU vs. CPU Comparisons


Experimental Setup

SLIDE 40
  • Introduction
  • Background
  • Research Challenges
  • Characterization Strategy

– Evaluation Platforms and Software Libraries
– Experimental Setup

  • Performance Evaluation
  • Conclusion


Agenda

SLIDE 41

Single Node Single Process (SP) Experiments

ResNet-50 Training performance

  • Different configurations lead to different performance trends
  • Key message: processes per node (PPN), batch size, and the number of threads are tunable parameters
  • Parameters need to be determined and tuned properly!
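As one concrete example of these knobs, the TF 1.x session API (which tf_cnn_benchmarks exposes as command-line flags) controls the thread counts; a minimal sketch with illustrative values:

```python
# Thread-level tuning knobs for CPU-based TF 1.x training (values illustrative).
import tensorflow as tf

config = tf.ConfigProto(
    intra_op_parallelism_threads=28,  # threads within one op, e.g. a convolution
    inter_op_parallelism_threads=2)   # independent ops that may run concurrently
sess = tf.Session(config=config)
```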

SLIDE 42

SP: Effect of Hyper-Threading

ResNet-50 Training performance

  • Skylake-3 on Stampede2 is hyper-threaded (two threads per core)
  • It is possible to run TF on 96 threads
  • But performance degrades beyond 48 threads – Why?
  • It depends on the size and type of the DNN

SLIDE 43

Single Node Multi-Process (MP) Experiments

ResNet-152 Training performance

  • BS=64: 4 ppn is better
  • BS=32: 8 ppn is slightly better
  • However, keeping the effective batch size (EBS) low is more important! – Why? DNNs do not converge to state-of-the-art accuracy when the batch size is large. (EBS = per-process batch size X number of processes; e.g., BS=32 at 8 ppn and BS=64 at 4 ppn both give EBS = 256.)

ResNet-152 (SP vs. MP)

  • MP is better for all effective batch sizes
  • Up to 1.35X better performance for MP compared to SP for BS=64

SLIDE 44
  • We use the best SP configuration to run Multi-node experiments
  • Evaluate five models to identify common trends

– All models give near-linear scaling on both platforms

Multi-Node Multi-Process (MN) Experiments

[Charts: weak scaling on Skylake-1 (28 cores) and Skylake-2 (40 cores)]

SLIDE 45

Multi-Node Multi-Process (MN): MP vs. SP?

Skylake-3 (48 cores, 96 threads)

  • Scale: 32 nodes
  • MP-Tuned: up to 1.5X better than SP
  • MP-Tuned: 10% better than MP-Default
  • Why is MP-Tuned better?

– It uses the best possible number of inter-op and intra-op threads

SLIDE 46

Multi-Node Multi-Process (MN): TF vs. PyTorch

TensorFlow

  • TensorFlow: up to a 125X speedup for ResNet-152 on 128 nodes

PyTorch

  • This is an early experience with PyTorch
  • PyTorch scales well but has overall lower performance than TensorFlow
  • TensorFlow is up to 2.5X faster than PyTorch on 128 nodes

SLIDE 47

Multi-Node Multi-Process (MN): AMD Platform

EPYC for TensorFlow

  • TensorFlow is 4X slower on EPYC compared to Skylake-3
  • For EPYC, there is no optimized TensorFlow

EPYC for PyTorch

  • PyTorch does better than TensorFlow
  • Up to 19% better than TensorFlow on 8 nodes

SLIDE 48

TensorFlow and PyTorch: CPU vs. GPU

TensorFlow on GPUs vs. CPUs

  • Inception-v4: Skylake-3 is up to 2.35X faster than K80s
  • ResNet-101: V100s are up to 3.32X faster than Skylake-3

Multi-Node: TensorFlow (TF) vs. PyTorch (PT)

  • ResNet-50: PT is slightly better than TF
  • ResNet-152: PT is up to 12% better than TF
SLIDE 49
  • Introduction
  • Background
  • Research Challenges
  • Characterization Strategy

– Evaluation Platforms and Software Libraries
– Experimental Setup

  • Performance Evaluation
  • Conclusion


Agenda

SLIDE 50
  • In-depth characterization of distributed training with TensorFlow, and early results for PyTorch

– Experiments on five HPC clusters, including Stampede2, and three different CPU architectures: Skylake, Broadwell, and AMD EPYC
– Single Node Single Process (SP) and Single Node Multi-Process (MP) experiments to determine the best single-node performance
– The best single-node configuration is used for multi-node experiments
– Up to 128 nodes to show DNN training scaling
– GPU vs. CPU comparisons for both TensorFlow and PyTorch

  • Guidelines for DL researchers to get the best performance on CPU platforms


Conclusion