SLIDE 1

An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures

Ammar Ahmad Awan, Hari Subramoni, and Dhabaleswar K. (DK) Panda

  • Dept. of Computer Science and Engineering

The Ohio State University

Credits: pdfs.semanticscholar.org (Ammar Ahmad Awan)

SLIDE 2
  • Introduction

– CPU-based Deep Learning
– Deep Learning Frameworks

  • Research Challenges
  • Design Discussion
  • Performance Characterization
  • Conclusion


CPU-based Deep Learning is not as bad as you think!

SLIDE 3
  • NVIDIA GPUs have been the main driving force for faster training of Deep Neural Networks (DNNs)
  • GPUs: a natural fit for DL due to their throughput-oriented nature
  • GPUs are also growing in the HPC arena!


GPUs are great for Deep Learning

SLIDE 4

But what about CPUs?

  • Intel CPUs are everywhere, and many-core CPUs are emerging according to Top500.org
  • Host CPUs exist even on GPU nodes

– Many-core Xeon Phis are on the rise

  • Xeon Phi 1st generation: a many-core co-processor
  • Xeon Phi 2nd generation (KNL): a self-hosted many-core processor!
  • We usually hear that CPUs are 10x-100x slower than GPUs

– But can we do better?

SLIDE 5
  • There are several Deep Learning (DL) or DNN training frameworks

– Caffe, Cognitive Toolkit, TensorFlow, MXNet, and counting....

  • Every (almost every) framework has been optimized for NVIDIA GPUs
  • But every framework is able to execute on a CPU as well

– So why are we not using them?
– Performance has been “terrible”, and several studies have reported significant degradation when using CPUs

  • But there is hope :-)

– Coupled with the Intel Xeon Phi (Knights Landing, or KNL) and MC-DRAM, the landscape for CPU-based DL looks promising


Deep Learning Frameworks – CPUs or GPUs?

SLIDE 6
  • Caffe is a popular and widely used framework
  • NVIDIA-Caffe and BVLC-Caffe (the official Caffe) are very similar
  • Intel-Caffe is optimized for CPU-based Deep Learning
  • OSU-Caffe is a multi-node multi-GPU variant that we have worked on at OSU


The DL Framework(s) in discussion: Caffe

Caffe Variant   Multi-GPU Support   Multi-node Support   Multi-node Communication
BVLC-Caffe      Yes                 No                   N/A
NVIDIA-Caffe    Yes                 No                   N/A
Intel-Caffe     N/A                 Yes                  Intel MLSL 2017.1.016 (with Intel MPI 2017)
OSU-Caffe       Yes                 Yes                  MVAPICH2-GDR 2.2

SLIDE 7

Agenda

  • Introduction
  • Research Challenges
  • Design Discussion
  • Performance Characterization
  • Conclusion

SLIDE 8

Can we provide a holistic yet comprehensive view of DNN training performance for a diverse set of hardware architectures including Intel Xeon Phi (KNL) processors and NVIDIA Pascal GPUs?


The Key Question!

SLIDE 9
  • Introduction
  • Research Challenges
  • Design Discussion

– Caffe Architecture
– Understanding the Impact of Execution Environments

  • Performance Characterization
  • Conclusion


Agenda

SLIDE 10

[Diagram: multi-GPU training loop. Packed parameters for layers L1..Ln are broadcast (Bcast) from GPU 0 to all GPUs via packed_comm_buff; each GPU runs the forward (F) and backward (B) passes; per-GPU gradients in packed_reduce_buff are reduced (Reduce) to GPU 0, which applies the updates, and the loop repeats.]

  • 1. Data Propagation (broadcast of parameters from GPU 0)
  • 2. Forward/Backward Pass (on each GPU)
  • 3. Gradient Aggregation (reduce to GPU 0, then apply updates)

Caffe Architecture
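To make the three-step loop above concrete, here is a runnable toy sketch using mpi4py and a least-squares problem as stand-ins for MVAPICH2-GDR's CUDA-aware primitives and a real DNN; all names are illustrative and this is not OSU-Caffe's actual code.

```python
# Toy version of the broadcast / forward-backward / reduce loop sketched above.
# Run with, e.g.: mpirun -np 4 python caffe_loop_sketch.py (file name illustrative)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
ROOT = 0                                   # "GPU 0" owns the canonical params
rng = np.random.default_rng(comm.rank)
params = np.zeros(8) if comm.rank == ROOT else None

for step in range(100):
    # 1. Data propagation: broadcast packed parameters from rank 0
    params = comm.bcast(params, root=ROOT)
    # 2. Forward/backward pass on this worker's local mini-batch
    x = rng.standard_normal((32, 8))
    y = x @ np.ones(8)                     # synthetic targets
    grads = 2.0 * x.T @ (x @ params - y) / len(y)
    # 3. Gradient aggregation: reduce packed gradients to rank 0,
    #    which applies the update before the next iteration
    total = comm.reduce(grads, op=MPI.SUM, root=ROOT)
    if comm.rank == ROOT:
        params -= 0.01 * (total / comm.size)
```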

SLIDE 11

Performance is dependent on:

  • 1. Hardware Architectures

– GPUs
– Multi-/Many-core CPUs

  • 2. Software Libraries

– cuDNN (for GPUs)
– MKL-DNN/MKL 2017 (for CPUs)

  • 3. Hardware/Software co-design

– Software libraries optimized for one platform will not help the other!

Understanding the Impact of Execution Environments

[Diagram: the DL execution stack. DL applications (image recognition, speech processing, etc.) run on DL frameworks (Caffe, TensorFlow, etc.); a framework's convolution layer comes in generic, MKL-optimized, and cuDNN-optimized variants; these sit on BLAS libraries (MKL 2017, cuDNN/cuBLAS, OpenBLAS, ATLAS, and others), which in turn target the hardware (many-core GPUs such as the Pascal P100, multi-/many-core Xeon and Xeon Phi, and other processors).]

SLIDE 12
  • Introduction
  • Research Challenges
  • Design Discussion
  • Performance Characterization

– Single-node Performance
– Multi-node Performance

  • Conclusion


Agenda

SLIDE 13
  • Several GPU generations and CPU architectures
  • Single-node Results for AlexNet and ResNet-50

– Impact of the MKL engine
– Impact of MC-DRAM
– Layer-wise breakdown
– P100 vs. KNL

  • Multi-node results using Intel-Caffe and OSU-Caffe

– Weak scaling
– ResNet-50 and AlexNet


Performance Characterization

SLIDE 14

Name (Label)   Processor Architecture (Description)        No. of Cores       No. of Sockets
Haswell1       Intel Xeon CPU E5-2660 v3 @ 2.60 GHz        20 (2*10)          2
Haswell2       Intel Xeon CPU E5-2687W v3 @ 3.10 GHz       20 (2*10)          2
Broadwell      Intel Xeon CPU E5-2680 v4 @ 2.40 GHz        28 (2*14)          2
KNL            Intel Xeon Phi CPU 7250 @ 1.40 GHz          68 (1*68)          1
K40            NVIDIA Tesla K40, 11.8 GB @ 0.75 GHz        2880 CUDA cores    N/A
K80            NVIDIA Tesla K80, 11.8 GB @ 0.82 GHz        2496 CUDA cores    N/A
P100           NVIDIA Tesla P100-PCIE, 16 GB @ 1.33 GHz    3584 CUDA cores    N/A


Performance Characterization: Various Architectures

SLIDE 15
  • Comparison of the optimized MKL engine and the default Caffe engine
  • The MKL engine is up to 3X faster than the default Caffe engine
  • The biggest gains are on the many-core Intel Xeon Phi (KNL) architecture
  • Both the Haswell and Broadwell architectures also see significant speedups (up to 1.5X)

Single-node: Impact of MKL engine in Intel-Caffe

[Chart: training time (ms) of the default vs. MKL engine across CPU architectures]

SLIDE 16

Single-node: Impact of Utilizing MCDRAM

  • “MCDRAM as Cache” and “MCDRAM-All” offer very similar performance
  • “MCDRAM as Cache” was chosen for all subsequent results
  • On average, DDR-All is up to 1.5X slower than MCDRAM

SLIDE 17

Diving Deeper: Layer-wise Breakdown

[Chart: AlexNet layer-wise time (ms) for conv1-conv5 (forward pass)]

  • The full landscape for AlexNet: forward and backward pass
  • Faster convolutions → faster training
  • Most of the performance gains come from conv2 and conv3 for AlexNet

[Chart: AlexNet layer-wise time (ms) for conv1-conv5 (backward pass)]
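A layer-wise breakdown like this can be collected on any framework; below is a hedged sketch that uses PyTorch forward hooks (rather than Caffe's built-in `caffe time` tool) to time each convolution of AlexNet. All names are illustrative.

```python
# Sketch: per-layer forward timing via hooks; a rough analogue of the
# layer-wise breakdown shown above (Caffe users would run `caffe time`).
import time
import torch
import torchvision.models as models

model = models.alexnet().eval()
starts, elapsed = {}, {}

def pre_hook(name):
    return lambda module, inputs: starts.__setitem__(name, time.perf_counter())

def post_hook(name):
    return lambda module, inputs, output: elapsed.__setitem__(
        name, time.perf_counter() - starts[name])

# Attach timers to every convolution layer (conv1..conv5 in AlexNet)
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Conv2d):
        module.register_forward_pre_hook(pre_hook(name))
        module.register_forward_hook(post_hook(name))

with torch.no_grad():
    model(torch.randn(64, 3, 224, 224))   # one AlexNet-sized batch

for name, t in elapsed.items():
    print(f"{name}: {t * 1000:.1f} ms")
```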

SLIDE 18
  • Fully connected layers are much slower on KNL compared to P100
  • conv1 and conv3 also contribute to the degradation on KNL
  • conv2 is faster on KNL compared to P100

Diving Deeper: P100 vs. KNL (AlexNet)

[Chart: AlexNet layer-wise time (ms) on P100 vs. KNL-Opt; layers conv1-conv5, fc6, fc7]

SLIDE 19

Multi-node Results: ResNet-50

  • All results are weak scaling
  • Images/second is a derived metric, but it is more meaningful for understanding scalability

[Chart: ResNet-50 with Intel-Caffe, weak scaling on 2-32 nodes; training time (seconds) and images/second vs. number of nodes]
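As a minimal sketch of how the derived metric works under weak scaling, where every node processes its own batch each iteration so the global work grows with node count; the numbers below are illustrative, not the paper's measurements:

```python
def images_per_second(batch_per_node, num_nodes, iter_time_s):
    """Derived throughput under weak scaling: each node contributes a
    full batch per iteration, so the global batch grows with node count."""
    return batch_per_node * num_nodes / iter_time_s

# Illustrative only: batch size 32 per node, 16 nodes, 0.9 s per iteration
print(images_per_second(32, 16, 0.9))   # ~569 images/second
```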

SLIDE 20

Multi-node Results: AlexNet Comparison

  • OSU-Caffe vs. Intel-Caffe

– Different frameworks, so not directly comparable
– A rough comparison can still help in understanding scalability trends
– The design of a framework can affect performance for distributed training

  • MPI (or the communication runtime) can cause a marked difference

[Charts: AlexNet weak scaling for OSU-Caffe (GPU) vs. Intel-Caffe (CPU); images per second and training time (seconds) vs. number of nodes (1-32)]

SLIDE 21

Agenda

  • Introduction
  • Research Challenges
  • Design Discussion
  • Performance Characterization
  • Conclusion

SLIDE 22

Conclusion


  • CPUs are very comparable to GPUs for DNN training workloads if appropriate optimizations are exploited
  • GPUs are still faster than CPUs in general
  • KNL beats P100 in one case, but P100 beats KNL in most cases
  • Evaluating the performance of a DL framework:

– The hardware architecture matters
– But the software stack has a more significant impact than the hardware
– The full execution environment and communication runtime need to be evaluated to ensure fair comparisons

SLIDE 23

Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters

Arpan Jain, Ammar Ahmad Awan, Quentin Anthony, Hari Subramoni, and Dhabaleswar K. (DK) Panda

  • Dept. of Computer Science and Engineering

The Ohio State University

Credits: http://nowlab.cse.ohio-state.edu/static/media/talks (Arpan Jain)

SLIDE 24
  • Introduction
  • Background
  • Research Challenges
  • Characterization Strategy

– Evaluation Platforms and Software Libraries
– Experimental Setup

  • Performance Evaluation
  • Conclusion


Agenda

SLIDE 25
  • Easily implement and experiment with Deep Neural Networks

– Several Deep Learning (DL) frameworks have emerged

  • Caffe, PyTorch, TensorFlow, MXNet, and counting....

– Focus on TensorFlow and PyTorch

  • Most frameworks are optimized for NVIDIA GPUs

– but CPU-optimized implementations are also emerging, as we saw in the previous paper


Deep Learning Frameworks

SLIDE 26
  • The most widely used framework open-sourced by Google
  • Replaced Google’s DistBelief framework
  • Runs on almost all available execution platforms (CPU, GPU, TPU, mobile, etc.)

  • https://github.com/tensorflow/tensorflow

Deep Learning and TensorFlow

SLIDE 27
  • Introduction
  • Background
  • Research Challenges
  • Characterization Strategy

– Evaluation Platforms and Software Libraries
– Experimental Setup

  • Performance Evaluation
  • Conclusion


Agenda

SLIDE 28
  • Deep Neural Network training consists of two phases

– Forward pass
– Backward pass

  • Two approaches to distributing DNN training

– Data Parallelism (focus of this paper)
– Model Parallelism

[Diagrams: Data Parallelism and Model Parallelism]

Distributed DNN Training

SLIDE 29
  • Most ML/DL frameworks started with a single-node/single-GPU design

– Various multi-node design schemes have emerged since then!

  • Distributed training needs communication libraries to synchronize across nodes
  • DL Frameworks

– Caffe
– TensorFlow and PyTorch with Horovod (focus of this paper)

  • Communication Libraries for DL

– MPI libraries: MVAPICH2, Intel MPI, Open MPI
– NVIDIA NCCL (GPU only)


DL Frameworks and Communication Libraries
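For context, this is roughly what Horovod-based data-parallel training looks like from the framework side; a minimal PyTorch sketch with a toy model, assuming a working Horovod install (the real experiments use the benchmark scripts named later):

```python
# Minimal Horovod + PyTorch data-parallel loop (toy model, synthetic data).
# Launch one process per CPU/GPU, e.g.: mpirun -np 4 python train_sketch.py
import torch
import horovod.torch as hvd

hvd.init()

model = torch.nn.Linear(8, 1)              # stand-in for a real DNN
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Wrap the optimizer so gradients are averaged across workers via Allreduce
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())
# Make every worker start from identical parameters
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

for step in range(100):
    x = torch.randn(32, 8)                 # each rank sees its own data shard
    y = torch.randn(32, 1)
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()                        # triggers the gradient Allreduce
    optimizer.step()
```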

SLIDE 30

What is Allreduce, and how do DL frameworks use it?

  • A generic group communication pattern: an element-wise vector sum made available to all participants in the group
  • In the MPI world, we call it MPI_Allreduce
  • Needed in DNN training during gradient aggregation from different workers

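A minimal sketch of the pattern with mpi4py, standing in for what a DL framework does with its gradient buffers (the array contents are illustrative):

```python
# Each rank contributes its local "gradients"; after MPI_Allreduce every
# rank holds the element-wise sum, which is then averaged for the update.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
local_grads = np.full(4, float(comm.rank))   # stand-in gradient vector
summed = np.empty_like(local_grads)
comm.Allreduce(local_grads, summed, op=MPI.SUM)
avg_grads = summed / comm.size               # identical on every rank
print(comm.rank, avg_grads)
```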

SLIDE 31
  • Introduction
  • Background
  • Research Challenges
  • Characterization Strategy

– Evaluation Platforms and Software Libraries
– Experimental Setup

  • Performance Evaluation
  • Conclusion


Agenda

SLIDE 32

How do we systematically characterize CPU-based DNN training using TensorFlow and PyTorch at scale? And how do we achieve the best possible performance on different HPC systems?


Broad Challenge

SLIDE 33

Key Contributions


  • Describe single-process (SP), multi-process (MP), and multi-node (MN) approaches
  • Highlight up to 1.47X better performance for the MP approach over the SP approach
  • Evaluate five DNN architectures at scale (128 Xeon Skylake nodes)
  • Report a 125X speedup on 128 nodes for ResNet-152 with MVAPICH2
  • Summarize key insights gained from the systematic characterization

SLIDE 34
  • Introduction
  • Background
  • Research Challenges
  • Characterization Strategy

– Evaluation Platforms and Software Libraries
– Experimental Setup

  • Performance Evaluation
  • Conclusion


Agenda

SLIDE 35

Architecture   Cluster       Speed (GHz)   Cores   Threads per Core   Label
Skylake        RI2           2.6           28      1                  Skylake-1
Skylake        Pitzer        2.4           40      1                  Skylake-2
Skylake        Stampede2     2.1           48      2                  Skylake-3
Broadwell      RI2           2.4           28      1                  Broadwell
EPYC           AMD-Cluster   2.0           32      4                  EPYC


Evaluation Platforms

GPU    Cluster   Cores
K80    RI2       4992 CUDA cores (dual socket)
P100   Owens     3584 CUDA cores
V100   Pitzer    5120 CUDA cores, 640 Tensor cores
SLIDE 36
  • Deep Learning Frameworks

– Intel-optimized TensorFlow (v1.12) -- details on the next slide
– TensorFlow v1.12 (for GPUs and AMD processors)
– PyTorch (v1.1)

  • Horovod distributed-training middleware
  • MPI library: MVAPICH2
  • Scripts: tf_cnn_benchmarks and Horovod’s pytorch_synthetic_benchmarks


Software Libraries

SLIDE 37
  • Optimized by Intel for Intel Xeon CPUs
  • Uses Math Kernel Library for Deep Neural Networks (MKL-DNN) primitives
  • Can be installed easily using conda and pip
  • https://github.com/Intel-tensorflow


Intel Optimized TensorFlow
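A quick, hedged way to confirm that an MKL-enabled build is actually in use; this relies on a TF 1.x-era internal commonly referenced at the time, which is an assumption here rather than a public API:

```python
# Verify the installed TensorFlow build and whether MKL-DNN is enabled.
import tensorflow as tf
from tensorflow.python import pywrap_tensorflow  # TF 1.x internal (assumed)

print(tf.__version__)                    # e.g., 1.12.0
print(pywrap_tensorflow.IsMklEnabled())  # True on Intel-optimized builds
```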

SLIDE 38
  • Introduction
  • Background
  • Research Challenges
  • Characterization Strategy

– Evaluation Platforms and Software Libraries
– Experimental Setup

  • Performance Evaluation
  • Conclusion


Agenda

SLIDE 39

Four different types of experiments were performed

  • 1. Single Node Single Process (SP) Experiments
  • 2. Single Node Multi-Process (MP) Experiments
  • 3. Multi-Node Multi-Process (MN) Experiments
  • 4. GPU vs. CPU Comparisons


Experimental Setup

SLIDE 40
  • Introduction
  • Background
  • Research Challenges
  • Characterization Strategy

– Evaluation Platforms and Software Libraries
– Experimental Setup

  • Performance Evaluation
  • Conclusion


Agenda

SLIDE 41

Single Node Single Process (SP) Experiments

ResNet-50 Training performance

  • Different configurations lead to different performance trends
  • Key message: processes per node (PPN), batch size, and the number of threads are tunable parameters
  • Parameters need to be determined and tuned properly!
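As one concrete example of these knobs, the TF 1.x session API (which tf_cnn_benchmarks exposes as command-line flags) controls the thread counts; a minimal sketch with illustrative values:

```python
# Thread-level tuning knobs for CPU-based TF 1.x training (values illustrative).
import tensorflow as tf

config = tf.ConfigProto(
    intra_op_parallelism_threads=28,  # threads within one op, e.g. a convolution
    inter_op_parallelism_threads=2)   # independent ops that may run concurrently
sess = tf.Session(config=config)
```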

SLIDE 42

SP: Effect of Hyper-Threading

ResNet-50 Training performance

  • Skylake-3 on Stampede2 is hyper-threaded (two threads per core)
  • It is possible to run TF on 96 threads
  • But performance degrades beyond 48 threads – Why?
  • It depends on the size and type of the DNN

SLIDE 43

Single Node Multi-Process (MP) Experiments

ResNet-152 Training performance

  • BS=64: 4 ppn is better
  • BS=32: 8 ppn is slightly better
  • However, keeping the effective batch size (EBS) low is more important! – Why? DNNs do not converge to state-of-the-art accuracy when the batch size is large. (EBS = per-process batch size X number of processes; e.g., BS=32 at 8 ppn and BS=64 at 4 ppn both give EBS = 256.)

ResNet-152 (SP vs. MP)

  • MP is better for all effective batch sizes
  • Up to 1.35X better performance for MP compared to SP for BS=64

SLIDE 44
  • We use the best SP configuration to run Multi-node experiments
  • Evaluate five models to identify common trends

– All models give near-linear scaling on both platforms

Multi-Node Multi-Process (MN) Experiments

[Charts: weak scaling on Skylake-1 (28 cores) and Skylake-2 (40 cores)]

SLIDE 45

Multi-Node Multi-Process (MN): MP vs. SP?

Skylake-3 (48 cores, 96 threads)

  • Scale: 32 nodes
  • MP-Tuned: up to 1.5X better than SP
  • MP-Tuned: 10% better than MP-Default
  • Why is MP-Tuned better?

– It uses the best possible number of inter-op and intra-op threads

SLIDE 46

Multi-Node Multi-Process (MN): TF vs. PyTorch

TensorFlow

  • TensorFlow: up to a 125X speedup for ResNet-152 on 128 nodes

PyTorch

  • This is an early experience with PyTorch
  • PyTorch scales well but has overall lower performance than TensorFlow
  • TensorFlow is up to 2.5X faster than PyTorch on 128 nodes

SLIDE 47

Multi-Node Multi-Process (MN): AMD Platform

EPYC for TensorFlow

  • TensorFlow is 4X slower on EPYC compared to Skylake-3
  • For EPYC, there is no optimized TensorFlow

EPYC for PyTorch

  • PyTorch does better than TensorFlow
  • Up to 19% better than TensorFlow on 8 nodes

SLIDE 48

TensorFlow and PyTorch: CPU vs. GPU

TensorFlow on GPUs vs. CPUs

  • Inception-v4: Skylake-3 is up to 2.35X faster than K80s
  • ResNet-101: V100s are up to 3.32X faster than Skylake-3

Multi-Node: TensorFlow (TF) vs. PyTorch (PT)

  • ResNet-50: PT is slightly better than TF
  • ResNet-152: PT is up to 12% better than TF
SLIDE 49
  • Introduction
  • Background
  • Research Challenges
  • Characterization Strategy

– Evaluation Platforms and Software Libraries
– Experimental Setup

  • Performance Evaluation
  • Conclusion


Agenda

SLIDE 50
  • In-depth characterization of distributed training with TensorFlow, and early results for PyTorch

– Experiments on five HPC clusters, including Stampede2, and three different CPU architectures: Skylake, Broadwell, and AMD EPYC
– Single Node Single Process (SP) and Single Node Multi-Process (MP) experiments to determine the best single-node performance
– The best single-node configuration is used for multi-node experiments
– Up to 128 nodes to show DNN training scaling
– GPU vs. CPU comparisons for both TensorFlow and PyTorch

  • Guidelines for DL researchers to get the best performance on CPU platforms


Conclusion