[PPT] - Deep Learning on HPC: Performance Factors and Lessons Learned Weijia PowerPoint Presentation

SLIDE 1

Deep Learning on HPC: Performance Factors and Lessons Learned

Weijia Xu Scalable Computational Intelligence Group Texas Advanced Computing Center The University of Texas at Austin BenchCounil’19 Denver, CO

1

SLIDE 2

Outline

Motivating applications at TACC
Challenges for running deep learning on HPC.
Scalability and accuracy
Scalability and I/O
Memory error impact
Conclusions and Discussions

2

SLIDE 3

Motivating Applications

Traffic Camera Video Analysis
In Collaboration with City of Austin
~ 540MB/hour MPEG4 video from one camera
~100GB for a typical study from a single camera

3 [1] “Deep learning methods to leverage traffic monitoring cameras for pedestrian data applications” Weijia Xu, Natalia Ruiz, Ruizhu Huang, Joel Meyer, and Jen Duthie, John Clary, 26th ITS Word Congress, (Best Technical Paper) [2] Detecting Pedestrian Crossing Events in Large Video Data form Traffic Monitoring Cameras, Weijia Xu, Natalia Ruiz, Kelly Pierce, Ruizhu Huang, Joel Meyer, and Jen Duthie, to appear IEEE BigData2019

SLIDE 4

Motivating Applications

There are over 400 CCTV IP cameras

within city limit of Austin

Mostly just used for manual monitoring
With deep learning, we can
Learn more about traffic pattern
Understand how road are used
Improving pedestrian safety.
A lot of unexpected…

4

SLIDE 5

Motivating Applications

Neural image resolution enhancement with super

resolution generative adversarial network.

In collaboration with Salk Institute
~600 GB neural image dataset
Pytorch+FastAI, Each run of the early version takes

~24 hours on 16 NVIDIA V100 GPUs

5 Biorixv’19 Fang, L., Monroe, F ., Novak, S.W., Kirk, L., Schiavon, C.R., Seungyoon, B.Y., Zhang, T., Wu, M., Kastner, K., Kubota, Y.   and Zhang, Z., 2019. Deep Learning-Based Point-Scanning Super-Resolution Imaging. bioRxiv, p.740548.

SLIDE 6

Motivating Applications

Face recognition
In Collaboration with NASA JPL
~100 GB image data
TensorFlow + Horovod
Each run takes ~12 hours on 16 NVIDIA GTX 1080 Ti GPUs

6 [1]DLS’19-1 Mattmann, Chris A., Zhang, Z.. Deep Facial Recognition with TensorFlow,   The 3rd Deep Learning on Supercomputers Workshop, in conjunction with SC’19, Denver, CO  [2] Courtesy image from: https://megapixels.cc/datasets/msceleb/

SLIDE 7

More DL Applications at TACC

Deep learning is both compute intensive and data

intensive.

7

SLIDE 8

Scale up vs. Scale out

Scale Up
Better and faster GPU cards /

Specialized hardware, e.g. TPU

High acquisition cost to build large

cluster

Scale Out
Using more computing nodes
Consistent with traditional HPC
perations.
Specific challenges
Accuracy vs scalability
I/O issues

8

SLIDE 9

The Race of ResNet50

90-epoch ResNet-50 training finished in 20 mins on

2,048 KNL with 74.9% top-1 accuracy

Against 8 GPUs baseline

9

1 29 28 56 116 264 466 791 100 200 300 400 500 600 700 800 900 He et al. (Microsoft) Goyal et al. (Facebook) Cordeanu et

al. (SURF &

Intel) You et al (UBC & TACC) preferred networks Tencent & HKBU Sony Research Google 1024 TPUv3

ResNet 50 ImageNet Training Acceleration from 2016~2019

SLIDE 10

Accuracy vs Scalability

To yield high utilization at scale, we need to feed enough

data (computation), which results in large batch size

Validation (test) accuracy is sensitive to batch size.
Large batch size can result in degraded validation

accuracy

Layer-wise Adaptive Rate Scaling algorithm (LARS)1
Intuition: learning rate should be adjusted according

to the norm of the weights in each layer

10 [1] You, Yang, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. "ImageNet training in minutes." In Proceedings of the 47th International Conference on Parallel Processing, p. 1. ACM, 2018. Best paper

SLIDE 11

Scalable Training Algorithm

Using batch size of 32K while preserving validation

accuracy

11

SLIDE 12

Scalable Training Algorithm

Using batch size of 32K on Intel Xeon Phi 7250 (KNL) and Intel Xeon

Platinum 8160 (SKX) nodes

12 You, Yang, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. "ImageNet training in minutes." In Proceedings of the 47th International Conference on Parallel Processing, p. 1. ACM, 2018. Best paper

SLIDE 13

Scalability vs Data I/O

ResNet-50 with ImageNet on 16 Nvidia 1080Ti GPUs,

mini-batch of 64 per GPU

13

Tpt (imgs/sec) 2250 4500 6750 9000 Number of Nodes (4 GPUs per node) 1 4 8 16 Lustre Ideal 8704 4352 2176 544 2786 2467 270 447 544 2176 4352 8704

SLIDE 14

I/O on Lustre

14

Dataset # files # dirs total_size file_size ImageNet 1.3 million 2002 140 GB KB-MB Neural  Image 0.6 million 6 500 GB MB Reactor  Status 0.17 million 1 65 GB KB

SLIDE 15

Deep Learning I/O

DL’s long lasting, repeated, high volume, and highly

concurrent file access can easily saturate the metadata and data service of traditional shared file system.

ResNet-50 with Keras, TensorFlow, and Horovod on 16

nodes, each with 4 GPUs

128K stat() and readdir() operations with 64-way

concurrent access

117M stat(), open(), close() operations with 256-way

concurrent access

~180M read() operations with same concurrency
~ 8 hour duration

15 [DLS’19-2] Zhang, Z., Huang, L., Pauloski, J. G., Foster, Ian T., “Aggregating Local Storage for Scalable Deep Learning I/O”,   The 3rd Deep Learning on Supercomputers Workshop, in conjunction with SC’19, Denver, CO 

SLIDE 16

FanStore Design

FanStore is a transient runtime file

system that optimizes I/O for distributed DL training.1

Data is partitioned (optionally

compressed) and spread  across local storage space

File access functions are

intercepted and handled in  user space

Remote file access is in the

form of MPI round-trip   message

16 [1] Zhang, Z., Huang, L., Pauloski, J. G., Foster, Ian T., “Aggregating Local Storage for Scalable Deep Learning I/O”,   The 3rd Deep Learning on Supercomputers Workshop, in conjunction with SC’19, Denver, CO 

SLIDE 17

ResNet-50 GPU Results

17

2250 4500 6750 9000 Number of Nodes (4 GPUs per node) 1 4 8 16 Lustre Ideal FanStore

7867 4050 1902 544

8704 4352 2176 544 2786 2467 270 447 544 2176 4352 8704

4 Nvidia 1080Ti GPUs per node, mini-batch of 64 per

GPU

SLIDE 18

ResNet-50 CPU Results

18

4500 9000 13500 18000 Number of Nodes 1 64 128 256 512 FanStore Ideal 16384 8192 4096 2048 32 15109 7710 3901 1968 32 32 2048 4096 8192 16384

Intel Xeon Platinum 8160 nodes on Stampede2, mini-

batch size of 256 per node

SLIDE 19

Memory Error in Deep Learning

The impact of memory error on deep learning training is

unclear, due to its stochastic nature and mathematical properties

Difficult for computing centers or individual researchers to

make hardware procurement decisions

Difficult for users to estimate their confidence in training

correctness on ECC-free processors

Potential performance gain by not using ECC
To quantify the impact of memory errors on deep learning

training and to investigate alternative solutions for memory error detection

19 [Cluster’19] Zhang, Z. Huang, L., Huang, R., Xu, W., Katz, D. S., “Quantifying the Impact of Memory Errors in Deep learning”, IEEE Cluster 2019, Albuquerque, NM

SLIDE 20

Technical Approach

Focusing on impact from silent data corruption (SDC)
P(Failure) ≈ P(Failure, SDC) = P(Failure | SDC) x P(SDC)
To evaluate P(Failure | SDC)
Sampling in the experiment design space
Manually flipping the selected bit
Observing validation accuracy and training loss
Estimating P(Failure | SDC) via marginal probability

20

SLIDE 21

Testing Applications

21

App SW Version Node Device Memory Mem Usage Run Time ConvNet

nvcaffe 0.16.5 1 2x1080 Ti 11 GB 0.45 GB 4.5 mins

LRCN

caffe 1.0.0 1 1x1080 Ti 11 GB 3.9 GB 16 mins

ResNet50

Intel- Caffe 1.1.0 512 KNL 96 GB 18.4 GB 8 mins

SLIDE 22

Example of ConvNet with Cifar10

ConvNet with Cifar10 dataset baseline
50,000 training items/10,000

validation items

60,000 Iterations/120 epochs
Batch size: 100
Top-1 Test Accuracy Acceptable

range: 76.52% - 80.83%

Training Loss Acceptable range:

0.2594 - 0.4975

22

Parameter Value Iteration 200, 10200, 20200, 30200, 40200, 50200, 60000 Phase forward, backward Place data, model Layers 1, 2, …, 15 Parameter Layers 1, 2, …, 7 Data Position 0, mid, last Bit Position 31, 30, 29, 28, 27, 22 Repetition 3

SLIDE 23

Key Observations

Training failure is independent of iteration number
Used part of training process instead of complete runs.
Errors on less significant bits lead to less training failures
Convolution layers have the most training failures, so we

estimate the worst case failure rate assuming every layer is a convolution layer

Training loss in the immediate next iteration is an effective

signal to detect catastrophic SDCs.

23

SLIDE 24

Memory Error Impact on DL Training

24

App P(SDC) P(F|SDC) Scaling Factor P(F) Expected  Runs per Failure ConvNet 3.07 *10-6 1.76% 1 5.4 * 10-8 18.5 M ResNet50 5.89 *10-2 1.22% 9 7.18 * 10-4 1,610 LRCN 5.19 *10-3 0.61% 110 3.17 * 10-5 31,500

[Cluster’19] Zhang, Z. Huang, L., Huang, R., Xu, W., Katz, D. S., “Quantifying the Impact of Memory Errors in Deep learning”, IEEE Cluster 2019, Albuquerque, NM

SLIDE 25

Conclusions

Scalable deep learning is required for large and future

problems

Challenges for scaling out deep learning
How to maintain accuracy while maximize scalability?
Deep learning is also a big data problem with more

computations, i.e. both IO and computing intensive

Different hardware requirement

25

SLIDE 26

Discussions

Machine learning on HPC is a growing area and in need

a dedicated benchmark

Science applications
data set
Models
Running on HPC
accuracy and convergence
Communication and data bottleneck
System cost, e.g. acquisition and operations
Overall system balance, reliability

26

SLIDE 27

THANKS

Distributed Deep Learning tutorial
1:30 - 5pm Nov. 17 Room 201
Deep Learning on Supercomputers workshop
9:00 - 5:30 Nov. 17 Room 502
Contact
Weijia Xu: xwj@tacc.utexas.edu
General data inquiry: data@tacc.utexas.edu

27

Deep Learning on HPC: Performance Factors and Lessons Learned

Outline

Motivating Applications

Motivating Applications

within city limit of Austin

Motivating Applications

resolution generative adversarial network.

~24 hours on 16 NVIDIA V100 GPUs

Motivating Applications

More DL Applications at TACC

intensive.

Scale up vs. Scale out

The Race of ResNet50

2,048 KNL with 74.9% top-1 accuracy

Accuracy vs Scalability

data (computation), which results in large batch size

accuracy

to the norm of the weights in each layer

Scalable Training Algorithm

accuracy

Scalable Training Algorithm

Scalability vs Data I/O

mini-batch of 64 per GPU

I/O on Lustre

Deep Learning I/O

concurrent file access can easily saturate the metadata and data service of traditional shared file system.

nodes, each with 4 GPUs

concurrent access

concurrent access

FanStore Design

system that optimizes I/O for distributed DL training.1

compressed) and spread across local storage space

intercepted and handled in user space

form of MPI round-trip message

ResNet-50 GPU Results

GPU

ResNet-50 CPU Results

batch size of 256 per node

Memory Error in Deep Learning

unclear, due to its stochastic nature and mathematical properties

make hardware procurement decisions

correctness on ECC-free processors

training and to investigate alternative solutions for memory error detection

Technical Approach

Testing Applications

Example of ConvNet with Cifar10

Key Observations

estimate the worst case failure rate assuming every layer is a convolution layer

signal to detect catastrophic SDCs.

Memory Error Impact on DL Training

Conclusions

problems

computations, i.e. both IO and computing intensive

Discussions

a dedicated benchmark

THANKS

compressed) and spread  across local storage space

intercepted and handled in  user space

form of MPI round-trip   message