Deep Learning on HPC: Performance Factors and Lessons Learned
Weijia Xu, Scalable Computational Intelligence Group, Texas Advanced Computing Center, The University of Texas at Austin. BenchCouncil'19, Denver, CO
[1] "Deep learning methods to leverage traffic monitoring cameras for pedestrian data applications," Weijia Xu, Natalia Ruiz, Ruizhu Huang, Joel Meyer, Jen Duthie, and John Clary, 26th ITS World Congress (Best Technical Paper)
[2] "Detecting Pedestrian Crossing Events in Large Video Data from Traffic Monitoring Cameras," Weijia Xu, Natalia Ruiz, Kelly Pierce, Ruizhu Huang, Joel Meyer, and Jen Duthie, to appear in IEEE BigData 2019
[bioRxiv'19] Fang, L., Monroe, F., Novak, S.W., Kirk, L., Schiavon, C.R., Seungyoon, B.Y., Zhang, T., Wu, M., Kastner, K., Kubota, Y. and Zhang, Z., 2019. "Deep Learning-Based Point-Scanning Super-Resolution Imaging." bioRxiv, p. 740548.
[DLS'19-1] Mattmann, Chris A., Zhang, Z., "Deep Facial Recognition with TensorFlow," The 3rd Deep Learning on Supercomputers Workshop, in conjunction with SC'19, Denver, CO
[2] Courtesy image from: https://megapixels.cc/datasets/msceleb/
Specialized hardware, e.g., TPU cluster
[Figure] ResNet-50 ImageNet training acceleration, 2016–2019 (relative speedup):
He et al. (Microsoft): 1×
Goyal et al. (Facebook): 29×
Codreanu et al. (Intel): 28×
You et al. (UC Berkeley & TACC): 56×
Preferred Networks: 116×
Tencent & HKBU: 264×
Sony Research: 466×
Google (1024 TPUv3): 791×
[1] You, Yang, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. "ImageNet Training in Minutes." In Proceedings of the 47th International Conference on Parallel Processing (ICPP), p. 1. ACM, 2018. (Best Paper)
Intel Xeon Platinum 8160 (SKX) nodes
[Figure] Training throughput (imgs/sec) vs. number of nodes (4 GPUs per node), reading data from Lustre:
Nodes:  1    4     8     16
Ideal:  544  2176  4352  8704
Throughput with Lustre falls increasingly short of ideal linear scaling as nodes are added.
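The ideal curve in the throughput figure above is simply linear scaling from the single-node baseline of 544 imgs/sec; a minimal sketch (the helper name is my own):

```python
# Ideal (linear) throughput target: single-node baseline times node count.
BASELINE_IMGS_PER_SEC = 544  # measured on 1 node (4 GPUs)

def ideal_throughput(nodes: int) -> int:
    """Linear-scaling throughput target for a given node count."""
    return BASELINE_IMGS_PER_SEC * nodes

for n in (1, 4, 8, 16):
    print(f"{n:2d} nodes -> {ideal_throughput(n)} imgs/sec")
```

At 16 nodes this gives the 8704 imgs/sec ceiling shown in the figure; the gap between this line and measured Lustre throughput is the I/O bottleneck.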
Dataset         # files       # dirs  Total size  File size
ImageNet        1.3 million   2,002   140 GB      KB–MB
Neural Image    0.6 million   6       500 GB      MB
Reactor Status  0.17 million  1       65 GB       KB
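A back-of-the-envelope check on the table above: the mean file size implied by each dataset's totals, which is what makes these workloads metadata-heavy on a shared parallel file system (the dictionary layout and GiB-to-bytes conversion are my own assumptions):

```python
# Mean file size per dataset = total size / file count.
DATASETS = {
    # name: (file count, total size in bytes)
    "ImageNet":       (1_300_000, 140 * 1024**3),
    "Neural Image":   (  600_000, 500 * 1024**3),
    "Reactor Status": (  170_000,  65 * 1024**3),
}

for name, (nfiles, total_bytes) in DATASETS.items():
    mean_kib = total_bytes / nfiles / 1024
    print(f"{name}: ~{mean_kib:.0f} KiB per file")
```

All three datasets average well under 1 MiB per file, consistent with the KB–MB file-size column: millions of tiny reads per epoch, dominated by metadata and small-I/O cost rather than bandwidth.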
[DLS'19-2] Zhang, Z., Huang, L., Pauloski, J. G., Foster, Ian T., "Aggregating Local Storage for Scalable Deep Learning I/O," The 3rd Deep Learning on Supercomputers Workshop, in conjunction with SC'19, Denver, CO
[Figure] Training throughput (imgs/sec) vs. number of nodes (4 GPUs per node), comparing Lustre, FanStore, and ideal scaling:
Nodes:     1    4     8     16
Ideal:     544  2176  4352  8704
FanStore:  544  1902  4050  7867
FanStore tracks ideal scaling closely where Lustre falls far behind.
[Figure] FanStore training throughput (imgs/sec) vs. number of nodes:
Nodes:     1   64    128   256   512
Ideal:     32  2048  4096  8192  16384
FanStore:  32  1968  3901  7710  15109
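The table above implies FanStore's parallel efficiency at each scale; a quick check (efficiency = measured / ideal, transcribed from the figure):

```python
# Parallel efficiency of FanStore relative to ideal linear scaling.
IDEAL    = {1: 32, 64: 2048, 128: 4096, 256: 8192, 512: 16384}
FANSTORE = {1: 32, 64: 1968, 128: 3901, 256: 7710, 512: 15109}

for nodes, ideal in IDEAL.items():
    efficiency = FANSTORE[nodes] / ideal
    print(f"{nodes:4d} nodes: {efficiency:.1%} of ideal")
```

Even at 512 nodes, FanStore sustains roughly 92% of ideal throughput.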
[Cluster'19] Zhang, Z., Huang, L., Huang, R., Xu, W., Katz, D. S., "Quantifying the Impact of Memory Errors in Deep Learning," IEEE Cluster 2019, Albuquerque, NM
App       SW           Version  Nodes  Device      Memory  Mem Usage  Run Time
ConvNet   nvcaffe      0.16.5   1      2x 1080 Ti  11 GB   0.45 GB    4.5 min
LRCN      caffe        1.0.0    1      1x 1080 Ti  11 GB   3.9 GB     16 min
ResNet50  Intel-Caffe  1.1.0    512    KNL         96 GB   18.4 GB    8 min
Validation results across repeated runs ranged from 76.52% to 80.83% and from 0.2594 to 0.4975.
Parameter         Value
Iteration         200, 10200, 20200, 30200, 40200, 50200, 60000
Phase             forward, backward
Place             data, model
Layers            1, 2, …, 15
Parameter Layers  1, 2, …, 7
Data Position     0, mid, last
Bit Position      31, 30, 29, 28, 27, 22
Repetition        3
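The Bit Position row refers to single-bit flips in 32-bit floating-point values. A minimal illustration of such an injection, not the paper's instrumented code (`flip_bit` is a hypothetical helper):

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = mantissa LSB, 31 = sign) of a float32 value."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return flipped

# Flipping the sign bit (31) negates the value; flipping a high
# exponent bit (e.g. 30) can turn an ordinary weight into inf.
print(flip_bit(1.0, 31))  # -1.0
print(flip_bit(1.0, 30))  # inf
```

This is why the experiment sweeps the high bit positions (31–27): flips there can change a value by many orders of magnitude, while low mantissa bits (e.g. 22) perturb it only slightly.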
App       P(SDC)       P(F|SDC)  Scaling Factor  P(F)         Expected Runs per Failure
ConvNet   3.07×10^-6   1.76%     1               5.4×10^-8    18.5 M
ResNet50  5.89×10^-2   1.22%     9               7.18×10^-4   1,610
LRCN      5.19×10^-3   0.61%     110             3.17×10^-5   31,500
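Assuming the P(F) column is the product P(SDC) × P(F|SDC), the table's values can be reproduced to the precision shown; a sanity-check sketch (rows transcribed from the table above):

```python
# Reproduce P(F) = P(SDC) * P(F|SDC) for each application.
ROWS = {
    # app: (P(SDC), P(F|SDC))
    "ConvNet":  (3.07e-6, 0.0176),
    "ResNet50": (5.89e-2, 0.0122),
    "LRCN":     (5.19e-3, 0.0061),
}

for app, (p_sdc, p_f_given_sdc) in ROWS.items():
    p_f = p_sdc * p_f_given_sdc
    print(f"{app}: P(F) = {p_f:.3g}, ~{1 / p_f:,.0f} runs per failure")
```

For ConvNet and LRCN, 1/P(F) also tracks the expected-runs-per-failure column (roughly 18.5 M and 31,500 runs respectively).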