SLIDE 1

HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training with TensorFlow

Ammar Ahmad Awan, Arpan Jain, Quentin Anthony, Hari Subramoni, and Dhabaleswar K. Panda
Network Based Computing Laboratory (NBCL)
Dept. of Computer Science and Engineering, The Ohio State University

{awan.10, jain.575, anthony.301, subramoni.1, panda.2}@osu.edu

SLIDE 2

  • Introduction and Motivation
  • Problems and Challenges
  • Key Contribution
  • Performance Characterization
  • Conclusion

Agenda

SLIDE 3

The Deep Learning (DL) Revolution

Source: https://thenewstack.io/demystifying-deep-learning-and-artificial-intelligence/

Adopted from: http://www.deeplearningbook.org/contents/intro.html

[Figure: nested view of AI ⊃ Machine Learning (ML) ⊃ Deep Learning (DL); examples: Logistic Regression (ML); MLPs, DNNs (DL)]

  • Deep Learning – A technique to achieve Artificial Intelligence

– Uses Deep Neural Networks

SLIDE 4

Deep Learning meets Super Computers

[Chart: Accelerator/CP Family Performance Share, www.top500.org]

  • NVIDIA GPUs - major force for accelerating DL workloads

– Computational requirement is increasing exponentially

Courtesy: https://openai.com/blog/ai-and-compute/

SLIDE 5

  • Data parallelism

– Horovod: TensorFlow, PyTorch, and MXNet (a minimal Horovod sketch follows this list)
– TensorFlow: tf.distribute.Strategy API
– PyTorch: torch.nn.parallel.DistributedDataParallel

  • Model-parallelism and Hybrid-parallelism

– No framework-level support (LBANN is the only framework with built-in support)
– Higher-level frameworks: GPipe, Mesh-TensorFlow, etc.
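As a concrete reference for the data-parallel APIs listed above, here is a minimal Horovod + Keras sketch (my own illustration with a toy model and synthetic data; launch with e.g. mpirun -np 4 python train_dp.py):

```python
# Minimal Horovod data-parallel training sketch (assumes TF 2.x and Horovod are installed).
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                             # one process per rank (GPU/CPU)

# Toy model and synthetic data; any Keras model is handled the same way.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
x = tf.random.uniform((512, 32))
y = tf.random.uniform((512,), maxval=10, dtype=tf.int32)

opt = tf.keras.optimizers.SGD(0.01 * hvd.size())       # scale learning rate by replica count
opt = hvd.DistributedOptimizer(opt)                    # allreduce gradients across ranks
model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]   # sync initial weights
model.fit(x, y, batch_size=64, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```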

How to make Training Faster?

SLIDE 6

  • Data Parallelism (most common)
  • Model and Hybrid Parallelism (emerging)
  • ‘X’-Parallelism

– ‘X’ → Spatial, Channel, Filter, etc.

Distributed/Parallel Training Strategies for DNNs

[Figure: Model Parallelism vs. Data Parallelism vs. Hybrid (Model and Data) Parallelism]

Courtesy: http://engineering.skymind.io/distributed-deep-learning-part-1-an-introduction-to-distributed-training-of-neural-networks

SLIDE 7

  • Data Parallelism – only for models that fit in memory
  • Out-of-core models

– Deeper model → better accuracy, but more memory required!

  • Model parallelism can work for out-of-core models!
  • Designing a system for model-parallelism is challenging

Why Model Parallelism?

SLIDE 8

  • Introduction and Motivation
  • Problems and Challenges
  • Key Contribution
  • Performance Characterization
  • Conclusion

Agenda

SLIDE 9

  • Defining a distributed model -- necessary but difficult

– requires knowledge of the model, communication library, and distributed hardware

  • Implementing distributed forward/back-propagation

– needed because partitions reside in different memory spaces and need explicit communication (see the two-rank sketch after this list)

  • Obtaining parallel speedup on an inherently sequential task

– forward pass followed by a backward pass
– Limited opportunity for parallelism and scalability

  • Achieving scalability without losing out on a model’s accuracy

– Valid concern for all types of parallelism strategies
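To make the explicit-communication requirement concrete, here is a minimal two-rank sketch (my own illustration using mpi4py and TF eager execution, not HyPar-Flow's code): rank 0 holds the first partition and sends its activations forward; rank 1 holds the second partition, computes a toy loss, and sends the gradient of the received activations back.

```python
# Two-process illustration of distributed forward/back-propagation across partitions.
# Run with: mpirun -np 2 python mp_sketch.py  (requires mpi4py and TensorFlow 2.x)
import numpy as np
import tensorflow as tf
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:                                     # partition 0: input -> hidden
    part = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu")])
    x = tf.random.uniform((32, 16))
    with tf.GradientTape() as tape:
        act = part(x)                             # forward pass of this partition
    comm.Send(np.ascontiguousarray(act.numpy()), dest=1, tag=0)   # ship activations forward
    grad_act = np.empty((32, 64), dtype=np.float32)
    comm.Recv(grad_act, source=1, tag=1)          # receive d(loss)/d(activations)
    grads = tape.gradient(act, part.trainable_variables,
                          output_gradients=tf.constant(grad_act))
    # 'grads' would now be handed to an optimizer for partition 0's weights
else:                                             # partition 1: hidden -> output
    part = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    act = np.empty((32, 64), dtype=np.float32)
    comm.Recv(act, source=0, tag=0)               # receive activations from partition 0
    act_t = tf.constant(act)
    with tf.GradientTape() as tape:
        tape.watch(act_t)
        loss = tf.reduce_mean(tf.square(part(act_t)))    # toy loss for illustration
    g = tape.gradient(loss, [act_t] + part.trainable_variables)
    comm.Send(np.ascontiguousarray(g[0].numpy()), dest=0, tag=1)  # send gradient back
```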

Major Problems

SLIDE 10

Research Challenges

Meet HyPar-Flow!

Challenge-1: Model-Definition APIs and Framework-specific Features
Challenge-2: Communication between Partitions and Replicas
Challenge-3: Applying HPC Techniques to Improve Performance

SLIDE 11

  • Introduction and Motivation
  • Problems and Challenges
  • Key Contribution
  • Performance Evaluation
  • Conclusion

Agenda

SLIDE 12

Key Contribution: Propose, Design, and Evaluate HyPar-Flow

  • HyPar-Flow is practical (easy-to-use) and high-performance (uses MPI)

– Based on Keras models and exploits TF 2.0 Eager Execution
– Leverages performance of MPI pt-to-pt. and collectives for communication (a hypothetical usage sketch follows)
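The slide does not show the programming interface itself; the sketch below is only a hypothetical illustration of the "practical (easy-to-use)" claim, i.e. an unmodified Keras model handed to a hybrid-parallel, fit-style call. The hyparflow module name and the call signature are assumptions for illustration, not the published API.

```python
# Hypothetical usage sketch only: module name and call signature are assumed,
# not taken from the HyPar-Flow slides.
import tensorflow as tf
# import hyparflow as hf                     # assumed package name (illustrative)

# The user writes a completely standard Keras model ...
model = tf.keras.applications.ResNet50(weights=None, classes=1000)

# ... and, conceptually, a hybrid-parallel trainer partitions it across ranks,
# replicates it across nodes, and handles all MPI communication internally:
#
#   hf.fit(model, train_dataset,
#          num_partitions=48,                # model-parallel partitions per replica
#          num_replicas=512)                 # data-parallel replicas
#
# Launch: mpirun -np <num_partitions * num_replicas> python train.py
```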

SLIDE 13

HyPar-Flow: Overview

SLIDE 14

  • Model Generator is crucial for productivity
  • Load Balancer is crucial for performance
  • Trainer – Core of Back-propagation

– Several system-level challenges
– Communication of tensors
– Blocking or non-blocking
– Efficient pipelining is needed (see the pipelining sketch after this list)

  • Communication Engine

– Isolate communication interfaces
– Unified Data, Model, and Hybrid Parallelism
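A small sketch of the non-blocking/pipelining point above (my own mpi4py illustration, not HyPar-Flow code): a partition posts non-blocking sends of per-micro-batch activations so that communication overlaps with the next micro-batch's compute.

```python
# Pipelining sketch: overlap communication of one micro-batch with compute of the next.
# Run with: mpirun -np 2 python pipeline_sketch.py  (requires mpi4py and numpy)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
MICRO_BATCHES, SHAPE = 4, (8, 64)

if rank == 0:
    bufs, reqs = [], []
    for i in range(MICRO_BATCHES):
        act = np.random.rand(*SHAPE).astype(np.float32)  # stand-in for a forward pass
        bufs.append(act)                                  # keep buffer alive until send completes
        reqs.append(comm.Isend(act, dest=1, tag=i))       # non-blocking send, overlaps with next loop
    MPI.Request.Waitall(reqs)                             # drain outstanding sends
elif rank == 1:
    for i in range(MICRO_BATCHES):
        buf = np.empty(SHAPE, dtype=np.float32)
        comm.Recv(buf, source=0, tag=i)                   # consume micro-batch i
        _ = buf.sum()                                     # stand-in for the next partition's compute
```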

HyPar-Flow: Components

SLIDE 15

Special Handling for Models with Skip Connections

SLIDE 16

  • Introduction and Motivation
  • Problems and Challenges
  • Key Contribution
  • Performance Characterization
  • Conclusion

Agenda

SLIDE 17

  • 3 Systems

– Frontera at Texas Advanced Computing Center (TACC)
– Stampede2 (Skylake partition) at TACC
– AMD EPYC: Local system with dual-socket AMD EPYC 7551 32-core processors

  • Interconnect

– Frontera -- Mellanox InfiniBand HDR-100 HCAs
– Stampede2 -- Intel Omni-Path HFIs

  • TensorFlow v1.13, MVAPICH2 2.3.2 on Frontera and EPYC, and Intel MPI 2018 on Stampede2
  • We use and modify model definitions for VGG and ResNet(s) from keras.applications

Evaluation Setup

SLIDE 18

  • The following variants have been compared:

– SEQ (GT) - Sequential using tf.GradientTape (GT); a minimal example of such a training step follows this list
– SEQ (MF) - Sequential using model.fit (MF)
– SEQ (MF-E) - Sequential using model.fit (MF) and (E)ager Execution
– HF-MP (2)/(56) - HyPar-Flow model-parallel with 2/48 model-partitions
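For reference, SEQ (GT) denotes a hand-written training step of the following kind, shown here as a minimal tf.GradientTape sketch with a placeholder model and synthetic data:

```python
# Minimal tf.GradientTape training step of the kind SEQ (GT) refers to (TF 2.x, eager).
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                             tf.keras.layers.Dense(10)])
opt = tf.keras.optimizers.SGD(0.01)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

x = tf.random.uniform((128, 32))                           # synthetic batch
y = tf.random.uniform((128,), maxval=10, dtype=tf.int32)

with tf.GradientTape() as tape:
    logits = model(x, training=True)                       # forward pass
    loss = loss_fn(y, logits)
grads = tape.gradient(loss, model.trainable_variables)     # back-propagation
opt.apply_gradients(zip(grads, model.trainable_variables)) # parameter update
```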

Verifying the Correctness of HyPar-Flow

[Plots: VGG-16, ResNet-110, and ResNet-1k]

SLIDE 19

  • ResNet-1k -- scales with batch size on one node as well as two nodes
  • Reason for scaling?

– Counter-intuitive for model-parallelism to scale better than data-parallelism
– Poor CPU implementation?

Model/Hybrid Parallelism on single/two nodes

SLIDE 20

  • AmoebaNet -- different architecture compared to ResNet(s)
  • More branches and skip connections
  • Scales well using HyPar-Flow
  • Memory-hungry, so single node restricted to BatchSize=64

Hybrid Parallelism for AmoebaNet

SLIDE 21

  • CPU based results

– AMD EPYC
– Intel Xeon

  • Excellent speedups for

– VGG-19
– ResNet-110
– ResNet-1000 (1k layers)

  • Able to train “future” models

– E.g. ResNet-5000 (a synthetic 5000-layer model we benchmarked)

HyPar-Flow (HF): Flexibility and Scalability

110x speedup on 128 Intel Xeon Skylake nodes (TACC Stampede2)

SLIDE 22

  • ResNet-1001 with variable batch size
  • Approach:

– 48 model-partitions for 56 cores
– 512 model-replicas for 512 nodes
– Total cores: 48 x 512 = 24,576

  • Speedup

– 253X on 256 nodes
– 481X on 512 nodes

  • Scaling Efficiency

– 98% up to 256 nodes
– 93.9% for 512 nodes (see the arithmetic check below)
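A quick arithmetic check of the quoted numbers (nothing here beyond the figures on this slide):

```python
# Scaling-efficiency arithmetic for the numbers quoted above.
total_cores = 48 * 512            # 24,576 cores in total
eff_256 = 253 / 256               # ~0.988, reported as 98% up to 256 nodes
eff_512 = 481 / 512               # ~0.939, i.e. 93.9% on 512 nodes
print(total_cores, round(eff_256, 3), round(eff_512, 3))
```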

HyPar-Flow at Scale (512 nodes on TACC Frontera)

481x speedup on 512 Intel Xeon Skylake nodes (TACC Frontera)

SLIDE 23

  • Introduction and Motivation
  • Problems and Challenges
  • Key Contribution
  • Performance Evaluation
  • Conclusion

Agenda

SLIDE 24

  • In-depth analysis of Data/Model/Hybrid parallelism

– The need for model/hybrid parallelism -- larger models

  • Proposed and Designed HyPar-Flow

– Flexible and user-transparent system
– Leverages existing technologies instead of reinventing anything
– Keras, TensorFlow, and MPI for flexibility and scalability

  • Performance Evaluation on large systems

– Three HPC clusters including Frontera at TACC (#5 on Top500)
– Three DNNs with diverse requirements and sizes (VGG, ResNet-110/1k, and AmoebaNet)
– 93% scaling efficiency on 512 nodes (Frontera)

Conclusion

SLIDE 25

GEMS: GPU-Enabled Memory-Aware Model-Parallelism System for Distributed DNN Training

Arpan Jain, Ammar A. Awan, Asmaa M. Aljuhani, Jahanzeb M. Hashmi, Quentin G. Anthony, Hari Subramoni, Dhabaleswar K. Panda, Raghu Machiraju, and Anil Parwani
Network Based Computing Laboratory (NBCL)
Dept. of Computer Science and Engineering, The Ohio State University

{jain.575, awan.10, aljuhani.2, hashmi.29, anthony.301, subramoni.1, panda.2, machiraju.1}@osu.edu, anil.parwani@osumc.edu

SLIDE 26

  • Introduction and Motivation
  • Problems and Challenges
  • Key Contributions
  • Performance Evaluation
  • Conclusion

Agenda

SLIDE 27

  • Data-Parallelism – only for models that fit in memory
  • Out-of-core models

– Deeper model → better accuracy, but more memory required!

  • Model parallelism can work for out-of-core models!
  • Performance is questionable!

Why Model Parallelism?

SLIDE 28

Digital Pathology

[Images: a whole slide image (WSI); a tile at 10x magnification level; a tile at 20x magnification level]

  • Whole Slide Images (WSI)

– Replacing the glass slide for diagnostic purposes
– Typically, 100,000 x 100,000 pixels in size

SLIDE 29

  • Introduction and Motivation
  • Problems and Challenges
  • Key Contributions
  • Performance Evaluation
  • Conclusion

Agenda

SLIDE 30

Why do we need memory-aware designs?

– Data- and model-parallel training have limitations!
– Maximum batch size depends on the available memory.
– Basic model parallelism suffers from underutilization of memory and compute

Problem with Model Parallelism

Memory requirement increases with the increase in image size!

SLIDE 31

Research Challenges

Meet GEMS!

Challenge-1: GPU-based Communication in TensorFlow
Challenge-2: Memory management in TensorFlow
Challenge-3: Scaling Memory-Aware solutions

SLIDE 32

  • Introduction and Motivation
  • Problems and Challenges
  • Key Contributions
  • Performance Evaluation
  • Conclusion

Agenda

SLIDE 33

  • Propose, Design, and Evaluate GEMS: an integrated system that provides memory-efficient model parallel training and scalable hybrid parallel training.

  • Propose several design schemes

– Basic Model Parallelism (GEMS-Basic)
– Memory Aware Synchronized Training (GEMS-MAST)
– Memory Aware Synchronized Training with Enhanced Replications (GEMS-MASTER)
– Combination of Model and Data Parallel Training (GEMS-Hybrid)

  • Enabled training of High-level TCV classifier on 1024 x 1024 image tiles
  • Reduced training time from 7.25 hours to 28 minutes for out-of-core training on 128 Volta V100 GPUs

Key Contributions

SLIDE 34

GEMS-MAST: Memory Aware Synchronized Training

  • GEMS-MAST

– Uses free memory and compute available between training steps
– Leverages performance of MPI pt-to-pt. and collectives for communication

SLIDE 35

GEMS-MAST vs GEMS-MASTER

SLIDE 36

GEMS-Hybrid

  • GEMS-HY MAST & GEMS-HY MASTER

– Segmented Allreduce (an illustrative chunked-allreduce sketch follows)
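The slide only names the technique; one common reading of "Segmented Allreduce" (an assumption here, not necessarily GEMS's exact scheme) is to allreduce a large flattened gradient buffer in fixed-size chunks:

```python
# Chunked ("segmented") allreduce sketch: reduce a large gradient buffer in fixed-size segments.
# An illustrative reading of "Segmented Allreduce", not GEMS's actual implementation.
# Run with: mpirun -np 4 python segmented_allreduce.py  (requires mpi4py and numpy)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
SEGMENT = 1 << 20                                       # elements per segment (tunable)

grads = np.random.rand(5_000_000).astype(np.float32)    # stand-in for flattened gradients
summed = np.empty_like(grads)

for start in range(0, grads.size, SEGMENT):
    end = min(start + SEGMENT, grads.size)
    comm.Allreduce(grads[start:end], summed[start:end], op=MPI.SUM)  # one segment at a time

summed /= comm.Get_size()                               # average across data-parallel replicas
```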

SLIDE 37

  • Introduction and Motivation
  • Problems and Challenges
  • Key Contributions
  • Performance Evaluation
  • Conclusion

Agenda

SLIDE 38

  • System

– Lassen at Lawrence Livermore National Laboratory (LLNL)

  • POWER9 processor
  • 4 NVIDIA Volta V100 GPUs per node
  • Interconnect

– X Bus to connect two NUMA nodes
– NVLink is used to connect GPU-GPU and GPU-Processor
– InfiniBand EDR

  • TensorFlow v1.14, MVAPICH2-GDR 2.3.3
  • We use and modify model definitions for ResNet(s) from keras.applications

Evaluation Setup

SLIDE 39

  • Evaluated ResNet with different numbers of layers
  • Effective Batch Size: 1
  • Image Size: 1024 x 1024
  • Speedup

– ResNet-110: 1.19x
– ResNet-164: 1.31x
– ResNet-218: 1.36x
– ResNet-326: 1.36x

  • Theoretically, the maximum speedup possible with GEMS-MAST is always less than 1.5x

GEMS-MAST


SLIDE 40

  • Enables training with any batch size on same number of resources
  • Speedup

– ResNet-164 on 1024 x 1024 image size: up to 1.83x
– ResNet-1k on 512 x 512 image size: up to 1.85x

GEMS-MASTER


  • Proposed an analytical model to calculate the amount of improvement possible

[Charts: results on 4 GPUs and 8 GPUs]

SLIDE 41

GEMS-Hybrid Scalability

  • Scaled ResNet-1k to 1024 GPUs
  • Image Size: 512 X 512
  • Trainable on 8 GPUs with EBS=1
  • Number of DP replicas: 128
  • Speedup

– Ideal: 128x
– GEMS-HY MAST: 89x
– GEMS-HY MASTER: 124.58x

  • Scaling Efficiency

– GEMS-HY MASTER: 97.32%

[Chart: Images per sec (higher is better) vs. number of GPUs (8 to 1024) for GEMS-HY Basic, GEMS-HY MAST, and GEMS-HY MASTER]

SLIDE 42

  • Pathology whole slide image (WSI)

– Each WSI = 100,000 x 100,000 pixels
– Cannot fit in a single GPU's memory
– Tiles are extracted to make training possible (an illustrative tiling sketch follows)
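For illustration only (not GEMS code), extracting fixed-size tiles from a large image array can be sketched as below; real WSIs are read region-by-region with a pyramidal reader such as OpenSlide rather than being loaded whole.

```python
# Illustrative tiling of a large image array into fixed-size tiles (not GEMS code).
import numpy as np

def extract_tiles(image: np.ndarray, tile: int = 1024):
    """Yield non-overlapping tile x tile crops that fully fit inside image (H, W, C)."""
    h, w = image.shape[:2]
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            yield image[y:y + tile, x:x + tile]

# Small synthetic example; a real WSI (~100,000 x 100,000 px) would be processed
# tile-by-tile straight from the slide file instead of being materialized in memory.
img = np.zeros((4096, 4096, 3), dtype=np.uint8)
tiles = list(extract_tiles(img, tile=1024))
print(len(tiles))   # 16 tiles of 1024 x 1024
```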

  • Two main problems with tiles

– Restricted tile size because of GPU memory limitation
– Smaller tiles lose structural information

  • Reduced training time significantly

– GEMS-Basic: 7.25 hours (1 node, 4 GPUs)
– GEMS-MAST: 6.28 hours (1 node, 4 GPUs)
– GEMS-MASTER: 4.21 hours (1 node, 4 GPUs)
– GEMS-Hybrid: 0.46 hours (32 nodes, 128 GPUs)

Exploiting GEMS in AI-Driven Digital Pathology

Courtesy: https://blog.kitware.com/digital-slide-archive-large-image-and-histomicstk-open-source-informatics-tools-for-management-visualization-and-analysis-of-digital-histopathology-data/

SLIDE 43

  • Introduction and Motivation
  • Problems and Challenges
  • Key Contributions
  • Performance Evaluation
  • Conclusion

Agenda

SLIDE 44

  • Proposed and Designed GEMS

– Proposed memory-aware designs

  • GEMS-MAST & GEMS-MASTER

– GEMS-Hybrid to scale GEMS to multiple nodes
– Keras, TensorFlow, and MPI for flexibility and scalability

  • Performance Evaluation on large systems

– Up to 1.85x speedup for out-of-core DNNs
– Scaled GEMS to 1024 V100 GPUs on LLNL Lassen
– Achieved 97.32% scaling efficiency with GEMS-Hybrid

  • Future Work

– Use GEMS to train out-of-core DNNs on larger image tiles
– Extend GEMS to PyTorch

Conclusion

SLIDE 45

Thank You!

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
High Performance Deep Learning: http://hidl.cse.ohio-state.edu/

{jain.575, awan.10, aljuhani.2, hashmi.29, anthony.301, subramoni.1, panda.2, machiraju.1}@osu.edu, anil.parwani@osumc.edu

The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/