SLIDE 1

HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training with TensorFlow

Ammar Ahmad Awan, Arpan Jain, Quentin Anthony, Hari Subramoni, and Dhabaleswar K. Panda
Network Based Computing Laboratory (NBCL)
Dept. of Computer Science and Engineering, The Ohio State University

{awan.10, jain.575, anthony.301, subramoni.1, panda.2}@osu.edu

SLIDE 2

  • Introduction and Motivation
  • Problems and Challenges
  • Key Contribution
  • Performance Characterization
  • Conclusion

Agenda

SLIDE 3

The Deep Learning (DL) Revolution

Source: https://thenewstack.io/demystifying-deep-learning-and-artificial-intelligence/

Adopted from: http://www.deeplearningbook.org/contents/intro.html

[Figure: nested view of AI ⊃ Machine Learning (ML) ⊃ Deep Learning (DL); examples: Logistic Regression (ML); MLPs, DNNs (DL)]

  • Deep Learning – A technique to achieve Artificial Intelligence

– Uses Deep Neural Networks

SLIDE 4

Deep Learning meets Super Computers

[Chart: Accelerator/CP Family Performance Share, www.top500.org]

  • NVIDIA GPUs - major force for accelerating DL workloads

– Computational requirement is increasing exponentially

Courtesy: https://openai.com/blog/ai-and-compute/

SLIDE 5

  • Data parallelism

– Horovod: TensorFlow, PyTorch, and MXNet (a minimal Horovod sketch follows this list)
– TensorFlow: tf.distribute.Strategy API
– PyTorch: torch.nn.parallel.DistributedDataParallel

  • Model-parallelism and Hybrid-parallelism

– No framework-level support (LBANN is the only framework with built-in support)
– Higher-level frameworks: GPipe, Mesh-TensorFlow, etc.
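As a concrete reference for the data-parallel APIs listed above, here is a minimal Horovod + Keras sketch (my own illustration with a toy model and synthetic data; launch with e.g. mpirun -np 4 python train_dp.py):

```python
# Minimal Horovod data-parallel training sketch (assumes TF 2.x and Horovod are installed).
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                             # one process per rank (GPU/CPU)

# Toy model and synthetic data; any Keras model is handled the same way.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
x = tf.random.uniform((512, 32))
y = tf.random.uniform((512,), maxval=10, dtype=tf.int32)

opt = tf.keras.optimizers.SGD(0.01 * hvd.size())       # scale learning rate by replica count
opt = hvd.DistributedOptimizer(opt)                    # allreduce gradients across ranks
model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]   # sync initial weights
model.fit(x, y, batch_size=64, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```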

How to make Training Faster?

SLIDE 6

  • Data Parallelism (most common)
  • Model and Hybrid Parallelism (emerging)
  • ‘X’-Parallelism

– ‘X’ → Spatial, Channel, Filter, etc.

Distributed/Parallel Training Strategies for DNNs

[Figure: Model Parallelism vs. Data Parallelism vs. Hybrid (Model and Data) Parallelism]

Courtesy: http://engineering.skymind.io/distributed-deep-learning-part-1-an-introduction-to-distributed-training-of-neural-networks

SLIDE 7

  • Data Parallelism – only for models that fit in memory
  • Out-of-core models

– Deeper model → better accuracy, but more memory required!

  • Model parallelism can work for out-of-core models!
  • Designing a system for model-parallelism is challenging

Why Model Parallelism?

SLIDE 8

  • Introduction and Motivation
  • Problems and Challenges
  • Key Contribution
  • Performance Characterization
  • Conclusion

Agenda

SLIDE 9

  • Defining a distributed model -- necessary but difficult

– requires knowledge of the model, communication library, and distributed hardware

  • Implementing distributed forward/back-propagation

– needed because partitions reside in different memory spaces and need explicit communication (see the two-rank sketch after this list)

  • Obtaining parallel speedup on an inherently sequential task

– forward pass followed by a backward pass
– Limited opportunity for parallelism and scalability

  • Achieving scalability without losing out on a model’s accuracy

– Valid concern for all types of parallelism strategies
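To make the explicit-communication requirement concrete, here is a minimal two-rank sketch (my own illustration using mpi4py and TF eager execution, not HyPar-Flow's code): rank 0 holds the first partition and sends its activations forward; rank 1 holds the second partition, computes a toy loss, and sends the gradient of the received activations back.

```python
# Two-process illustration of distributed forward/back-propagation across partitions.
# Run with: mpirun -np 2 python mp_sketch.py  (requires mpi4py and TensorFlow 2.x)
import numpy as np
import tensorflow as tf
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:                                     # partition 0: input -> hidden
    part = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu")])
    x = tf.random.uniform((32, 16))
    with tf.GradientTape() as tape:
        act = part(x)                             # forward pass of this partition
    comm.Send(np.ascontiguousarray(act.numpy()), dest=1, tag=0)   # ship activations forward
    grad_act = np.empty((32, 64), dtype=np.float32)
    comm.Recv(grad_act, source=1, tag=1)          # receive d(loss)/d(activations)
    grads = tape.gradient(act, part.trainable_variables,
                          output_gradients=tf.constant(grad_act))
    # 'grads' would now be handed to an optimizer for partition 0's weights
else:                                             # partition 1: hidden -> output
    part = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    act = np.empty((32, 64), dtype=np.float32)
    comm.Recv(act, source=0, tag=0)               # receive activations from partition 0
    act_t = tf.constant(act)
    with tf.GradientTape() as tape:
        tape.watch(act_t)
        loss = tf.reduce_mean(tf.square(part(act_t)))    # toy loss for illustration
    g = tape.gradient(loss, [act_t] + part.trainable_variables)
    comm.Send(np.ascontiguousarray(g[0].numpy()), dest=0, tag=1)  # send gradient back
```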

Major Problems

SLIDE 10

Research Challenges

Meet HyPar-Flow!

Challenge-1: Model-Definition APIs and Framework-specific Features
Challenge-2: Communication between Partitions and Replicas
Challenge-3: Applying HPC Techniques to Improve Performance

SLIDE 11

  • Introduction and Motivation
  • Problems and Challenges
  • Key Contribution
  • Performance Evaluation
  • Conclusion

Agenda

SLIDE 12

Key Contribution: Propose, Design, and Evaluate HyPar-Flow

  • HyPar-Flow is practical (easy-to-use) and high-performance (uses MPI)

– Based on Keras models and exploits TF 2.0 Eager Execution
– Leverages performance of MPI pt-to-pt. and collectives for communication (a hypothetical usage sketch follows)
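The slide does not show the programming interface itself; the sketch below is only a hypothetical illustration of the "practical (easy-to-use)" claim, i.e. an unmodified Keras model handed to a hybrid-parallel, fit-style call. The hyparflow module name and the call signature are assumptions for illustration, not the published API.

```python
# Hypothetical usage sketch only: module name and call signature are assumed,
# not taken from the HyPar-Flow slides.
import tensorflow as tf
# import hyparflow as hf                     # assumed package name (illustrative)

# The user writes a completely standard Keras model ...
model = tf.keras.applications.ResNet50(weights=None, classes=1000)

# ... and, conceptually, a hybrid-parallel trainer partitions it across ranks,
# replicates it across nodes, and handles all MPI communication internally:
#
#   hf.fit(model, train_dataset,
#          num_partitions=48,                # model-parallel partitions per replica
#          num_replicas=512)                 # data-parallel replicas
#
# Launch: mpirun -np <num_partitions * num_replicas> python train.py
```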

SLIDE 13

HyPar-Flow: Overview

SLIDE 14

  • Model Generator is crucial for productivity
  • Load Balancer is crucial for performance
  • Trainer – Core of Back-propagation

– Several system-level challenges
– Communication of tensors
– Blocking or non-blocking
– Efficient pipelining is needed (see the pipelining sketch after this list)

  • Communication Engine

– Isolate communication interfaces
– Unified Data, Model, and Hybrid Parallelism
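A small sketch of the non-blocking/pipelining point above (my own mpi4py illustration, not HyPar-Flow code): a partition posts non-blocking sends of per-micro-batch activations so that communication overlaps with the next micro-batch's compute.

```python
# Pipelining sketch: overlap communication of one micro-batch with compute of the next.
# Run with: mpirun -np 2 python pipeline_sketch.py  (requires mpi4py and numpy)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
MICRO_BATCHES, SHAPE = 4, (8, 64)

if rank == 0:
    bufs, reqs = [], []
    for i in range(MICRO_BATCHES):
        act = np.random.rand(*SHAPE).astype(np.float32)  # stand-in for a forward pass
        bufs.append(act)                                  # keep buffer alive until send completes
        reqs.append(comm.Isend(act, dest=1, tag=i))       # non-blocking send, overlaps with next loop
    MPI.Request.Waitall(reqs)                             # drain outstanding sends
elif rank == 1:
    for i in range(MICRO_BATCHES):
        buf = np.empty(SHAPE, dtype=np.float32)
        comm.Recv(buf, source=0, tag=i)                   # consume micro-batch i
        _ = buf.sum()                                     # stand-in for the next partition's compute
```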

HyPar-Flow: Components

SLIDE 15

Special Handling for Models with Skip Connections

SLIDE 16

  • Introduction and Motivation
  • Problems and Challenges
  • Key Contribution
  • Performance Characterization
  • Conclusion

Agenda

SLIDE 17

  • 3 Systems

– Frontera at Texas Advanced Computing Center (TACC)
– Stampede2 (Skylake partition) at TACC
– AMD EPYC: Local system with dual-socket AMD EPYC 7551 32-core processors

  • Interconnect

– Frontera -- Mellanox InfiniBand HDR-100 HCAs
– Stampede2 -- Intel Omni-Path HFIs

  • TensorFlow v1.13, MVAPICH2 2.3.2 on Frontera and EPYC, and Intel MPI 2018 on Stampede2
  • We use and modify model definitions for VGG and ResNet(s) from keras.applications

Evaluation Setup

SLIDE 18

  • The following variants have been compared:

– SEQ (GT) - Sequential using tf.GradientTape (GT); a minimal example of such a training step follows this list
– SEQ (MF) - Sequential using model.fit (MF)
– SEQ (MF-E) - Sequential using model.fit (MF) and (E)ager Execution
– HF-MP (2)/(56) - HyPar-Flow model-parallel with 2/48 model-partitions
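For reference, SEQ (GT) denotes a hand-written training step of the following kind, shown here as a minimal tf.GradientTape sketch with a placeholder model and synthetic data:

```python
# Minimal tf.GradientTape training step of the kind SEQ (GT) refers to (TF 2.x, eager).
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                             tf.keras.layers.Dense(10)])
opt = tf.keras.optimizers.SGD(0.01)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

x = tf.random.uniform((128, 32))                           # synthetic batch
y = tf.random.uniform((128,), maxval=10, dtype=tf.int32)

with tf.GradientTape() as tape:
    logits = model(x, training=True)                       # forward pass
    loss = loss_fn(y, logits)
grads = tape.gradient(loss, model.trainable_variables)     # back-propagation
opt.apply_gradients(zip(grads, model.trainable_variables)) # parameter update
```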

Verifying the Correctness of HyPar-Flow

[Plots: VGG-16, ResNet-110, and ResNet-1k]

SLIDE 19

  • ResNet-1k -- scales with batch size on one node as well as two nodes
  • Reason for scaling?

– Counter-intuitive for model-parallelism to scale better than data-parallelism
– Poor CPU implementation?

Model/Hybrid Parallelism on single/two nodes

SLIDE 20

  • AmoebaNet -- different architecture compared to ResNet(s)
  • More branches and skip connections
  • Scales well using HyPar-Flow
  • Memory-hungry, so single node restricted to BatchSize=64

Hybrid Parallelism for AmoebaNet

SLIDE 21

  • CPU based results

– AMD EPYC
– Intel Xeon

  • Excellent speedups for

– VGG-19
– ResNet-110
– ResNet-1000 (1k layers)

  • Able to train “future” models

– E.g. ResNet-5000 (a synthetic 5000-layer model we benchmarked)

HyPar-Flow (HF): Flexibility and Scalability

110x speedup on 128 Intel Xeon Skylake nodes (TACC Stampede2)

SLIDE 22

  • ResNet-1001 with variable batch size
  • Approach:

– 48 model-partitions for 56 cores
– 512 model-replicas for 512 nodes
– Total cores: 48 x 512 = 24,576

  • Speedup

– 253X on 256 nodes
– 481X on 512 nodes

  • Scaling Efficiency

– 98% up to 256 nodes
– 93.9% for 512 nodes (see the arithmetic check below)
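A quick arithmetic check of the quoted numbers (nothing here beyond the figures on this slide):

```python
# Scaling-efficiency arithmetic for the numbers quoted above.
total_cores = 48 * 512            # 24,576 cores in total
eff_256 = 253 / 256               # ~0.988, reported as 98% up to 256 nodes
eff_512 = 481 / 512               # ~0.939, i.e. 93.9% on 512 nodes
print(total_cores, round(eff_256, 3), round(eff_512, 3))
```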

HyPar-Flow at Scale (512 nodes on TACC Frontera)

481x speedup on 512 Intel Xeon Skylake nodes (TACC Frontera)

SLIDE 23

  • Introduction and Motivation
  • Problems and Challenges
  • Key Contribution
  • Performance Evaluation
  • Conclusion

Agenda

SLIDE 24

  • In-depth analysis of Data/Model/Hybrid parallelism

– The need for model/hybrid parallelism -- larger models

  • Proposed and Designed HyPar-Flow

– Flexible and user-transparent system
– Leverages existing technologies instead of reinventing anything
– Keras, TensorFlow, and MPI for flexibility and scalability

  • Performance Evaluation on large systems

– Three HPC clusters including Frontera at TACC (#5 on Top500)
– Three DNNs with diverse requirements and sizes (VGG, ResNet-110/1k, and AmoebaNet)
– 93% scaling efficiency on 512 nodes (Frontera)

Conclusion

SLIDE 25

GEMS: GPU-Enabled Memory-Aware Model-Parallelism System for Distributed DNN Training

Arpan Jain, Ammar A. Awan, Asmaa M. Aljuhani, Jahanzeb M. Hashmi, Quentin G. Anthony, Hari Subramoni, Dhabaleswar K. Panda, Raghu Machiraju, and Anil Parwani
Network Based Computing Laboratory (NBCL)
Dept. of Computer Science and Engineering, The Ohio State University

{jain.575, awan.10, aljuhani.2, hashmi.29, anthony.301, subramoni.1, panda.2, machiraju.1}@osu.edu, anil.parwani@osumc.edu

SLIDE 26

  • Introduction and Motivation
  • Problems and Challenges
  • Key Contributions
  • Performance Evaluation
  • Conclusion

Agenda

SLIDE 27

  • Data-Parallelism – only for models that fit in memory
  • Out-of-core models

– Deeper model → better accuracy, but more memory required!

  • Model parallelism can work for out-of-core models!
  • Performance is questionable!

Why Model Parallelism?

SLIDE 28

Digital Pathology

[Images: a whole slide image (WSI); a tile at 10x magnification level; a tile at 20x magnification level]

  • Whole Slide Images (WSI)

– Replacing the glass slide for diagnostic purposes
– Typically, 100,000 x 100,000 pixels in size

SLIDE 29

  • Introduction and Motivation
  • Problems and Challenges
  • Key Contributions
  • Performance Evaluation
  • Conclusion

Agenda

SLIDE 30

Why do we need memory-aware designs?

– Data- and model-parallel training have limitations!
– Maximum batch size depends on the available memory.
– Basic model parallelism suffers from underutilization of memory and compute

Problem with Model Parallelism

Memory requirement increases with the increase in image size!

SLIDE 31

Research Challenges

Meet GEMS!

Challenge-1: GPU-based Communication in TensorFlow
Challenge-2: Memory management in TensorFlow
Challenge-3: Scaling Memory-Aware solutions

SLIDE 32

  • Introduction and Motivation
  • Problems and Challenges
  • Key Contributions
  • Performance Evaluation
  • Conclusion

Agenda

SLIDE 33

  • Propose, Design, and Evaluate GEMS: an integrated system that provides memory-efficient model parallel training and scalable hybrid parallel training.

  • Propose several design schemes

– Basic Model Parallelism (GEMS-Basic)
– Memory Aware Synchronized Training (GEMS-MAST)
– Memory Aware Synchronized Training with Enhanced Replications (GEMS-MASTER)
– Combination of Model and Data Parallel Training (GEMS-Hybrid)

  • Enabled training of High-level TCV classifier on 1024 x 1024 image tiles
  • Reduced training time from 7.25 hours to 28 minutes for out-of-core training on 128 Volta V100 GPUs

Key Contributions

SLIDE 34

GEMS-MAST: Memory Aware Synchronized Training

  • GEMS-MAST

– Uses free memory and compute available between training steps
– Leverages performance of MPI pt-to-pt. and collectives for communication

SLIDE 35

GEMS-MAST vs GEMS-MASTER

SLIDE 36

GEMS-Hybrid

  • GEMS-HY MAST & GEMS-HY MASTER

– Segmented Allreduce (an illustrative chunked-allreduce sketch follows)
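The slide only names the technique; one common reading of "Segmented Allreduce" (an assumption here, not necessarily GEMS's exact scheme) is to allreduce a large flattened gradient buffer in fixed-size chunks:

```python
# Chunked ("segmented") allreduce sketch: reduce a large gradient buffer in fixed-size segments.
# An illustrative reading of "Segmented Allreduce", not GEMS's actual implementation.
# Run with: mpirun -np 4 python segmented_allreduce.py  (requires mpi4py and numpy)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
SEGMENT = 1 << 20                                       # elements per segment (tunable)

grads = np.random.rand(5_000_000).astype(np.float32)    # stand-in for flattened gradients
summed = np.empty_like(grads)

for start in range(0, grads.size, SEGMENT):
    end = min(start + SEGMENT, grads.size)
    comm.Allreduce(grads[start:end], summed[start:end], op=MPI.SUM)  # one segment at a time

summed /= comm.Get_size()                               # average across data-parallel replicas
```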

SLIDE 37

  • Introduction and Motivation
  • Problems and Challenges
  • Key Contributions
  • Performance Evaluation
  • Conclusion

Agenda

SLIDE 38

  • System

– Lassen at Lawrence Livermore National Laboratory (LLNL)

  • POWER9 processor
  • 4 NVIDIA Volta V100 GPUs per node
  • Interconnect

– X Bus to connect two NUMA nodes
– NVLink is used to connect GPU-GPU and GPU-Processor
– InfiniBand EDR

  • TensorFlow v1.14, MVAPICH2-GDR 2.3.3
  • We use and modify model definitions for ResNet(s) from keras.applications

Evaluation Setup

SLIDE 39

  • Evaluated ResNet with different numbers of layers
  • Effective Batch Size: 1
  • Image Size: 1024 x 1024
  • Speedup

– ResNet-110: 1.19x
– ResNet-164: 1.31x
– ResNet-218: 1.36x
– ResNet-326: 1.36x

  • Theoretically, the maximum speedup possible with GEMS-MAST is always less than 1.5x

GEMS-MAST


SLIDE 40

  • Enables training with any batch size on same number of resources
  • Speedup

– ResNet-164 on 1024 x 1024 image size: up to 1.83x
– ResNet-1k on 512 x 512 image size: up to 1.85x

GEMS-MASTER


  • Proposed an analytical model to calculate the amount of improvement possible

[Charts: results on 4 GPUs and 8 GPUs]

SLIDE 41

GEMS-Hybrid Scalability

  • Scaled ResNet-1k to 1024 GPUs
  • Image Size: 512 X 512
  • Trainable on 8 GPUs with EBS=1
  • Number of DP replicas: 128
  • Speedup

– Ideal: 128x
– GEMS-HY MAST: 89x
– GEMS-HY MASTER: 124.58x

  • Scaling Efficiency

– GEMS-HY MASTER: 97.32%

[Chart: Images per sec (higher is better) vs. number of GPUs (8 to 1024) for GEMS-HY Basic, GEMS-HY MAST, and GEMS-HY MASTER]

SLIDE 42

  • Pathology whole slide image (WSI)

– Each WSI = 100,000 x 100,000 pixels
– Cannot fit in a single GPU's memory
– Tiles are extracted to make training possible (an illustrative tiling sketch follows)
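For illustration only (not GEMS code), extracting fixed-size tiles from a large image array can be sketched as below; real WSIs are read region-by-region with a pyramidal reader such as OpenSlide rather than being loaded whole.

```python
# Illustrative tiling of a large image array into fixed-size tiles (not GEMS code).
import numpy as np

def extract_tiles(image: np.ndarray, tile: int = 1024):
    """Yield non-overlapping tile x tile crops that fully fit inside image (H, W, C)."""
    h, w = image.shape[:2]
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            yield image[y:y + tile, x:x + tile]

# Small synthetic example; a real WSI (~100,000 x 100,000 px) would be processed
# tile-by-tile straight from the slide file instead of being materialized in memory.
img = np.zeros((4096, 4096, 3), dtype=np.uint8)
tiles = list(extract_tiles(img, tile=1024))
print(len(tiles))   # 16 tiles of 1024 x 1024
```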

  • Two main problems with tiles

– Restricted tile size because of GPU memory limitation
– Smaller tiles lose structural information

  • Reduced training time significantly

– GEMS-Basic: 7.25 hours (1 node, 4 GPUs)
– GEMS-MAST: 6.28 hours (1 node, 4 GPUs)
– GEMS-MASTER: 4.21 hours (1 node, 4 GPUs)
– GEMS-Hybrid: 0.46 hours (32 nodes, 128 GPUs)

Exploiting GEMS in AI-Driven Digital Pathology

Courtesy: https://blog.kitware.com/digital-slide-archive-large-image-and-histomicstk-open-source-informatics-tools-for-management-visualization-and-analysis-of-digital-histopathology-data/

SLIDE 43

  • Introduction and Motivation
  • Problems and Challenges
  • Key Contributions
  • Performance Evaluation
  • Conclusion

Agenda

SLIDE 44

  • Proposed and Designed GEMS

– Proposed memory-aware designs

  • GEMS-MAST & GEMS-MASTER

– GEMS-Hybrid to scale GEMS to multiple nodes
– Keras, TensorFlow, and MPI for flexibility and scalability

  • Performance Evaluation on large systems

– Up to 1.85x speedup for out-of-core DNNs
– Scaled GEMS to 1024 V100 GPUs on LLNL Lassen
– Achieved 97.32% scaling efficiency with GEMS-Hybrid

  • Future Work

– Use GEMS to train out-of-core DNNs on larger image tiles
– Extend GEMS to PyTorch

Conclusion

SLIDE 45

Thank You!

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
High Performance Deep Learning: http://hidl.cse.ohio-state.edu/

{jain.575, awan.10, aljuhani.2, hashmi.29, anthony.301, subramoni.1, panda.2, machiraju.1}@osu.edu, anil.parwani@osumc.edu

The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/