Scalable Distributed Training with Parameter Hub: a whirlwind tour



SLIDE 1

Scalable Distributed Training with Parameter Hub: a whirlwind tour

SLIDE 2

TVM Stack

[Diagram: the TVM stack — High-Level Differentiable IR → Tensor Expression IR → LLVM, CUDA, Metal, and VTA backends — targeting edge FPGA, cloud FPGA, and ASIC hardware, with AutoTVM and AutoVTA driving optimization across the stack.]

SLIDE 3


Active Topology Probing

Your Cloud

Groundwork for bringing TVM to the distributed world for training and inference, on a commercial cloud or in your own cluster.

SLIDE 4

Parameter Hub

Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, and Arvind Krishnamurthy

An optimized, topology-aware, dynamic mechanism for inter-machine communication.*

* In the cloud-based training context.

SLIDE 6

Deep learning constitutes an important workload in the cloud today. Major cloud providers all have an ecosystem for learning in the cloud.

SLIDE 8

Server demand for DL inference across data centers nearly quadrupled in less than two years (source: Facebook).


SLIDE 10

EC2 reclaims your GPU instances as it runs out of capacity.


SLIDE 12

Distributed Training

INDEPENDENT FORWARD/BACKWARD PASSES + COORDINATED PARAMETER EXCHANGE

[Timeline: Worker 1 and Worker 2 each run (F)orward and (B)ackward passes independently; the Parameter Server performs (A)ggregation and (O)ptimization between iterations before the workers proceed.]
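To make the timeline concrete, here is a minimal single-process sketch of the (F)/(B)/(A)/(O) loop. Everything in it — the constant fake gradients, the plain-SGD optimizer, the function names — is illustrative, not PHub's actual code.

// Minimal single-process simulation of PS-based data-parallel training.
#include <cstdio>
#include <vector>

using Grad = std::vector<float>;

// (A)ggregation on the PS: element-wise sum of all workers' gradients.
Grad aggregate(const std::vector<Grad>& worker_grads) {
    Grad sum(worker_grads[0].size(), 0.0f);
    for (const Grad& g : worker_grads)
        for (size_t i = 0; i < g.size(); ++i) sum[i] += g[i];
    return sum;
}

// (O)ptimization on the PS: a plain SGD step on the shared parameters.
void optimize(std::vector<float>& params, const Grad& agg, float lr) {
    for (size_t i = 0; i < params.size(); ++i) params[i] -= lr * agg[i];
}

int main() {
    const int n_workers = 2;
    std::vector<float> params(4, 1.0f);  // the shared model
    for (int iter = 0; iter < 3; ++iter) {
        // (F)orward/(B)ackward run independently on each worker; here we
        // fake the result with a constant gradient per worker.
        std::vector<Grad> grads(n_workers, Grad(params.size(), 0.1f));
        optimize(params, aggregate(grads), /*lr=*/0.01f);  // PS side
        std::printf("iter %d: params[0] = %f\n", iter, params[0]);
    }
}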
SLIDE 14

Distributed Training Today

IN THE CONTEXT OF THE CLOUD

[Diagram: a network core connects top-of-rack (ToR) switches; each rack holds machines, some with GPUs.]
SLIDE 15

Distributed Training Today

FORWARD AND BACKWARD PASSES IN WORKER

[Diagram: Worker 1 and PS 1 sit in one rack, Worker 2 and PS 2 in another, each behind a ToR switch; the racks connect through the network core.]
SLIDE 16

Distributed Training Today

AGGREGATION AND OPTIMIZATION IN PS

SLIDE 17

Distributed training is communication bound

[Chart: per-iteration time (seconds, 0–1.8) for ResNet 269 across GPU generations — GRID 520 (2012), K80 (2014), M60 (2015), V100 (2017) — split into time with GPU and network active vs. GPU idle, waiting on the network.]

  • The problem gets worse over time: the bottleneck is shifting.
  • With modern GPUs, most of the time is spent in communication.
  • Making GPUs faster will do little to increase throughput; it just wastes compute resources.


SLIDE 18

Distributed training is communication bound

[Chart: the same compute vs. communication breakdown for AlexNet, GoogleNet, Inception V3, and ResNet 269.]

slide-19
SLIDE 19

Bottlenecks in DDNN training


MAPPING OF TRAINING WORKLOAD TO THE CLOUD IS INEFFICIENT.

SLIDE 21

Bottlenecks in DDNN training


FRAMEWORK BOTTLENECKS

[Diagram: inside each machine, gradients flow from the GPU through the training framework to the network.]
SLIDE 23

Bottlenecks in DDNN training

FRAMEWORK BOTTLENECKS

[Chart: per-iteration time (seconds, 0–1.6) for AlexNet, GoogleNet, Inception, and ResNet 269, broken into compute, data copy and communication, aggregator, optimizer, and synchronization and other overheads.]
SLIDE 25

Bottlenecks in DDNN training


BANDWIDTH BOTTLENECK

SLIDE 30

Bottlenecks in Cloud-based DDNN training

INSUFFICIENT BANDWIDTH


What is the minimum bandwidth each popular NN requires for communication not to bottleneck computation? (8 workers, GTX 1080 Ti, central parameter servers, MXNet.)

[Chart: cloud bandwidth today is roughly 10–25 Gbps, while GoogleNet/Inception need ~40 Gbps, ResNet ~100 Gbps, and AlexNet ~1200 Gbps.]
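As a back-of-envelope check of these figures: with a central PS, every iteration each worker sends its full gradient out and pulls the full model back, so the required bandwidth is roughly 2 × model size ÷ per-iteration compute time. In the sketch below, model sizes are the well-known parameter counts × 4 bytes (FP32), but the compute times are illustrative assumptions chosen so the outputs land near the slide's figures — they are not measurements.

// Back-of-envelope: minimum per-worker bandwidth so that communication
// (send gradients + receive parameters) fully overlaps with compute.
#include <cstdio>

int main() {
    struct Net { const char* name; double grad_mb; double compute_s; };
    Net nets[] = {
        {"GoogleNet", 28.0,  0.011},   // ~7M params x 4 B; assumed time
        {"ResNet-50", 102.0, 0.016},   // ~25.5M params x 4 B; assumed time
        {"AlexNet",   244.0, 0.0033},  // ~61M params x 4 B; assumed time
    };
    for (const Net& n : nets) {
        // 2x: gradients out + updated parameters back, every iteration.
        double gbps = 2.0 * n.grad_mb * 8.0 / 1000.0 / n.compute_s;
        std::printf("%-10s needs >= %6.0f Gbps\n", n.name, gbps);
    }
}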

SLIDE 33

Bottlenecks in Cloud-based DDNN training

DEPLOYMENT-RELATED OVERHEAD

  • Transient congestion, or oversubscription by design.
  • Cross-rack communication costs more than intra-rack communication.

[Figure: pairwise bandwidth matrix between 8 hosts, color-scaled from 4 to 9 Gbps — roughly 8.9 Gbps between hosts in the same rack and 4.7 Gbps across racks. Clustering the matrix recovers Cluster 1: {1, 3, 4, 5, 7} and Cluster 2: {2, 6, 8}.]
SLIDE 34

Parameter Hub Optimizations

CODESIGNING SOFTWARE AND HARDWARE WITH CLUSTER CONFIGURATION FOR EFFICIENT CLOUD-BASED DDNN TRAINING

SLIDE 35


Eliminating framework bottlenecks:


[Diagram: the per-iteration pipeline — GPU → data copy → aggregation → optimization → network.]

PHub optimizations: streamlining the DDNN training pipeline.

SLIDE 37


Software Optimizations

[Diagram: gradient streams arrive over the network and are processed by the server's CPU out of memory.]

SLIDE 39

Software Optimizations

GRADIENT AGGREGATION AND OPTIMIZATION

How should aggregation be organized across cores? (A code sketch of the winning scheme, tall aggregation, follows slide 44.)

  • Wide Aggregation: for each input queue, launch a series of threads that cooperate on the aggregation. This is what MXNet uses. Requires synchronization.
  • Tall Aggregation: each core reads the input queues from different workers and sequentially aggregates the same portion of the gradients within each queue, writing to its own location in the output queue. Great locality; no synchronization.
  • NUMA-aware tree reduction: organize processors into a hierarchy (NUMA 0, NUMA 1) and reduce up the tree. Too much coherence and synchronization.

SLIDE 44

Software Optimizations

TALL AGGREGATION AND OPTIMIZATION

  • Chunk a gradient into a series of virtual gradients, deterministically.
  • Each virtual gradient is mapped to a particular core on the server.
  • Virtual gradients are transferred independently.
  • A chunk is processed by only a single core, maintaining maximum locality.

[Figure: the gradient array for key 0 from 8 workers, with its chunk-to-core mappings and the aggregated result.]
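A minimal sketch of tall aggregation as just described: the gradient array is chunked deterministically, each chunk ("virtual gradient") has exactly one owning core, and that core sums the corresponding slice from every worker. The chunk size and the round-robin mapping are illustrative assumptions, not PHub's actual parameters.

// Tall aggregation sketch: one core exclusively owns each chunk of a key's
// gradient, so aggregation needs no cross-core synchronization.
#include <algorithm>
#include <cstdio>
#include <vector>

constexpr size_t kChunkElems = 4;  // illustrative; real chunks are larger

// Deterministic chunk -> core mapping (illustrative round-robin).
size_t owner_core(size_t chunk_idx, size_t n_cores) {
    return chunk_idx % n_cores;
}

// Work done by one core: sum its own chunks across all workers.
void aggregate_on_core(size_t core, size_t n_cores,
                       const std::vector<std::vector<float>>& worker_grads,
                       std::vector<float>& out) {
    size_t n_chunks = (out.size() + kChunkElems - 1) / kChunkElems;
    for (size_t c = 0; c < n_chunks; ++c) {
        if (owner_core(c, n_cores) != core) continue;  // not ours
        size_t lo = c * kChunkElems;
        size_t hi = std::min(out.size(), lo + kChunkElems);
        for (size_t i = lo; i < hi; ++i) {
            float s = 0.0f;
            for (const auto& g : worker_grads) s += g[i];
            out[i] = s;  // exclusive writer: maximum locality
        }
    }
}

int main() {
    const size_t n_cores = 2, n_workers = 8, len = 10;
    std::vector<std::vector<float>> grads(n_workers,
                                          std::vector<float>(len, 1.0f));
    std::vector<float> agg(len, 0.0f);
    for (size_t core = 0; core < n_cores; ++core)  // would run in parallel
        aggregate_on_core(core, n_cores, grads, agg);
    std::printf("agg[0] = %f (expect 8)\n", agg[0]);
}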
SLIDE 47

Software Optimizations

TALL AGGREGATION AND OPTIMIZATION

When aggregation of a chunk is done, PHub:

  • optimizes the chunk on the same core that aggregated it;
  • uses FP32-level streaming aggregation and optimization to hide communication latency.

[Figure: the array for key 0 from 8 workers, aggregated and then optimized chunk by chunk.]
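A sketch of that core-affinity idea: the moment the last worker's copy of a chunk lands, the owning core immediately applies the optimizer to that chunk (plain SGD standing in for whatever optimizer is configured), instead of waiting for the whole key. All names here are illustrative.

// Streaming aggregate-then-optimize on one core: a chunk is processed end
// to end while later chunks are still arriving, hiding communication latency.
#include <cstdio>
#include <vector>

// One chunk's state on its owning core.
struct ChunkState {
    std::vector<float> agg;  // running sum across workers
    int arrivals = 0;        // worker copies received so far
};

// Called each time a worker's copy of this chunk arrives. When the last
// copy lands, the *same* core optimizes the chunk right away.
void on_chunk_arrival(ChunkState& s, const std::vector<float>& grad,
                      int n_workers, std::vector<float>& params, float lr) {
    for (size_t i = 0; i < grad.size(); ++i) s.agg[i] += grad[i];  // aggregate
    if (++s.arrivals == n_workers)                                 // last copy?
        for (size_t i = 0; i < grad.size(); ++i)                   // optimize
            params[i] -= lr * s.agg[i];
}

int main() {
    const int n_workers = 8;
    std::vector<float> params(4, 1.0f);
    ChunkState s{std::vector<float>(4, 0.0f)};
    std::vector<float> grad(4, 0.125f);
    for (int w = 0; w < n_workers; ++w)  // copies stream in one by one
        on_chunk_arrival(s, grad, n_workers, params, 0.1f);
    std::printf("params[0] = %f\n", params[0]);  // 1 - 0.1 * 1.0 = 0.9
}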
SLIDE 48


Eliminating deployment bottlenecks: PHub hierarchical reduction reduces cross-rack traffic.

SLIDE 51

Two-Phase Hierarchical Aggregation

RACK SCALE PARAMETER SERVICE

[Diagram: a rack-scale parameter service — each rack's Worker/PS machines connect through their ToR to the cluster network, with a PBox aggregator inside the rack.]
SLIDE 54

Two-Phase Hierarchical Aggregation

ADAPTING TO THE DATACENTER NETWORK TOPOLOGY

[Diagram: per-rack aggregators feed inter-rack aggregation across the cluster network.]

  • 1. Intra-rack central aggregation.
  • 2. Inter-rack aggregation.

Result: an N-times reduction in cross-rack traffic!
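A quick check of the claimed reduction: with N workers in a rack, the rack aggregator forwards one combined gradient across the core instead of N individual ones. For example (illustrative numbers):

// Cross-rack bytes per iteration, flat vs. two-phase hierarchical reduction.
#include <cstdio>

int main() {
    const double grad_gb = 0.1;      // ~100 MB model (e.g., ResNet-50)
    const int workers_per_rack = 8;  // illustrative rack size
    double flat = workers_per_rack * grad_gb;  // every worker crosses the core
    double hier = 1 * grad_gb;                 // only the rack aggregate does
    std::printf("flat: %.1f GB, hierarchical: %.1f GB (a %dx reduction)\n",
                flat, hier, workers_per_rack);
}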
SLIDE 55

Efficient DDNN Training in Commercial Cloud

ACTIVE TOPOLOGY PROBING

VMs in Azure/EC2 → DPDK-based latency probe → distance matrix → clustering algorithms → inferred network topology* → automagic schedule generation → hierarchical reduction plan.

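A minimal sketch of the clustering step in this pipeline: given the probed distance matrix, group VMs whose mutual latency falls below a threshold. The union-find threshold clustering and every number below are stand-ins; the slides do not say which clustering algorithm PHub actually uses.

// Group VMs by probed latency: hosts within `close_us` of each other are
// assumed to share a rack. Union-find stands in for the real algorithm;
// the latencies are made-up example values.
#include <cstdio>
#include <vector>

struct DSU {
    std::vector<int> p;
    explicit DSU(int n) : p(n) { for (int i = 0; i < n; ++i) p[i] = i; }
    int find(int x) { return p[x] == x ? x : p[x] = find(p[x]); }
    void unite(int a, int b) { p[find(a)] = find(b); }
};

int main() {
    const double close_us = 50.0;             // threshold: "same rack"
    std::vector<std::vector<double>> lat = {  // probed RTTs, microseconds
        {0, 30, 90}, {30, 0, 95}, {90, 95, 0}};
    int n = (int)lat.size();
    DSU dsu(n);
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            if (lat[i][j] < close_us) dsu.unite(i, j);
    for (int i = 0; i < n; ++i)
        std::printf("VM %d -> cluster %d\n", i, dsu.find(i));
}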

SLIDE 63

Performance in commercial cloud with PHub

[Chart: speedup vs. Facebook Gloo — about 1.4x on Azure (Standard NC6) and 2.4x on EC2 (P3.2xlarge); speedup vs. ring reduction — about 1.8x on Azure and 9.6x on EC2.]

Setup: Windows Azure and Amazon EC2, 32 instances, up to 10 Gbps. Standard_NC6: NVIDIA K80, batch size 512. P3.2xlarge: NVIDIA V100, batch size 512. Facebook Caffe2/PyTorch, ResNet 50.

SLIDE 64

Framework Integration

Support for MXNet/PyTorch/Caffe2.

auto pHub = std::make_shared<PHub>(cfg.redisIp, nMap, keySize, appAddrs, cntr, sizeof(float), cfg.rank, plp);  // construct the hub
pHub->ToggleUseSchedule(pSchedule);  // install the topology-aware reduction schedule
pHub->Reduce();  // run the reduction

SLIDE 65


Active Topology Probing

Your Cloud

Groundwork for bringing TVM to the distributed world for training and inference, on a commercial cloud or in your own cluster.

[Diagram: the TVM stack from slide 2, with PHub's optimization and active topology probing overlaid.]

SLIDE 68

Hardware Parameter Hub


SLIDE 70

Hardware Parameter Hub

Balanced computation and communication resources.

  • 10 ConnectX-3 cards
  • 560+ Gbps network bandwidth
  • 800 Gbps PCIe bandwidth
  • Fully supported by software
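As a sanity check of those numbers (assuming 56 Gbps FDR ports, which ConnectX-3 supports): 10 cards × 56 Gbps = 560 Gbps of network bandwidth, comfortably under the ~800 Gbps of PCIe bandwidth, so the NICs rather than the PCIe bus saturate first.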

SLIDE 71

Hardware Parameter Hub

35 GB/s aggregation throughput: supports 100+ ResNet-50 training nodes with a single machine.

[Chart: aggregation throughput for Gloo halving-doubling, Gloo ring, PS-Lite, and PHub software.]
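Rough arithmetic behind the 100+ node claim (my numbers, not the paper's): a ResNet-50 gradient is about 100 MB, and at roughly 3 iterations per second a node generates ~0.3 GB/s of gradient traffic, so 35 GB/s of aggregation throughput accommodates on the order of 35 / 0.3 ≈ 115 nodes.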

SLIDE 72

Hardware Parameter Hub

Better training throughput per dollar (ResNet-50; see the paper for detailed estimates).

[Chart: estimated training-throughput-per-dollar advantage, axis 0–25%.]
