Scalable Distributed Training with Parameter Hub: a whirlwind tour



SLIDE 1

Scalable Distributed Training with Parameter Hub: a whirlwind tour

SLIDE 2

TVM Stack

[Diagram: the TVM stack — High-Level Differentiable IR → Tensor Expression IR → LLVM, CUDA, Metal, and VTA backends — targeting edge FPGA, cloud FPGA, and ASIC hardware, with AutoTVM and AutoVTA driving optimization across the stack.]

SLIDE 3


Active Topology Probing

Your Cloud

Groundwork for bringing TVM to the distributed world for training and inference, on a commercial cloud or in your own cluster.

SLIDE 4

Parameter Hub

Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, and Arvind Krishnamurthy

An optimized, topology-aware, dynamic mechanism for inter-machine communication.*

* In the cloud-based training context.

SLIDE 6

Deep learning constitutes an important workload in the cloud today. Major cloud providers all have an ecosystem for learning in the cloud.

SLIDE 8

Server demand for DL inference across data centers nearly quadrupled in less than two years (source: Facebook).


SLIDE 10

EC2 reclaims your GPU instances as it runs out of capacity.


SLIDE 12

Distributed Training

INDEPENDENT FORWARD/BACKWARD PASSES + COORDINATED PARAMETER EXCHANGE

[Timeline: Worker 1 and Worker 2 each run (F)orward and (B)ackward passes independently; the Parameter Server performs (A)ggregation and (O)ptimization between iterations before the workers proceed.]
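To make the timeline concrete, here is a minimal single-process sketch of the (F)/(B)/(A)/(O) loop. Everything in it — the constant fake gradients, the plain-SGD optimizer, the function names — is illustrative, not PHub's actual code.

// Minimal single-process simulation of PS-based data-parallel training.
#include <cstdio>
#include <vector>

using Grad = std::vector<float>;

// (A)ggregation on the PS: element-wise sum of all workers' gradients.
Grad aggregate(const std::vector<Grad>& worker_grads) {
    Grad sum(worker_grads[0].size(), 0.0f);
    for (const Grad& g : worker_grads)
        for (size_t i = 0; i < g.size(); ++i) sum[i] += g[i];
    return sum;
}

// (O)ptimization on the PS: a plain SGD step on the shared parameters.
void optimize(std::vector<float>& params, const Grad& agg, float lr) {
    for (size_t i = 0; i < params.size(); ++i) params[i] -= lr * agg[i];
}

int main() {
    const int n_workers = 2;
    std::vector<float> params(4, 1.0f);  // the shared model
    for (int iter = 0; iter < 3; ++iter) {
        // (F)orward/(B)ackward run independently on each worker; here we
        // fake the result with a constant gradient per worker.
        std::vector<Grad> grads(n_workers, Grad(params.size(), 0.1f));
        optimize(params, aggregate(grads), /*lr=*/0.01f);  // PS side
        std::printf("iter %d: params[0] = %f\n", iter, params[0]);
    }
}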
SLIDE 14

Distributed Training Today

IN THE CONTEXT OF THE CLOUD

[Diagram: a network core connects top-of-rack (ToR) switches; each rack holds machines, some with GPUs.]
SLIDE 15

Distributed Training Today

FORWARD AND BACKWARD PASSES IN WORKER

[Diagram: Worker 1 and PS 1 sit in one rack, Worker 2 and PS 2 in another, each behind a ToR switch; the racks connect through the network core.]
SLIDE 16

Distributed Training Today

AGGREGATION AND OPTIMIZATION IN PS

SLIDE 17

Distributed training is communication bound

[Chart: per-iteration time (seconds, 0–1.8) for ResNet 269 across GPU generations — GRID 520 (2012), K80 (2014), M60 (2015), V100 (2017) — split into time with GPU and network active vs. GPU idle, waiting on the network.]

  • The problem gets worse over time: the bottleneck is shifting.
  • With modern GPUs, most of the time is spent in communication.
  • Making GPUs faster will do little to increase throughput; it just wastes compute resources.


SLIDE 18

Distributed training is communication bound

[Chart: the same compute vs. communication breakdown for AlexNet, GoogleNet, Inception V3, and ResNet 269.]

slide-19
SLIDE 19

Bottlenecks in DDNN training


MAPPING OF TRAINING WORKLOAD TO THE CLOUD IS INEFFICIENT.

SLIDE 21

Bottlenecks in DDNN training


FRAMEWORK BOTTLENECKS

[Diagram: inside each machine, gradients flow from the GPU through the training framework to the network.]
SLIDE 23

Bottlenecks in DDNN training

FRAMEWORK BOTTLENECKS

[Chart: per-iteration time (seconds, 0–1.6) for AlexNet, GoogleNet, Inception, and ResNet 269, broken into compute, data copy and communication, aggregator, optimizer, and synchronization and other overheads.]
SLIDE 25

Bottlenecks in DDNN training


BANDWIDTH BOTTLENECK

SLIDE 30

Bottlenecks in Cloud-based DDNN training

INSUFFICIENT BANDWIDTH


What is the minimum bandwidth each popular NN requires for communication not to bottleneck computation? (8 workers, GTX 1080 Ti, central parameter servers, MXNet.)

[Chart: cloud bandwidth today is roughly 10–25 Gbps, while GoogleNet/Inception need ~40 Gbps, ResNet ~100 Gbps, and AlexNet ~1200 Gbps.]
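As a back-of-envelope check of these figures: with a central PS, every iteration each worker sends its full gradient out and pulls the full model back, so the required bandwidth is roughly 2 × model size ÷ per-iteration compute time. In the sketch below, model sizes are the well-known parameter counts × 4 bytes (FP32), but the compute times are illustrative assumptions chosen so the outputs land near the slide's figures — they are not measurements.

// Back-of-envelope: minimum per-worker bandwidth so that communication
// (send gradients + receive parameters) fully overlaps with compute.
#include <cstdio>

int main() {
    struct Net { const char* name; double grad_mb; double compute_s; };
    Net nets[] = {
        {"GoogleNet", 28.0,  0.011},   // ~7M params x 4 B; assumed time
        {"ResNet-50", 102.0, 0.016},   // ~25.5M params x 4 B; assumed time
        {"AlexNet",   244.0, 0.0033},  // ~61M params x 4 B; assumed time
    };
    for (const Net& n : nets) {
        // 2x: gradients out + updated parameters back, every iteration.
        double gbps = 2.0 * n.grad_mb * 8.0 / 1000.0 / n.compute_s;
        std::printf("%-10s needs >= %6.0f Gbps\n", n.name, gbps);
    }
}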

SLIDE 33

Bottlenecks in Cloud-based DDNN training

DEPLOYMENT-RELATED OVERHEAD

  • Transient congestion, or oversubscription by design.
  • Cross-rack communication costs more than intra-rack communication.

[Figure: pairwise bandwidth matrix between 8 hosts, color-scaled from 4 to 9 Gbps — roughly 8.9 Gbps between hosts in the same rack and 4.7 Gbps across racks. Clustering the matrix recovers Cluster 1: {1, 3, 4, 5, 7} and Cluster 2: {2, 6, 8}.]
SLIDE 34

Parameter Hub Optimizations

CODESIGNING SOFTWARE AND HARDWARE WITH CLUSTER CONFIGURATION FOR EFFICIENT CLOUD-BASED DDNN TRAINING

SLIDE 35


Eliminating framework bottlenecks:


[Diagram: the per-iteration pipeline — GPU → data copy → aggregation → optimization → network.]

PHub optimizations: streamlining the DDNN training pipeline.

SLIDE 37


Software Optimizations

[Diagram: gradient streams arrive over the network and are processed by the server's CPU out of memory.]

SLIDE 39

Software Optimizations

GRADIENT AGGREGATION AND OPTIMIZATION

How should aggregation be organized across cores? (A code sketch of the winning scheme, tall aggregation, follows slide 44.)

  • Wide Aggregation: for each input queue, launch a series of threads that cooperate on the aggregation. This is what MXNet uses. Requires synchronization.
  • Tall Aggregation: each core reads the input queues from different workers and sequentially aggregates the same portion of the gradients within each queue, writing to its own location in the output queue. Great locality; no synchronization.
  • NUMA-aware tree reduction: organize processors into a hierarchy (NUMA 0, NUMA 1) and reduce up the tree. Too much coherence and synchronization.

SLIDE 44

Software Optimizations

TALL AGGREGATION AND OPTIMIZATION

  • Chunk a gradient into a series of virtual gradients, deterministically.
  • Each virtual gradient is mapped to a particular core on the server.
  • Virtual gradients are transferred independently.
  • A chunk is processed by only a single core, maintaining maximum locality.

[Figure: the gradient array for key 0 from 8 workers, with its chunk-to-core mappings and the aggregated result.]
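A minimal sketch of tall aggregation as just described: the gradient array is chunked deterministically, each chunk ("virtual gradient") has exactly one owning core, and that core sums the corresponding slice from every worker. The chunk size and the round-robin mapping are illustrative assumptions, not PHub's actual parameters.

// Tall aggregation sketch: one core exclusively owns each chunk of a key's
// gradient, so aggregation needs no cross-core synchronization.
#include <algorithm>
#include <cstdio>
#include <vector>

constexpr size_t kChunkElems = 4;  // illustrative; real chunks are larger

// Deterministic chunk -> core mapping (illustrative round-robin).
size_t owner_core(size_t chunk_idx, size_t n_cores) {
    return chunk_idx % n_cores;
}

// Work done by one core: sum its own chunks across all workers.
void aggregate_on_core(size_t core, size_t n_cores,
                       const std::vector<std::vector<float>>& worker_grads,
                       std::vector<float>& out) {
    size_t n_chunks = (out.size() + kChunkElems - 1) / kChunkElems;
    for (size_t c = 0; c < n_chunks; ++c) {
        if (owner_core(c, n_cores) != core) continue;  // not ours
        size_t lo = c * kChunkElems;
        size_t hi = std::min(out.size(), lo + kChunkElems);
        for (size_t i = lo; i < hi; ++i) {
            float s = 0.0f;
            for (const auto& g : worker_grads) s += g[i];
            out[i] = s;  // exclusive writer: maximum locality
        }
    }
}

int main() {
    const size_t n_cores = 2, n_workers = 8, len = 10;
    std::vector<std::vector<float>> grads(n_workers,
                                          std::vector<float>(len, 1.0f));
    std::vector<float> agg(len, 0.0f);
    for (size_t core = 0; core < n_cores; ++core)  // would run in parallel
        aggregate_on_core(core, n_cores, grads, agg);
    std::printf("agg[0] = %f (expect 8)\n", agg[0]);
}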
SLIDE 47

Software Optimizations

TALL AGGREGATION AND OPTIMIZATION

When aggregation of a chunk is done, PHub:

  • optimizes the chunk on the same core that aggregated it;
  • uses FP32-level streaming aggregation and optimization to hide communication latency.

[Figure: the array for key 0 from 8 workers, aggregated and then optimized chunk by chunk.]
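A sketch of that core-affinity idea: the moment the last worker's copy of a chunk lands, the owning core immediately applies the optimizer to that chunk (plain SGD standing in for whatever optimizer is configured), instead of waiting for the whole key. All names here are illustrative.

// Streaming aggregate-then-optimize on one core: a chunk is processed end
// to end while later chunks are still arriving, hiding communication latency.
#include <cstdio>
#include <vector>

// One chunk's state on its owning core.
struct ChunkState {
    std::vector<float> agg;  // running sum across workers
    int arrivals = 0;        // worker copies received so far
};

// Called each time a worker's copy of this chunk arrives. When the last
// copy lands, the *same* core optimizes the chunk right away.
void on_chunk_arrival(ChunkState& s, const std::vector<float>& grad,
                      int n_workers, std::vector<float>& params, float lr) {
    for (size_t i = 0; i < grad.size(); ++i) s.agg[i] += grad[i];  // aggregate
    if (++s.arrivals == n_workers)                                 // last copy?
        for (size_t i = 0; i < grad.size(); ++i)                   // optimize
            params[i] -= lr * s.agg[i];
}

int main() {
    const int n_workers = 8;
    std::vector<float> params(4, 1.0f);
    ChunkState s{std::vector<float>(4, 0.0f)};
    std::vector<float> grad(4, 0.125f);
    for (int w = 0; w < n_workers; ++w)  // copies stream in one by one
        on_chunk_arrival(s, grad, n_workers, params, 0.1f);
    std::printf("params[0] = %f\n", params[0]);  // 1 - 0.1 * 1.0 = 0.9
}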
SLIDE 48


Eliminating deployment bottlenecks: PHub hierarchical reduction reduces cross-rack traffic.

SLIDE 51

Two-Phase Hierarchical Aggregation

RACK SCALE PARAMETER SERVICE

[Diagram: a rack-scale parameter service — each rack's Worker/PS machines connect through their ToR to the cluster network, with a PBox aggregator inside the rack.]
SLIDE 54

Two-Phase Hierarchical Aggregation

ADAPTING TO THE DATACENTER NETWORK TOPOLOGY

[Diagram: per-rack aggregators feed inter-rack aggregation across the cluster network.]

  • 1. Intra-rack central aggregation.
  • 2. Inter-rack aggregation.

Result: an N-times reduction in cross-rack traffic!
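A quick check of the claimed reduction: with N workers in a rack, the rack aggregator forwards one combined gradient across the core instead of N individual ones. For example (illustrative numbers):

// Cross-rack bytes per iteration, flat vs. two-phase hierarchical reduction.
#include <cstdio>

int main() {
    const double grad_gb = 0.1;      // ~100 MB model (e.g., ResNet-50)
    const int workers_per_rack = 8;  // illustrative rack size
    double flat = workers_per_rack * grad_gb;  // every worker crosses the core
    double hier = 1 * grad_gb;                 // only the rack aggregate does
    std::printf("flat: %.1f GB, hierarchical: %.1f GB (a %dx reduction)\n",
                flat, hier, workers_per_rack);
}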
SLIDE 55

Efficient DDNN Training in Commercial Cloud

ACTIVE TOPOLOGY PROBING

VMs in Azure/EC2 → DPDK-based latency probe → distance matrix → clustering algorithms → inferred network topology* → automagic schedule generation → hierarchical reduction plan.

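A minimal sketch of the clustering step in this pipeline: given the probed distance matrix, group VMs whose mutual latency falls below a threshold. The union-find threshold clustering and every number below are stand-ins; the slides do not say which clustering algorithm PHub actually uses.

// Group VMs by probed latency: hosts within `close_us` of each other are
// assumed to share a rack. Union-find stands in for the real algorithm;
// the latencies are made-up example values.
#include <cstdio>
#include <vector>

struct DSU {
    std::vector<int> p;
    explicit DSU(int n) : p(n) { for (int i = 0; i < n; ++i) p[i] = i; }
    int find(int x) { return p[x] == x ? x : p[x] = find(p[x]); }
    void unite(int a, int b) { p[find(a)] = find(b); }
};

int main() {
    const double close_us = 50.0;             // threshold: "same rack"
    std::vector<std::vector<double>> lat = {  // probed RTTs, microseconds
        {0, 30, 90}, {30, 0, 95}, {90, 95, 0}};
    int n = (int)lat.size();
    DSU dsu(n);
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            if (lat[i][j] < close_us) dsu.unite(i, j);
    for (int i = 0; i < n; ++i)
        std::printf("VM %d -> cluster %d\n", i, dsu.find(i));
}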

SLIDE 63

Performance in commercial cloud with PHub

[Chart: speedup vs. Facebook Gloo — about 1.4x on Azure (Standard NC6) and 2.4x on EC2 (P3.2xlarge); speedup vs. ring reduction — about 1.8x on Azure and 9.6x on EC2.]

Setup: Windows Azure and Amazon EC2, 32 instances, up to 10 Gbps. Standard_NC6: NVIDIA K80, batch size 512. P3.2xlarge: NVIDIA V100, batch size 512. Facebook Caffe2/PyTorch, ResNet 50.

SLIDE 64

Framework Integration

Support for MXNet/PyTorch/Caffe2.

auto pHub = std::make_shared<PHub>(cfg.redisIp, nMap, keySize, appAddrs, cntr, sizeof(float), cfg.rank, plp);  // construct the hub
pHub->ToggleUseSchedule(pSchedule);  // install the topology-aware reduction schedule
pHub->Reduce();  // run the reduction

SLIDE 65


Active Topology Probing

Your Cloud

Groundwork for bringing TVM to the distributed world for training and inference, on a commercial cloud or in your own cluster.

[Diagram: the TVM stack from slide 2, with PHub's optimization and active topology probing overlaid.]

SLIDE 68

Hardware Parameter Hub


SLIDE 70

Hardware Parameter Hub

Balanced computation and communication resources.

  • 10 ConnectX-3 cards
  • 560+ Gbps network bandwidth
  • 800 Gbps PCIe bandwidth
  • Fully supported by software
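As a sanity check of those numbers (assuming 56 Gbps FDR ports, which ConnectX-3 supports): 10 cards × 56 Gbps = 560 Gbps of network bandwidth, comfortably under the ~800 Gbps of PCIe bandwidth, so the NICs rather than the PCIe bus saturate first.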

SLIDE 71

Hardware Parameter Hub

35 GB/s aggregation throughput: supports 100+ ResNet-50 training nodes with a single machine.

[Chart: aggregation throughput for Gloo halving-doubling, Gloo ring, PS-Lite, and PHub software.]
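Rough arithmetic behind the 100+ node claim (my numbers, not the paper's): a ResNet-50 gradient is about 100 MB, and at roughly 3 iterations per second a node generates ~0.3 GB/s of gradient traffic, so 35 GB/s of aggregation throughput accommodates on the order of 35 / 0.3 ≈ 115 nodes.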

SLIDE 72

Hardware Parameter Hub

Better training throughput per dollar (ResNet-50; see the paper for detailed estimates).

[Chart: estimated training-throughput-per-dollar advantage, axis 0–25%.]
