Scalable Distributed Training with Parameter Hub: a whirlwind tour
TVM Stack
[Diagram: Optimization layer (AutoTVM, AutoVTA); High-Level Differentiable IR; Tensor Expression IR; LLVM, CUDA, Metal; VTA; Hardware: Edge FPGA, Cloud FPGA, ASIC, Fleet.]
Active Topology Probing
Your Cloud
Groundwork for bringing TVM to the distributed world for training and inference, on commercial cloud, or in your own cluster.
Parameter Hub
Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee and Arvind Krishnamurthy
An optimized, topology-aware, and dynamic mechanism for inter-machine communication, in the cloud-based training context.
Deep learning constitutes an important workload in the cloud today; major cloud providers all have an ecosystem for cloud learning.
Server demand for DL inference across data centers nearly quadrupled in less than 2 years. (Source: Facebook)
EC2 reclaims your GPU instances as they run out of capacity
Distributed Training
INDEPENDENT FORWARD/BACKWARD PASSES + COORDINATED PARAMETER EXCHANGE
[Timeline: Worker 1 and Worker 2 each run (F)orward and (B)ackward passes independently; the parameter server performs (A)ggregation and (O)ptimization between iterations.]
Distributed Training Today
IN THE CONTEXT OF THE CLOUD
[Diagram: a network core connects ToR switches; each rack holds machines with GPUs.]
Distributed Training Today
FORWARD AND BACKWARD PASSES IN WORKER
Distributed Training Today
AGGREGATION AND OPTIMIZATION IN PS
Distributed training is communication bound
[Charts: seconds per iteration for ResNet 269 across GPU generations (GRID 520, 2012; K80, 2014; M60, 2015; V100, 2017), split into "GPU and network active" vs. "GPU idle, waiting on network"; a second chart shows the same breakdown for AlexNet, GoogleNet, Inception V3, and ResNet 269.]
- The problem gets worse over time: a shifting bottleneck.
- With modern GPUs, most of the time is spent in communication.
- Making GPUs faster will do little to increase throughput.
- Compute resources are wasted.
Bottlenecks in DDNN training
MAPPING OF TRAINING WORKLOAD TO THE CLOUD IS INEFFICIENT.
Bottlenecks in DDNN training
FRAMEWORK BOTTLENECKS
[Diagram: the training data path GPU → Training Framework → Network.]
Bottlenecks in DDNN training
FRAMEWORK BOTTLENECKS
[Chart: seconds per iteration for ResNet 269, Inception, GoogleNet, and AlexNet, broken down into compute, data copy and communication, aggregator, optimizer, and synchronization and other overheads.]
Bottlenecks in DDNN training
BANDWIDTH BOTTLENECK
Bottlenecks in Cloud-based DDNN training
INSUFFICIENT BANDWIDTH
What is the minimum bandwidth required for each popular NN so that communication does not bottleneck computation? (8 workers, GTX 1080 Ti, central parameter servers, MXNet)
- GoogleNet / Inception: 40 Gbps
- ResNet: 100 Gbps
- AlexNet: 1200 Gbps
Cloud bandwidth today: roughly 10-25 Gbps.
Bottlenecks in Cloud-based DDNN training
DEPLOYMENT-RELATED OVERHEAD
- Transient congestion, or oversubscription by design.
- Cross-rack communication cost is higher than intra-rack communication.
[Heatmap: measured pairwise bandwidth between 8 hosts, roughly 4-9 Gbps; clustering the matrix reveals two groups — Cluster 1: hosts 1, 3, 4, 5, 7; Cluster 2: hosts 2, 6, 8.]
Parameter Hub Optimizations
CODESIGNING SOFTWARE AND HARDWARE WITH CLUSTER CONFIGURATION FOR EFFICIENT CLOUD-BASED DDNN TRAINING
Eliminating framework bottlenecks
PHub Optimizations: streamlining the DDNN training pipeline
[Diagram: pipeline GPU → Data Copy → Aggregation → Optimization → … → Network.]
Software Optimizations
[Diagram: gradients arrive from the network and traverse the parameter server's CPU and memory.]
Software Optimizations
GRADIENT AGGREGATION AND OPTIMIZATION
- Wide aggregation (used in MXNet): for each input queue, launch a series of threads for aggregation; each core reads input queues from different workers and writes to different locations in the output queue. Requires synchronization.
- Tall aggregation: each core sequentially aggregates the same portion of the gradients within every queue. Great locality; no synchronization.
- NUMA-aware tree reduction: organize processors into a hierarchy (NUMA 0, NUMA 1) and perform a tree reduction across sockets. Too much coherence traffic and synchronization.
Software Optimizations
TALL AGGREGATION AND OPTIMIZATION
- Chunk a gradient into a series of virtual gradients deterministically.
- A virtual gradient is mapped to a particular core on the server.
- Virtual gradients are transferred independently.
- A chunk is only processed by a single core, maintaining maximum locality.
[Diagram: the gradient array for key 0 from 8 workers, its chunk-to-core mappings, and the aggregated result.]
Software Optimizations
TALL AGGREGATION AND OPTIMIZATION
When aggregation is done, PHub:
- Optimizes a chunk with the same core that aggregated that chunk.
- Performs FP32-level streaming aggregation and optimization to hide communication latency.
[Diagram: the array for key 0 from 8 workers, with aggregated and optimized chunks.]
Eliminating deployment bottlenecks
PHub hierarchical reduction: reducing cross-rack traffic
Two-Phase Hierarchical Aggregation
RACK-SCALE PARAMETER SERVICE
[Diagram: within each rack, the Worker/PS machines aggregate through a central PBox aggregator at the ToR before traffic reaches the cluster network.]
ADAPTING TO THE DATACENTER NETWORK TOPOLOGY
1. Intra-rack central aggregation.
2. Inter-rack aggregation.
An N-times reduction in cross-rack traffic!
Efficient DDNN Training in Commercial Cloud
ACTIVE TOPOLOGY PROBING
[Pipeline: VMs on Azure/EC2 → DPDK-based latency probe → distance matrix → clustering algorithms → inferred network topology* → automagic schedule generation → hierarchical reduction plan.]
Performance in commercial cloud with PHub
- vs. Facebook Gloo: 1.4x on Azure (Standard NC6), 2.4x on EC2 (P3.2xlarge).
- vs. ring reduction: 1.8x on Azure (Standard NC6), 9.6x on EC2 (P3.2xlarge).
Setup: Windows Azure and Amazon EC2, 32 instances, up to 10 Gbps. Standard_NC6: Nvidia K80, batch size 512. P3.2xlarge: Nvidia V100, batch size 512. Facebook Caffe2/PyTorch, ResNet 50.
Framework Integration
Support for MXNet/PyTorch/Caffe2.

// Create a PHub instance for this rank, then run a reduction using a
// topology-aware schedule.
auto pHub = std::make_shared<PHub>(cfg.redisIp, nMap, keySize, appAddrs,
                                   cntr, sizeof(float), cfg.rank, plp);
pHub->ToggleUseSchedule(pSchedule);
pHub->Reduce();
Groundwork for bringing TVM to the distributed world for training and inference, on commercial cloud, or in your own cluster.
Hardware Parameter Hub
Balanced computation and communication resources:
- 10 ConnectX-3 cards
- 560+ Gbps network bandwidth
- 800 Gbps PCIe bandwidth
- Fully supported by software
Hardware Parameter Hub
35 GB/s aggregation throughput; supports 100+ ResNet-50 training nodes with a single machine.
[Chart: throughput comparison of Gloo HD, Gloo Ring, PS-Lite, and PHub SW, at roughly 2, 4.5, 5, and 7 respectively, in the original chart's units.]
Hardware Parameter Hub
Better training throughput/$. (ResNet-50; see paper for detailed estimates.)