spcl.inf.ethz.ch @spcl_eth
T. Ben-Nun, T. Hoefler
Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis
https://www.arxiv.org/abs/1802.09941
What is Deep Learning good for?
[Timeline figure: milestones from digit recognition (1989) and object classification (2012) to gameplay AI, segmentation, translation, neural computers, and image captioning (2013–2017)]
A very promising area of research!
[Plot: number of deep-learning papers per year – roughly 23 papers per day]
How does Deep Learning work?
[Figure: network comparison from Canziani et al. 2017; number of users ~0.8 bn]
[Diagram: the network maps an image to class probabilities (e.g., cat 0.54, dog 0.28, …); layer-wise weight updates move the output toward the true one-hot label (cat 1.00)]
▪ ImageNet (1k classes): 180 GB
▪ ImageNet (22k classes): a few TB
▪ Industry: much larger
▪ 100–200 layers deep
▪ ~100M–2B parameters, 0.1–8 GiB parameter storage
▪ 10–22k labels – growing (e.g., face recognition)
▪ Weeks to train
Deep Learning is Supercomputing!
A brief theory of supervised deep learning
[Diagram: network output probabilities (e.g., cat 0.54, dog 0.28, …) vs. the true one-hot label (cat 1.00)]
Labeled samples $y \in Y$ drawn from a distribution $\mathcal{D}$, label domain $Z$, true label $m(y)$
Network $g(y): Y \rightarrow Z$ with fixed structure and learned weights $x$:
$$x^* = \operatorname{argmin}_{x \in \mathbb{R}^d} \; \mathbb{E}_{y \sim \mathcal{D}}\left[\ell(x, y)\right]$$
0–1 loss:
$$\ell_{0-1}(x, y) = \begin{cases} 0 & g(y) = m(y) \\ 1 & g(y) \neq m(y) \end{cases}$$
The network is a composition of layer functions (convolution 1–3, pooling, fully connected, …), evaluated layer by layer with layer-wise weight updates:
$$g(y) = g_L\left(g_{L-1}\left(g_{L-2}\left(\dots g_1(y) \dots\right)\right)\right)$$
Cross-entropy (softmax) loss:
$$\ell_{ce}(x, y) = -\sum_j m(y)_j \cdot \log\left(\frac{e^{g(y)_j}}{\sum_l e^{g(y)_l}}\right)$$
Squared loss:
$$\ell_{sq}(x, y) = \left\lVert g(y) - m(y) \right\rVert^2$$
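As a concrete illustration (not from the slides), a minimal NumPy sketch of the softmax cross-entropy loss above; `g_y` stands for the network output logits $g(y)$ and `m_y` for the one-hot true label $m(y)$.

```python
import numpy as np

def cross_entropy_loss(g_y, m_y):
    """Softmax cross-entropy: -sum_j m(y)_j * log(softmax(g(y))_j)."""
    z = g_y - np.max(g_y)                      # shift for numerical stability
    softmax = np.exp(z) / np.sum(np.exp(z))
    return -np.sum(m_y * np.log(softmax + 1e-12))

# Example: logits for [Cat, Dog, Airplane, Truck, Horse, Bicycle], true label "Cat"
g_y = np.array([2.0, 1.3, -1.0, -0.5, 0.1, -2.0])
m_y = np.array([1.0, 0.0,  0.0,  0.0, 0.0,  0.0])
print(cross_entropy_loss(g_y, m_y))
```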
Stochastic Gradient Descent
[Diagram: forward pass through convolution 1–3, pooling, and fully connected layers computes $g_1(y)$, $g_2(g_1(y))$, …, $g(y)$; the backward pass propagates gradients layer by layer]
▪ Layer storage = $x_\ell + g_\ell(p_{\ell-1}) + \nabla x_\ell + \nabla p_\ell$ (weights, layer output, weight gradients, output gradients)
$$x^* = \operatorname{argmin}_{x \in \mathbb{R}^d} \; \mathbb{E}_{y \sim \mathcal{D}}\left[\ell(x, y)\right]$$
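A minimal sketch of a single stochastic gradient descent step on one sample, purely illustrative; `grad_loss` is a hypothetical callback returning $\nabla \ell(x, y)$, not something defined in the talk.

```python
import numpy as np

def sgd_step(x, grad_loss, y, eta=0.01):
    """One SGD step: x <- x - eta * gradient of the loss at sample y."""
    return x - eta * grad_loss(x, y)

# Toy usage: least-squares loss l(x, (a, b)) = (a.x - b)^2, gradient 2*(a.x - b)*a
grad = lambda x, sample: 2.0 * (sample[0] @ x - sample[1]) * sample[0]
x = np.zeros(3)
x = sgd_step(x, grad, (np.array([1.0, 2.0, 3.0]), 1.0))
print(x)
```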
Trends in deep learning: hardware and multi-node
The field is moving fast – trying everything imaginable – survey results from 227 papers in the area of parallel deep learning.
[Charts: hardware used; shared vs. distributed memory]
Deep Learning is largely on distributed memory today!
Trends in distributed deep learning: node count and communication
Deep Learning research is converging to MPI!
The field is moving fast – trying everything imaginable – survey results from 227 papers in the area of parallel deep learning
Minibatch Stochastic Gradient Descent (SGD)
[Diagram: for each sample in the minibatch, the network output probabilities (e.g., cat 0.54, dog 0.28, …) are compared against the true one-hot label (cat 1.00)]
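A minimal NumPy sketch of minibatch SGD, assuming a hypothetical `grad_loss(x, sample)` function and an indexable dataset; batch size, learning rate, and step count are arbitrary placeholders.

```python
import numpy as np

def minibatch_sgd(x, data, grad_loss, eta=0.01, batch_size=4, steps=100):
    """Minibatch SGD: average the per-sample gradients over a random minibatch,
    then take one step in the negative gradient direction."""
    rng = np.random.default_rng(0)
    for _ in range(steps):
        batch = rng.choice(len(data), size=batch_size, replace=False)
        g = np.mean([grad_loss(x, data[i]) for i in batch], axis=0)
        x = x - eta * g
    return x

# Toy usage: fit x to the mean of the data under squared loss l(x, y) = ||x - y||^2
data = [np.random.randn(3) + 5.0 for _ in range(64)]
grad = lambda x, y: 2.0 * (x - y)
print(minibatch_sgd(np.zeros(3), data, grad))
```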
T. Hoefler, D. Moor: Energy, Memory, and Runtime Tradeoffs for Implementing Collective Communication Operations, JSFI’14
A primer of relevant parallelism and communication theory
[DAG example: Work $W = 39$, Depth $D = 7$]
Average parallelism $= \dfrac{W}{D}$
Parallel Reductions for Parameter Updates
▪ Tree (small vectors): $U = 2M \log_2 Q + 2\delta n H \log_2 Q$
▪ Butterfly (small vectors): $U = M \log_2 Q + \delta n H \log_2 Q$
▪ Pipeline (large vectors): $U = 2M(Q-1) + 2\delta n H (Q-1)/Q$
▪ RedScat+Gat (large vectors): $U = 2M \log_2 Q + 2\delta n H (Q-1)/Q$
▪ Lower bound: $U \geq M \log_2 Q + 2\delta n H (Q-1)/Q$
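The reduction costs above can be compared numerically with a small Python sketch; the reading of the symbols (M = per-message latency, Q = number of processes, δn = message size in bytes, H = time per byte) is an assumption about the slide's notation.

```python
from math import log2

def allreduce_cost(M, Q, delta, n, H):
    """Cost U of allreduce algorithms under a simple latency/bandwidth model.
    Assumed symbols: M = latency, Q = processes, delta*n = bytes, H = time per byte."""
    m = delta * n
    return {
        "tree":           2 * M * log2(Q) + 2 * m * H * log2(Q),
        "butterfly":          M * log2(Q) +     m * H * log2(Q),
        "pipeline":       2 * M * (Q - 1) + 2 * m * H * (Q - 1) / Q,
        "redscat_gather": 2 * M * log2(Q) + 2 * m * H * (Q - 1) / Q,
        "lower_bound":        M * log2(Q) + 2 * m * H * (Q - 1) / Q,
    }

# Example: 256 processes, 100M parameters of 4 bytes each
print(allreduce_cost(M=1e-6, Q=256, delta=4, n=100_000_000, H=1e-10))
```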
GoogLeNet in more detail
▪ ~6.8M parameters ▪ 22 layers deep
[Diagram: backpropagation through the layers – each layer $\ell$ produces its output $g_\ell(y)$ in the forward pass and the gradients $\nabla x_\ell$ (weights) and $\nabla p_\ell$ (layer outputs) in the backward pass]
Parallelism in the different layer types
[Example: a 4×4 input convolved with a 3×3 filter yields outputs such as 21.9, 59.3, 53.9, 43.9 – each output element is an independent dot product; see the sketch below]
W is linear and D logarithmic – large average parallelism
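A minimal NumPy sketch of a direct 2-D convolution (valid padding, single channel), illustrating that every output element is an independent dot product; it is not guaranteed to reproduce the exact numbers on the slide, since the slide's padding and kernel orientation are not specified.

```python
import numpy as np

def conv2d_direct(x, w):
    """Direct 2-D convolution (cross-correlation), 'valid' padding: each output
    element is an independent dot product of a filter-sized patch with w."""
    H, W = x.shape
    kH, kW = w.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kH, j:j + kW] * w)
    return out

x = np.array([[4, 1, 9, 8], [5, 9, 9, 8], [0, 7, 3, 4], [2, 6, 3, 1]], dtype=float)
w = np.array([[1, -1, 0], [0.1, -2, 0], [3, 4, 1.1]])
print(conv2d_direct(x, w))
```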
Computing fully connected layers
[Diagram: a fully connected layer with inputs $y_1, y_2, y_3$, weights $x_{j,i}$, and biases $c_i$; each output neuron computes $\tau\left(\sum_j x_{j,1} y_j + c_1\right)$ and $\tau\left(\sum_j x_{j,2} y_j + c_2\right)$]
For a minibatch, the layer becomes a matrix-matrix multiplication of the input matrix (one row of activations plus a constant 1 per sample)
$$\begin{pmatrix} y_{1,1} & y_{1,2} & y_{1,3} & 1 \\ y_{2,1} & y_{2,2} & y_{2,3} & 1 \\ \vdots & \vdots & \vdots & \vdots \\ y_{N,1} & y_{N,2} & y_{N,3} & 1 \end{pmatrix}$$
with the weight matrix (biases folded in as the last row).
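A minimal NumPy sketch of a fully connected layer computed as a minibatch matrix multiplication; tanh is only a placeholder for the activation τ.

```python
import numpy as np

def fc_forward(Y, X, c, tau=np.tanh):
    """Fully connected layer for a minibatch: out = tau(Y @ X + c).
    Y: (batch, in_features), X: (in_features, out_features), c: (out_features,)."""
    return tau(Y @ X + c)

Y = np.random.randn(8, 3)          # minibatch of 8 samples, 3 input features
X = np.random.randn(3, 2)          # weights x_{j,i}
c = np.zeros(2)                    # biases c_i
print(fc_forward(Y, X, c).shape)   # (8, 2)
```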
Computing convolutional layers
▪ Direct: loop over the output; each element is a dot product of a filter-sized input patch with the filter [example: 4×4 input convolved with a 3×3 filter]
▪ Indirect via im2col: lower the convolution to a matrix multiplication (see the sketch below)
▪ Indirect via FFT: $\mathcal{F}^{-1}\left(\hat{x} \odot \hat{w}\right)$ with $\hat{x} = \mathcal{F}(x)$, $\hat{w} = \mathcal{F}(w)$ – element-wise product in the frequency domain
▪ Indirect via Winograd (… Convolutional Neural Networks, ICLR’17 Workshop)
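A minimal NumPy sketch of the im2col approach: input patches are unrolled into rows of a matrix so the convolution becomes a single matrix product (illustrative only; a real implementation would also handle channels and padding).

```python
import numpy as np

def conv2d_im2col(x, w):
    """im2col convolution: unroll each filter-sized patch of x into a row,
    then compute all outputs with one matrix-vector product."""
    H, W = x.shape
    kH, kW = w.shape
    oH, oW = H - kH + 1, W - kW + 1
    cols = np.empty((oH * oW, kH * kW))
    for i in range(oH):
        for j in range(oW):
            cols[i * oW + j] = x[i:i + kH, j:j + kW].ravel()
    return (cols @ w.ravel()).reshape(oH, oW)

x = np.random.randn(4, 4)
w = np.random.randn(3, 3)
print(conv2d_im2col(x, w))   # matches the direct loop up to floating-point error
```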
Model parallelism
▪ Parameters can be distributed across processors
▪ Mini-batch has to be copied to all processors
▪ Backpropagation requires all-to-all communication every layer
[Diagram: the neurons/parameters of each layer are partitioned across processors]
U.A. Muller and A. Gunzinger: Neural Net Simulation on Parallel Computers, IEEE Int’l Conf. on Neural Networks 1994
Pipeline parallelism
▪ Layers/parameters can be distributed across processors ▪ Sparse communication pattern (only pipeline stages) ▪ Mini-batch has to be copied through all processors
[Diagram: consecutive groups of layers assigned to different processors form the pipeline stages]
Data parallelism
▪ Simple and efficient solution, easy to implement ▪ Duplicate parameters at all processors
[Diagram: every processor holds a full copy of the network and processes a different part of the minibatch]
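A minimal data-parallel SGD step sketched with mpi4py (assumes mpi4py and an MPI launcher are available; this is an illustration, not the talk's code): each rank computes a gradient on its share of the minibatch, the gradients are summed with an allreduce, and every rank applies the same averaged update.

```python
# Run with e.g.: mpirun -np 4 python data_parallel_sgd.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def data_parallel_step(x, local_grad, eta=0.01):
    """Average the local gradients of all ranks and apply the same SGD update."""
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)   # sum over all ranks
    return x - eta * global_grad / comm.Get_size()        # average, then update

x = np.zeros(10)
local_grad = np.random.randn(10)   # stand-in for the gradient on this rank's samples
x = data_parallel_step(x, local_grad)
```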
Hybrid parallelism
▪ Layers/parameters can be distributed across processors ▪ Can distribute minibatch ▪ Often specific to layer-types (e.g., distribute fc layers but handle conv layers data-parallel)
▪ Enables arbitrary combinations of data, model, and pipeline parallelism – very powerful!
[Diagram: model parallelism, data parallelism, and layer (pipeline) parallelism combined]
Updating parameters in distributed data parallelism
Central: sharded parameter server applies $x' = v(x, \nabla x)$ – training agents send $\nabla x$ and receive the updated $x$ (a toy sketch follows below)
$$U = 2M + 2Q\,\delta (n/t)\,H$$
Decentral: collective allreduce of $x$ among the training agents
$$U = 2M \log_2 Q + 2\delta n H (Q-1)/Q$$
Further variants: Hierarchical Parameter Server; Adaptive Minibatch Size
S. Gupta et al.: Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study, ICDM’16
S. L. Smith et al.: Don’t Decay the Learning Rate, Increase the Batch Size, arXiv 2017
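A toy, single-process simulation of a sharded parameter server (illustrative; the class and its methods are my own naming, and plain SGD stands in for the unspecified update rule $x' = v(x, \nabla x)$).

```python
import numpy as np

class ShardedParameterServer:
    """Toy sharded parameter server: weights split into t shards; agents push
    gradient pieces and pull the current weights of a shard."""
    def __init__(self, x, t, eta=0.01):
        self.shards = np.array_split(x, t)
        self.eta = eta

    def push(self, shard_id, grad):
        # stand-in update rule x' = v(x, grad): here plain SGD
        self.shards[shard_id] -= self.eta * grad

    def pull(self, shard_id):
        return self.shards[shard_id]

ps = ShardedParameterServer(np.zeros(10), t=2)
ps.push(0, np.random.randn(5))   # an agent pushes a gradient for shard 0
print(ps.pull(0))
```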
Parameter (and Model) consistency - centralized
▪ Started with Hogwild! [Niu et al. 2011] – shared memory, by chance
▪ DistBelief [Dean et al. 2012] moved the idea to distributed memory
▪ Trades off “statistical performance” for “hardware performance”
[Diagram: sharded parameter server $x' = v(x, \nabla x)$ exchanging $x$ and $\nabla x$ with the training agents]
Consistency spectrum: Synchronous – Stale-Synchronous / Bounded Asynchronous – Asynchronous
[Timeline diagrams (Agent 1 … Agent m, Parameter Server, evolution of $x^{(0)} \dots x^{(U)}$): synchronous – all agents synchronize with the server every step; stale-synchronous – agents run ahead with local versions $x^{(u,j)}$ up to a staleness bound before synchronizing; asynchronous – agents exchange $x^{(u,j)}$ with the server without global synchronization]
▪ Parameter exchange frequency can be controlled, while still attaining convergence (a toy bounded-staleness sketch follows below)
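A toy, single-process simulation of the stale-synchronous idea (my own construction, not the talk's algorithm): an agent may only apply its update if it is at most `staleness` steps ahead of the slowest agent.

```python
import numpy as np

def stale_synchronous_sgd(grads, n_agents, staleness=2, eta=0.01, dim=10, steps=20):
    """Toy simulation of stale-synchronous SGD: an agent may only advance if it is
    at most `staleness` steps ahead of the slowest agent."""
    x = np.zeros(dim)                 # shared parameters (the "server")
    clock = [0] * n_agents            # per-agent step counters
    rng = np.random.default_rng(0)
    for _ in range(steps * n_agents):
        j = rng.integers(n_agents)    # pick a (randomly fast or slow) agent
        if clock[j] - min(clock) > staleness:
            continue                  # too far ahead: this agent must wait
        x -= eta * grads(x, j)        # apply the agent's update to the shared x
        clock[j] += 1
    return x

# Example with a stand-in gradient function
print(stale_synchronous_sgd(lambda x, j: x - 1.0, n_agents=4))
```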
Parameter (and Model) consistency - decentralized
▪ Parameter exchange frequency can be controlled, while still attaining convergence
▪ May also consider limited/slower distribution – gossip [Jin et al. 2016]
Consistency spectrum: Synchronous – Stale-Synchronous / Bounded Asynchronous – Asynchronous
[Timeline diagrams: training agents perform a collective allreduce of $x$ – synchronously after every step, after several local steps with a merge, or gossip-style between pairs of agents (Agent r, Agent k) only]
Peter H. Jin et al.: “How to scale distributed deep learning?”, NIPS MLSystems 2016
Parameter consistency in deep learning
Consistency spectrum, from consistent to inconsistent: Synchronous SGD – Stale-Synchronous SGD – Asynchronous SGD (HOGWILD!) – Model Averaging (e.g., elastic) – Ensemble Learning
[Timeline diagram: agents evolve local versions $x^{(u,j)}$ and synchronize with a parameter server only occasionally]
Elastic Average – using attractive (“physical”) forces between the different versions of $x$ (see the sketch below):
$$x^{(u+1,j)} = x^{(u,j)} - \theta\,\nabla x^{(u,j)} - \beta\left(x^{(u,j)} - \bar{x}_u\right)$$
$$\bar{x}_{u+1} = (1-\gamma)\,\bar{x}_u + \frac{\gamma}{n}\sum_{j=1}^{n} x^{(u,j)}$$
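A minimal NumPy sketch of the elastic-averaging update above; the gradients and the hyperparameters θ, β, γ are arbitrary stand-ins.

```python
import numpy as np

def elastic_average_step(x_agents, x_center, grads, theta=0.01, beta=0.1, gamma=0.5):
    """One elastic-averaging step for n agents:
    x^(u+1,j) = x^(u,j) - theta * grad_j - beta * (x^(u,j) - x_center)
    x_center  = (1 - gamma) * x_center + (gamma / n) * sum_j x^(u,j)"""
    n = len(x_agents)
    new_agents = [x - theta * g - beta * (x - x_center)
                  for x, g in zip(x_agents, grads)]
    new_center = (1 - gamma) * x_center + (gamma / n) * sum(x_agents)
    return new_agents, new_center

# Toy usage with stand-in gradients
agents = [np.random.randn(5) for _ in range(4)]
center = np.zeros(5)
grads = [np.random.randn(5) for _ in range(4)]
agents, center = elastic_average_step(agents, center, grads)
print(center)
```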
Parameter consistency in deep learning
Consistency spectrum, from consistent to inconsistent: Synchronous SGD – Stale-Synchronous SGD – Asynchronous SGD (HOGWILD!) – Model Averaging (e.g., elastic) – Ensemble Learning
[Diagram: ensemble learning – the class probabilities of several independently trained models (e.g., cat 0.54, dog 0.28, …) are averaged]
Communication optimizations
▪ Different options how to optimize updates
  ▪ Send $\nabla x$, receive $x$
  ▪ Send FC factors ($p_{\ell-1}$, $p_\ell$), compute $\nabla x$ on the parameter server; broadcast the factors to avoid receiving the full $x$
  ▪ Use lossy compression when sending, accumulate the error locally!
▪ Quantization
  ▪ Quantize weight updates and potentially weights
  ▪ Main trick is stochastic rounding [1] – the expectation is more accurate; enables low precision (half, quarter) to become standard
  ▪ TernGrad – ternary weights [2], 1-bit SGD [3], …
▪ Sparsification
  ▪ Do not send small weight updates, or only send the top-k [4]; accumulate the rest locally (see the sketch after the references)
[Diagram: sharded parameter server $x' = v(x, \nabla x)$ exchanging $x$ and $\nabla x$ with the training agents]
[1] S. Gupta et al.: Deep Learning with Limited Numerical Precision, ICML’15
[2] F. Li and B. Liu: Ternary Weight Networks, arXiv 2016
[3] F. Seide et al.: 1-Bit Stochastic Gradient Descent and Application to Data-Parallel Distributed Training of Speech DNNs, Interspeech 2014
[4] C. Renggli et al.: SparCML: High-Performance Sparse Communication for Machine Learning, arXiv 2018
[Figure source: ai.intel.com]
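Minimal NumPy sketches of the two tricks above, stochastic rounding [1] and top-k sparsification with local error accumulation [4]; function names and parameters are illustrative, not taken from the cited papers.

```python
import numpy as np

def stochastic_round(v, step=2.0 ** -8):
    """Stochastic rounding: round down or up to the quantization grid with a
    probability proportional to proximity, so the expected value equals v."""
    scaled = v / step
    low = np.floor(scaled)
    return step * (low + (np.random.random(v.shape) < (scaled - low)))

def topk_sparsify(grad, residual, k):
    """Top-k sparsification: send only the k largest-magnitude entries of
    (grad + residual); accumulate everything unsent locally for the next step."""
    acc = grad + residual
    idx = np.argsort(np.abs(acc))[-k:]
    sparse = np.zeros_like(acc)
    sparse[idx] = acc[idx]                 # what gets communicated
    return sparse, acc - sparse            # unsent remainder stays local

grad = np.random.randn(1000)
sparse, residual = topk_sparsify(grad, np.zeros(1000), k=10)
print(np.count_nonzero(sparse), stochastic_round(np.array([0.1, -0.2])))
```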
SparCML – Quantized sparse allreduce for decentral updates
[Diagram: sparse allreduce combining the sparse update vectors $\nabla x_1, \nabla x_2, \nabla x_3, \nabla x_4$]
[Plot: MNIST test accuracy]
Hyperparameter and Architecture search
[Figures: Reinforcement Learning [1]; Evolutionary Algorithms [4]]
▪ Meta-optimization of hyper-parameters (momentum) and DNN architecture
▪ Using Reinforcement Learning [1] (explore/exploit different configurations) ▪ Genetic Algorithms with modified (specialized) mutations [2] ▪ Particle Swarm Optimization [3] and other meta-heuristics (a toy evolutionary-search sketch follows the references below)
[1] M. Jaderberg et al.: Population Based Training of Neural Networks, arXiv 2017 [2] E. Real et al.: Regularized Evolution for Image Classifier Architecture Search, arXiv 2018 [3] P. R. Lorenzo et al.: Hyper-parameter Selection in Deep Neural Networks Using Parallel Particle Swarm Optimization, GECCO’17 [4] H. Liu et al.: Hierarchical Representations for Efficient Architecture Search, ICLR’18
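A toy sketch of evolutionary hyperparameter search in the spirit of [2]/[4] (illustrative only): the objective below is a stand-in for validation accuracy, and the mutation scheme is arbitrary.

```python
import random

def evolve_hyperparams(evaluate, generations=10, population=8):
    """Toy evolutionary search over two hyperparameters (learning rate, momentum):
    keep the better half of the population, refill it with mutated copies."""
    pop = [{"lr": 10 ** random.uniform(-4, -1), "momentum": random.uniform(0.5, 0.99)}
           for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=evaluate, reverse=True)          # best (highest score) first
        survivors = pop[: population // 2]
        children = [{"lr": p["lr"] * 10 ** random.uniform(-0.3, 0.3),
                     "momentum": min(0.999, max(0.0, p["momentum"] + random.uniform(-0.05, 0.05)))}
                    for p in survivors]                # mutated copies
        pop = survivors + children
    return pop[0]

# Stand-in objective: in practice this would train a model and return validation accuracy
score = lambda h: -(h["lr"] - 0.01) ** 2 - (h["momentum"] - 0.9) ** 2
print(evolve_hyperparams(score))
```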
Outlook
▪ Full details in the survey (60 pages)
▪ Detailed analysis
▪ Additional content:
  ▪ Unsupervised (GAN/autoencoders)
  ▪ Recurrent (RNN/LSTM)
▪ Call to action to the HPC and ML/DL communities to join forces!
  ▪ It’s already happening on the tool basis
  ▪ Need more joint events!
Deadline soon! https://www.arxiv.org/abs/1802.09941