SLIDE 1

Distributed Learning

Amir H. Payberah
payberah@kth.se 10/12/2019

SLIDE 2

The Course Web Page

https://id2223kth.github.io

SLIDE 3

Where Are We?

SLIDE 5

A few Words about CPU and GPU

SLIDE 6

[https://www.tripsavvy.com/how-to-get-from-copenhagen-to-stockholm-1626275]

SLIDE 7

Ferrari or Truck?

◮ Pick up your partner?
◮ Moving the furniture?

SLIDE 12

CPU vs GPU

SLIDE 15

Do We Need GPU for Deep Learning?

SLIDE 16

◮ Which components of a DNN would require intense hardware resources?
◮ A few candidates are:
  • Preprocessing input data
  • Training the model
  • Storing the trained model
  • Deployment of the model

SLIDE 23

Training a Model

◮ Forward pass: the input is passed through the DNN and an output is generated.
◮ Backward pass: the weights are updated on the basis of the error we get in the forward pass.
◮ Both of these operations are essentially matrix multiplications.
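To make this concrete, here is a minimal NumPy sketch (not from the slides; the layer size, batch size, and learning rate are made up) of one training step for a single linear layer, showing that both passes reduce to matrix multiplications.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 10))   # mini-batch of 32 inputs with 10 features
y = rng.normal(size=(32, 1))    # targets
W = rng.normal(size=(10, 1))    # weights of a single linear layer

# Forward pass: a matrix multiplication produces the output, then the error.
out = X @ W
error = out - y

# Backward pass: the weight gradient is again a matrix multiplication.
grad_W = X.T @ error / len(X)
W -= 0.1 * grad_W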

SLIDE 25

How to Train a Model Faster?

◮ The computationally intensive part of a neural network is made up of multiple matrix multiplications.
◮ How can we make it faster?
◮ Do these operations at the same time, instead of doing them one after the other.
◮ This is, in a nutshell, why we use a GPU instead of a CPU for training a neural network.
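As a rough illustration of why GPUs help here, the following sketch (assuming TensorFlow 2.x; the matrix size is arbitrary, and the GPU timing only runs if a GPU is visible) times one large matrix multiplication on each device.

import time
import tensorflow as tf

x = tf.random.normal((4096, 4096))

def time_matmul(device):
    # Run one large matrix multiplication on the given device and time it.
    with tf.device(device):
        start = time.time()
        y = tf.matmul(x, x)
        _ = y.numpy()  # force the computation to finish before stopping the clock
    return time.time() - start

print("CPU:", time_matmul("/cpu:0"))
if tf.config.list_physical_devices("GPU"):
    print("GPU:", time_matmul("/gpu:0"))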

SLIDE 28

Placing Operations and Variables on Devices (1/4)

◮ For now, let's assume we run everything on a single machine.

SLIDE 29

Placing Operations and Variables on Devices (2/4)

◮ Place the data preprocessing operations on CPUs, and the NN operations on GPUs.
◮ Adding more CPU RAM to a machine is simple and cheap, whereas GPU RAM is an expensive and limited resource.
  • If a variable is not needed in the next few training steps, it should probably be placed on the CPU (e.g., datasets generally belong on the CPU).
◮ GPUs usually have a fairly limited communication bandwidth, so it is important to avoid unnecessary data transfers in and out of the GPUs.

SLIDE 33

Placing Operations and Variables on Devices (3/4)

◮ By default, all variables/operations are placed on the first GPU: /gpu:0.
◮ Variables/operations that do not have a GPU kernel are placed on the CPU: /cpu:0.
◮ A kernel is a variable or operation's implementation for a specific data and device type.
  • For example, there is a GPU kernel for the float32 tf.matmul() operation, but there is no GPU kernel for int32 tf.matmul() (only a CPU kernel).
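The kernel point can be made observable with device-placement logging. The sketch below assumes TensorFlow 2.x with its default soft device placement and a visible GPU; the behaviour of the int32 case follows the claim on this slide.

import tensorflow as tf

tf.debugging.set_log_device_placement(True)

# float32 matmul has a GPU kernel, so with a GPU available it is logged on GPU:0.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
print(tf.matmul(a, a))

# int32 matmul has no GPU kernel (per the slide), so placement falls back to CPU:0.
b = tf.constant([[1, 2], [3, 4]], dtype=tf.int32)
print(tf.matmul(b, b))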

SLIDE 37

Placing Operations and Variables on Devices (4/4)

◮ TensorFlow automatically decides which device to execute an operation on and copies tensors to that device.
◮ However, TensorFlow operations can be explicitly placed on specific devices using the tf.device context manager.

SLIDE 38

Manual Device Placement (1/3)

◮ Use with tf.device to create a device context.
◮ All the operations within that context will run on the same designated device.

tf.debugging.set_log_device_placement(True)

a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
print(c)

Output:
Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)

SLIDE 41

Manual Device Placement (2/3)

tf.debugging.set_log_device_placement(True)

with tf.device('/cpu:0'):
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

c = tf.matmul(a, b)
print(c)

Output:
Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)

◮ Here, a and b are assigned to CPU:0.
◮ Since a device was not explicitly specified for the matmul operation, it will be run on the default device GPU:0.

SLIDE 44

Manual Device Placement (3/3)

tf.debugging.set_log_device_placement(True)

with tf.device('/cpu:0'):
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    c = tf.matmul(a, b)

print(c)

Output:
Executing op MatMul in device /job:localhost/replica:0/task:0/device:CPU:0
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)

SLIDE 46

Parallel Execution Across Multiple Devices

SLIDE 47

Parallelization

◮ Train large deep learning models with huge amounts of training data.
◮ Parallelization and distribution are essential.
◮ Two main approaches to training a single model across multiple devices:
  • Model parallelization
  • Data parallelization

SLIDE 50

Model Parallelization

SLIDE 51

Model Parallelization

◮ The model is split across multiple devices.
◮ How to split it depends on the architecture of the NN.

[Mayer, R. et al., arXiv:1903.11314, 2019]

SLIDE 53

Fully Connected Model Parallelization (1/2)

◮ Place each layer on a different device.
◮ Not good: each layer needs to wait for the output of the previous layer before it can do anything.

SLIDE 55

Fully Connected Model Parallelization (2/2)

◮ Slice the model vertically.
  • E.g., the left half of each layer on one device, and the right part on another device.
◮ Slightly better: both halves of each layer can indeed work in parallel.
◮ But each half of the next layer requires the output of both halves: a lot of cross-device communication.

SLIDE 58

CNN Model Parallelization

◮ Some NNs, such as CNNs, contain layers that are only partially connected to the lower layers.
◮ This makes it easier to distribute the model across devices in an efficient way.

SLIDE 60

RNN Model Parallelization

◮ Split the NN horizontally by placing each layer on a different device.
◮ At the first step, only one device will be active.
◮ At the second step, two will be active.
◮ While the first layer is handling the second value, the second layer will be handling the output of the first layer for the first value.
◮ By the time the signal propagates to the output layer, all devices will be active simultaneously.

SLIDE 65

Data Parallelization

SLIDE 66

Data Parallelization (1/2)

◮ Replicate a whole model on every device.
◮ Train all replicas simultaneously, using a different mini-batch for each.

[Mayer, R. et al., arXiv:1903.11314, 2019]

SLIDE 67

Data Parallelization (2/2)

1. Compute the gradient of the loss function using a mini-batch on each GPU.
2. Compute the mean of the gradients by inter-GPU communication.
3. Update the model.
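The three steps can be sketched in a few lines of NumPy (purely illustrative; the two "GPUs" are simulated with two mini-batches, the model is a linear regression, and the learning rate is made up).

import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)                                                     # replicated model parameters
batches = [(rng.normal(size=(8, 3)), rng.normal(size=8)) for _ in range(2)]

def gradient(w, X, y):
    # Gradient of the mean squared error loss mean((Xw - y)^2) with respect to w.
    return 2 * X.T @ (X @ w - y) / len(y)

grads = [gradient(w, X, y) for X, y in batches]   # 1. per-replica gradients
mean_grad = np.mean(grads, axis=0)                # 2. average them (the allreduce / PS step)
w -= 0.1 * mean_grad                              # 3. apply the same update on every replica
print(w)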

SLIDE 70

Data Parallelization Design Issues

◮ System architecture: how to synchronize the parameters.
◮ Synchronization: when to synchronize the parameters.

SLIDE 72

System Architecture

SLIDE 73

System Architecture - Centralized

◮ How to aggregate gradients (compute the mean of the gradients)?
◮ How are the parameters of the different replicas synchronized?

SLIDE 74

System Architecture - Centralized

◮ Store the model parameters outside of the workers.
◮ Workers periodically report their computed parameters or parameter updates to a (set of) parameter server(s) (PSs).

SLIDE 77

System Architecture - Decentralized

◮ Mirror all the model parameters across all workers (no PS).
◮ Workers exchange parameter updates directly via an allreduce operation.

SLIDE 79

Reduce and AllReduce (1/2)

◮ Reduce: reducing a set of numbers into a smaller set of numbers via a function.
◮ E.g., sum([1, 2, 3, 4, 5]) = 15
◮ Reduce takes an array of input elements on each process and returns an array of output elements to the root process.

[https://mpitutorial.com/tutorials/mpi-reduce-and-allreduce]

SLIDE 83

Reduce and AllReduce (2/2)

◮ AllReduce stores the reduced results across all processes, rather than only at the root process.

[https://mpitutorial.com/tutorials/mpi-reduce-and-allreduce]

SLIDE 85

AllReduce Example

[Figure: arrays on each process in the initial state and after the AllReduce operation]

[https://towardsdatascience.com/visual-intuition-on-ring-allreduce-for-distributed-deep-learning-d1f34b4911da]
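A toy, single-process illustration of the difference (no MPI involved; the four "processes" are just list entries): reduce leaves the sum only at a root, while allreduce leaves the same sum on every process.

import numpy as np

process_data = [np.array([p, p + 1, p + 2]) for p in range(4)]    # one array per "process"

reduced_at_root = np.sum(process_data, axis=0)                    # Reduce: result only at the root
after_allreduce = [reduced_at_root.copy() for _ in process_data]  # AllReduce: result on every process

print(reduced_at_root)
print(after_allreduce)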

SLIDE 86

AllReduce Implementation

◮ All-to-all allreduce
◮ Master-worker allreduce
◮ Tree allreduce
◮ Round-robin allreduce
◮ Butterfly allreduce
◮ Ring allreduce

SLIDE 87

AllReduce Implementation - All-to-All AllReduce

◮ Each process sends its array of data to every other process.
◮ Apply the reduction operation on each process.
◮ Too many unnecessary messages.

[https://towardsdatascience.com/visual-intuition-on-ring-allreduce-for-distributed-deep-learning-d1f34b4911da]

SLIDE 89

AllReduce Implementation - Master-Worker AllReduce

◮ Select one process as the master and gather all arrays into the master.
◮ Perform the reduction operation locally on the master.
◮ Distribute the result to the other processes.
◮ The master becomes a bottleneck (not scalable).

[https://towardsdatascience.com/visual-intuition-on-ring-allreduce-for-distributed-deep-learning-d1f34b4911da]

SLIDE 91

AllReduce Implementation - Other Implementations

◮ Some try to minimize bandwidth.
◮ Some try to minimize latency.

[Zhao H. et al., arXiv:1312.3020, 2013]

SLIDE 92

AllReduce Implementation - Ring-AllReduce (1/6)

◮ The Ring-AllReduce has two phases:
  1. First, the share-reduce phase
  2. Then, the share-only phase

SLIDE 93

AllReduce Implementation - Ring-AllReduce (2/6)

◮ In the share-reduce phase, each process p sends data to the process (p+1) % m.
  • m is the number of processes, and % is the modulo operator.
◮ The array of data on each process is divided into m chunks (m=4 here).
◮ Each one of these chunks will be indexed by i going forward.

[https://towardsdatascience.com/visual-intuition-on-ring-allreduce-for-distributed-deep-learning-d1f34b4911da]

SLIDE 96

AllReduce Implementation - Ring-AllReduce (3/6)

◮ In the first share-reduce step, process A sends a0 to process B.
◮ Process B sends b1 to process C, etc.

[https://towardsdatascience.com/visual-intuition-on-ring-allreduce-for-distributed-deep-learning-d1f34b4911da]

SLIDE 98

AllReduce Implementation - Ring-AllReduce (4/6)

◮ When each process receives the data from the previous process, it applies the reduce operator (e.g., sum or mean).
  • The reduce operator should be associative and commutative.
◮ It then proceeds to send the result to the next process in the ring.

[https://towardsdatascience.com/visual-intuition-on-ring-allreduce-for-distributed-deep-learning-d1f34b4911da]

SLIDE 101

AllReduce Implementation - Ring-AllReduce (5/6)

◮ The share-reduce phase finishes when each process holds the complete reduction of chunk i.
◮ At this point each process holds a part of the end result.

[https://towardsdatascience.com/visual-intuition-on-ring-allreduce-for-distributed-deep-learning-d1f34b4911da]

SLIDE 103

AllReduce Implementation - Ring-AllReduce (6/6)

◮ The share-only phase is the same process of sharing the data in a ring-like fashion, without applying the reduce operation.
◮ This consolidates the result of each chunk in every process.

[https://towardsdatascience.com/visual-intuition-on-ring-allreduce-for-distributed-deep-learning-d1f34b4911da]
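Putting the two phases together, here is a small single-threaded simulation of ring-allreduce with sum (illustrative only; a real implementation runs the sends of each step in parallel across processes and overlaps communication with computation).

import numpy as np

def ring_allreduce_sum(arrays):
    m = len(arrays)
    # Each "process" splits its array into m chunks.
    chunks = [list(np.array_split(a, m)) for a in arrays]

    # Share-reduce phase: in step s, process p sends chunk (p - s) % m to process
    # (p + 1) % m, which adds it to its own copy of that chunk.
    for s in range(m - 1):
        sends = [chunks[p][(p - s) % m].copy() for p in range(m)]
        for p in range(m):
            chunks[(p + 1) % m][(p - s) % m] = chunks[(p + 1) % m][(p - s) % m] + sends[p]

    # Share-only phase: circulate the fully reduced chunks around the ring,
    # overwriting instead of reducing.
    for s in range(m - 1):
        sends = [chunks[p][(p + 1 - s) % m].copy() for p in range(m)]
        for p in range(m):
            chunks[(p + 1) % m][(p + 1 - s) % m] = sends[p]

    return [np.concatenate(c) for c in chunks]

data = [np.arange(8) + 10 * p for p in range(4)]   # four processes, eight elements each
result = ring_allreduce_sum(data)
print(result[0])                                   # every process ends with the same sum
assert all(np.array_equal(r, sum(data)) for r in result)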

SLIDE 105

Master-Worker AllReduce vs. Ring-AllReduce

◮ N: number of elements, m: number of processes

◮ Master-Worker AllReduce
  • First each process sends N elements to the master: N × (m − 1) messages.
  • Then the master sends the results back to the processes: another N × (m − 1) messages.
  • Total network traffic is 2(N × (m − 1)), which is proportional to m.

◮ Ring-AllReduce
  • In the share-reduce phase each process sends N/m elements, and it does so m − 1 times: (N/m) × (m − 1) messages.
  • In the share-only phase, each process sends the result for the chunk it calculated: another (N/m) × (m − 1) messages.
  • Total network traffic is 2((N/m) × (m − 1)), which stays below 2N regardless of m.
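As a quick sanity check, the sketch below simply evaluates the two expressions from this slide for made-up values of N and m; the numbers have no meaning beyond showing how differently the two grow with m.

N = 1_000_000   # number of gradient elements
m = 8           # number of processes

master_worker = 2 * (N * (m - 1))       # 2(N × (m − 1)): grows linearly with m
ring = 2 * ((N / m) * (m - 1))          # 2((N/m) × (m − 1)): stays below 2N
print(master_worker, ring)              # 14000000 1750000.0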

SLIDE 114

Synchronization

SLIDE 115

Synchronization

◮ When to synchronize the parameters among the parallel workers?

SLIDE 116

Synchronization - Synchronous

◮ After each iteration (processing of a mini-batch), the workers synchronize their parameter updates.
  • Easy to reason about the model convergence.
  • The training process is prone to the straggler problem, where the slowest worker slows down all the others.

[Mayer, R. et al., arXiv:1903.11314, 2019]

SLIDE 119

Synchronization - Asynchronous

◮ Workers update their model independently from each other.
  • A worker may train on stale (delayed) parameters.
  • This makes it hard to mathematically reason about the model convergence.
  • It provides the workers flexibility in their training process, completely avoiding all straggler problems.

[Mayer, R. et al., arXiv:1903.11314, 2019]

SLIDE 123

Data Parallelization in TensorFlow

SLIDE 124

TensorFlow Distribution Strategies

◮ tf.distribute.Strategy is a TensorFlow API to distribute training.
◮ Supports both parameter server and allreduce models.

SLIDE 125

Single Server

SLIDE 126

Single Server Training - MirroredStrategy (1/2)

◮ Synchronous distributed training on multiple GPUs on one machine.
◮ One replica per GPU.
◮ The parameters of the model are mirrored across all the replicas.
◮ These parameters are kept in sync with each other by applying identical updates.
◮ The parameter updates are communicated using allreduce algorithms.

mirrored_strategy = tf.distribute.MirroredStrategy()

# to use only some of the GPUs on your machine
mirrored_strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])

SLIDE 131

Single Server Training - MirroredStrategy (2/2)

◮ There are different implementations of allreduce.
◮ You can override the cross-GPU communication:
  • tf.distribute.NcclAllReduce (the default)
  • tf.distribute.ReductionToOneDevice
  • tf.distribute.HierarchicalCopyAllReduce

mirrored_strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

SLIDE 132

Single Server Training - CentralStorageStrategy

◮ Parameters are not mirrored; instead, they are placed on the CPU.
◮ Operations are replicated across all local GPUs.
◮ Does synchronous training.

central_storage_strategy = tf.distribute.experimental.CentralStorageStrategy()

SLIDE 135

Single Server Training - Example

◮ Create a strategy, e.g., MirroredStrategy or CentralStorageStrategy.
◮ Call its scope() method to get a distribution context.
◮ Wrap the creation and compilation of the model inside that context.
◮ Call the model's fit() and predict() methods normally (outside the context).

distribution = tf.distribute.MirroredStrategy()

with distribution.scope():
    model = keras.models.Sequential([...])
    model.compile(...)

model.fit(...)
model.predict(...)
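A slightly more complete, runnable version of this pattern (assuming TensorFlow 2.x with tf.keras; the toy data, layer sizes, and hyperparameters are made up for illustration, and MirroredStrategy falls back to a single device if no GPU is available):

import numpy as np
import tensorflow as tf
from tensorflow import keras

distribution = tf.distribute.MirroredStrategy()

with distribution.scope():
    model = keras.models.Sequential([
        keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        keras.layers.Dense(1)
    ])
    model.compile(optimizer="adam", loss="mse")

x = np.random.rand(1024, 20).astype("float32")
y = np.random.rand(1024, 1).astype("float32")

model.fit(x, y, epochs=2, batch_size=64)   # the global batch is split across the replicas
print(model.predict(x[:4]))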

SLIDE 139

Multi Servers

SLIDE 140

Multi Servers Training - MultiWorkerMirroredStrategy (1/2)

◮ Very similar to MirroredStrategy.
◮ Synchronous distributed training across multiple workers, each with potentially multiple GPUs.
◮ Makes copies of all parameters of the model on each device across all workers.

multiworker_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

SLIDE 143

Multi Servers Training - MultiWorkerMirroredStrategy (2/2)

◮ Two different implementations:
  • CollectiveCommunication.RING (ring-based implementation)
  • CollectiveCommunication.NCCL (Nvidia's NCCL implementation)
◮ CollectiveCommunication.AUTO defers the choice to the runtime.
◮ The best choice of collective implementation depends upon the number and kind of GPUs, and the network interconnect in the cluster.

# ring-based collectives
multiworker_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    tf.distribute.experimental.CollectiveCommunication.RING)

SLIDE 146

Multi Servers Training - ParameterServerStrategy

◮ Supports parameter server training on multiple machines.
◮ Some machines are designated as workers and some as parameter servers.
◮ Each parameter of the model is placed on one parameter server.
◮ Computation is replicated across all GPUs of all the workers.

ps_strategy = tf.distribute.experimental.ParameterServerStrategy()

SLIDE 150

Multi Servers Training - More Details

◮ A TensorFlow cluster is a group of TensorFlow processes running in parallel.
◮ Each TF process (a.k.a. task) in the cluster has a type:
  • Worker: performs computations, usually on a machine with one or more GPUs.
  • Parameter server (ps): keeps track of parameter values; it is usually on a CPU-only machine.
◮ The set of tasks that share the same type is often called a job. For example, the worker job is the set of all workers.

SLIDE 155

Multi Servers Training - Example (1/3)

◮ Assume a cluster with 3 tasks (2 workers and 1 parameter server).

cluster_spec = tf.train.ClusterSpec({
    "worker": [
        "machine-a.example.com:2222",  # /job:worker/task:0
        "machine-b.example.com:2222"   # /job:worker/task:1
    ],
    "ps": ["machine-a.example.com:2221"]  # /job:ps/task:0
})

SLIDE 156

Multi Servers Training - Example (2/3)

◮ To start a task, you must give it the cluster spec and define its type and index (ID), e.g., worker 0.

ps0 = tf.distribute.Server(cluster_spec, job_name="ps", task_index=0)
worker0 = tf.distribute.Server(cluster_spec, job_name="worker", task_index=0)
worker1 = tf.distribute.Server(cluster_spec, job_name="worker", task_index=1)

SLIDE 159

Multi Servers Training - Example (3/3)

◮ An alternative way to specify a cluster spec is to use the TF_CONFIG environment variable before starting the program.
◮ For example, to run worker 1:

distribution = tf.distribute.experimental.ParameterServerStrategy()

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["machine-a.example.com:2222", "machine-b.example.com:2222"],
        "ps": ["machine-a.example.com:2221"]},
    "task": {"type": "worker", "index": 1}
})

with distribution.scope():
    model = keras.models.Sequential([...])
    model.compile(...)

model.fit(...)

SLIDE 160

Communication Overhead

SLIDE 161

Communication Overhead in Data Parallelization

◮ Synchronizing the model replicas in data-parallel training requires communication between workers (in allreduce), or between workers and parameter servers (in the centralized architecture).
◮ Such communication can easily become the bottleneck of the overall training process.

SLIDE 164

Approaches for Communication Efficiency

◮ Reducing the model precision
◮ Compressing the model updates
◮ Improving the communication scheduling

SLIDE 165

Reducing the Model Precision

◮ Reduce the precision of the parameters' data types, e.g., from double precision to single-precision floating point.
◮ It saves communication bandwidth when parameter updates need to be transferred over the network.
◮ It reduces the model size, which can be useful when the model is deployed on resource-constrained hardware such as GPUs.
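A tiny illustration of the bandwidth effect (nothing TensorFlow-specific; the vector length is arbitrary): the same gradient vector occupies half the bytes at each step down in precision.

import numpy as np

grad = np.random.rand(1_000_000)             # float64 by default
print(grad.nbytes)                           # 8000000 bytes
print(grad.astype(np.float32).nbytes)        # 4000000 bytes
print(grad.astype(np.float16).nbytes)        # 2000000 bytes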

SLIDE 168

Compressing the Model Updates

◮ The model updates communicated between workers, and between workers and parameter servers, can be compressed.
◮ Gradient quantization: reducing the number of bits per gradient.
◮ Gradient sparsification: communicating only important gradients that have a significant value.
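Both ideas can be sketched in a few lines (purely illustrative; practical schemes typically also keep the quantization or sparsification error and feed it back into later steps):

import numpy as np

grad = np.random.randn(10)

# Gradient quantization (sketch): keep one scale per tensor plus the sign of each
# element, i.e. roughly 1 bit per element instead of 32 or 64.
scale = np.mean(np.abs(grad))
quantized = scale * np.sign(grad)

# Gradient sparsification (sketch): send only the k entries with the largest magnitude.
k = 3
top_k = np.argsort(-np.abs(grad))[:k]
sparse_update = {int(i): float(grad[i]) for i in top_k}

print(quantized)
print(sparse_update)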

SLIDE 171

Improving the Communication Scheduling

◮ Communication patterns in data-parallel training are typically bursty, especially in synchronous systems.
  • All workers may share their updated parameters at the same time with their peer workers or parameter servers.
◮ To prevent the network bandwidth from being exceeded and communication from being delayed, the communication of the different workers can be scheduled so that it does not overlap.
  • Prioritize specific messages over others.

SLIDE 175

Summary

SLIDE 176

Summary

◮ CPU vs. GPU
◮ Parallelization
◮ Model-parallel
◮ Data-parallel
  • Parameter server vs. AllReduce
  • Synchronous vs. asynchronous
◮ Communication challenges

SLIDE 177

References

◮ Aurélien Géron, Hands-On Machine Learning (Ch. 19)
◮ Mayer, R. et al., "Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools", 2019.

SLIDE 178

Questions?