
SLIDE 1

Numerically Stable Binary Gradient Coding

Neophytos Charalambides, Hessam Mahdavifar, Alfred Hero

Department of Electrical Engineering and Computer Science, University of Michigan

June, 2020

1 / 21

SLIDE 2

Outline for section 1

Introduction and Motivation
Gradient Coding
Problem Setup
Binary Scheme
Allocation to Heterogeneous Workers

2 / 21

SLIDE 3

Issues and Motivation

Introduction and Motivation

Machine Learning Today: Curse of Dimensionality
◮ Large datasets: many samples
◮ Complex datasets: large dimension
◮ Problems become intractable

Use distributed methods
◮ Distribute smaller computation assignments
◮ Multiple servers complete various tasks

Drawbacks of Distributed Synchronous Computations
◮ Requires all servers to respond: communication overhead
◮ What if stragglers are present?
◮ Stragglers: servers with delays or non-responsive

3 / 21

SLIDE 4

Gradient Coding [1]

Introduction and Motivation

1. Speed up distributed computation (gradient methods)
2. Mitigate stragglers

[1] R. Tandon et al. “Gradient Coding: Avoiding Stragglers in Synchronous Gradient Descent”. In: stat 1050 (2017), p. 8.

4 / 21

SLIDE 5

Benefits of our Binary Scheme

Introduction and Motivation

Few schemes deal with exact recovery
Common issues with current exact recovery schemes

1. construct and search through a decoding matrix $A_T \in \mathbb{R}^{\binom{n}{s} \times n}$
2. storage issue, and further delay
3. work over $\mathbb{R}$ and $\mathbb{C}$: further numerical instability
4. have a strict assumption that (s + 1) | n

Our scheme

1. faster online decoding
2. only deal with {0, 1} encodings — view as “task assignments”
3. ... this makes encoding and decoding numerically stable
4. works for any pair s, n
5. ... extend our construction to work for heterogeneous workers also

5 / 21

SLIDE 6

Outline for section 2

Introduction and Motivation
Gradient Coding
Problem Setup
Binary Scheme
Allocation to Heterogeneous Workers

6 / 21

SLIDE 7

Distributed Gradient Descent

Gradient Coding

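◮ Dataset $D = \{(x_i, y_i)\}_{i=1}^{N} \subset \mathbb{R}^p \times \mathbb{R}$, or $X \in \mathbb{R}^{N \times p}$; $y \in \mathbb{R}^N$

◮ Partition $D = \bigcup_{j=1}^{k} D_j$, s.t. $D_i \cap D_j = \emptyset$ for $i \neq j$ and $|D_j| = N/k$

◮ Partial gradients $g_j$: gradient on $D_j$

◮ Minimize the loss $L(D; \theta) = \sum_{j=1}^{k} \ell(D_j; \theta)$

◮ Gradient descent updates: $\theta^{(t+1)} = \theta^{(t)} - \alpha_t g^{(t)}$

◮ $g^{(t)} = \nabla_\theta L(D; \theta^{(t)}) = \sum_{j=1}^{k} \nabla_\theta \ell(D_j; \theta^{(t)}) = \sum_{j=1}^{k} g_j^{(t)}$

◮ additive structure allows $g^{(t)}$ to be computed in parallel!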
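The additive structure can be checked numerically. The following is a minimal sketch, assuming a squared-error loss and random toy data (neither is specified on the slide); the names `grad`, `parts`, and the sizes `N`, `p`, `k` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, k = 12, 3, 4                      # toy sizes (assumed); k divides N
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)
theta = rng.standard_normal(p)

def grad(Xj, yj, theta):
    """Gradient of the squared-error loss ||Xj @ theta - yj||^2 w.r.t. theta."""
    return 2.0 * Xj.T @ (Xj @ theta - yj)

# Disjoint, equal-size partitions D_1, ..., D_k, stored as index sets of size N/k.
parts = np.array_split(np.arange(N), k)

# Partial gradient g_j, computed only from the samples in D_j.
partial = [grad(X[idx], y[idx], theta) for idx in parts]

g_full = grad(X, y, theta)              # gradient on the whole dataset
g_sum = np.sum(partial, axis=0)         # g_1 + ... + g_k
print(np.allclose(g_full, g_sum))
```

Each entry of `partial` could be computed on a different worker, since it touches only its own partition.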

7 / 21

SLIDE 8

Synchronous Distributed Computation

Gradient Coding

◮ Execute gradient descent distributively
◮ Need all workers to respond

Figure: Need all responses — g = g1 + g2 + g3

8 / 21

SLIDE 9

General Setup

Problem Setup

10 / 21

SLIDE 11

Encoding matrix

Problem Setup

◮ Rows: workers $\{W_i\}_{i=1}^{n}$

◮ $b_i$ = encoding vector for $W_i$
◮ Columns: partitions $\{D_j\}_{j=1}^{k}$

1. nonzero entries: assigned partitions
2. redundancy in assigned $D_j$'s

◮ Stragglers ≡ erasing rows of B
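To make the row-erasure view concrete, here is a small sketch with a hypothetical 3 × 3 binary encoding matrix (n = k = 3, s = 1) and random stand-in gradients. The decoding vector is found with a generic least-squares solve, which is an assumption made for illustration and not the combinatorial decoder of the binary scheme.

```python
import numpy as np

# Hypothetical binary encoding matrix: row i lists the partitions assigned to W_i.
B = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 0, 1]])

rng = np.random.default_rng(1)
G = rng.standard_normal((3, 5))   # row j is a stand-in partial gradient g_j in R^5
g = G.sum(axis=0)                 # full gradient g = g_1 + g_2 + g_3

sent = B @ G                      # row i: sum of the partial gradients assigned to W_i

def recover(received):
    """Combine the received rows with a vector a satisfying a^T B_I = 1^T."""
    idx = sorted(received)
    B_I = B[idx]                  # stragglers correspond to erased rows of B
    a_I, *_ = np.linalg.lstsq(B_I.T, np.ones(B.shape[1]), rcond=None)
    assert np.allclose(a_I @ B_I, 1.0), "all-ones vector not in the row span"
    return a_I @ sent[idx]

# Every possible pair of responding workers (one straggler each time).
results = {frozenset(I): recover(I) for I in [(0, 1), (0, 2), (1, 2)]}
print(all(np.allclose(v, g) for v in results.values()))
```

Whichever single worker is erased, the remaining two rows still combine to the all-ones vector, so the full gradient is recovered.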

11 / 21

SLIDE 12

Example of our Binary Scheme

Binary Scheme

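n = k = 11, s = 3 ⟹ r ≡ n ≡ 3 mod (s + 1)
r congruence classes of workers for B1, and (s + 1 − r) for B2

$B_1 = \begin{pmatrix}
1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix} \in \{0, 1\}^{9 \times 11}$

$B_2 = \begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 1
\end{pmatrix} \in \{0, 1\}^{2 \times 11}$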

13 / 21

SLIDE 14

Example — Encoding and Decoding

Binary Scheme

Decoding: only take received workers of the same color (congruence class)
Example: $a_{\{2,6,10\}}^T B = 1_{1 \times 11}$

$B = \begin{pmatrix}
1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix}$

$a_I \in \{\, e_1 + e_5 + e_9,\; e_2 + e_6 + e_{10},\; e_3 + e_7 + e_{11},\; e_4 + e_8 \,\}$, where $e_i$ is the i-th standard basis vector of $\mathbb{R}^{11}$
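The example can be checked numerically. In the sketch below, the supports of the rows of B are an assumption: they are reconstructed from the construction rules given later in the deck (loads of 4, 4, 3 for the three-worker classes, and 6, 5 for the two-worker class), since the zero entries are not legible in the slide itself. Worker and column indices are 1-based to match the slide.

```python
import numpy as np

n = k = 11
B = np.zeros((n, k), dtype=int)

# Assumed supports: worker -> (first column, last column), 1-based and inclusive.
support = {1: (1, 4), 2: (1, 4), 3: (1, 4), 4: (1, 6),
           5: (5, 8), 6: (5, 8), 7: (5, 8), 8: (7, 11),
           9: (9, 11), 10: (9, 11), 11: (9, 11)}
for w, (lo, hi) in support.items():
    B[w - 1, lo - 1:hi] = 1

def a_vec(workers):
    """0/1 decoding vector that selects the given (1-based) workers."""
    a = np.zeros(n, dtype=int)
    a[[w - 1 for w in workers]] = 1
    return a

# Workers grouped by index modulo s + 1 = 4.
classes = [(1, 5, 9), (2, 6, 10), (3, 7, 11), (4, 8)]
products = [a_vec(c) @ B for c in classes]

print(B.sum())                   # total number of nonzeros, k * (s + 1)
print(a_vec((2, 6, 10)) @ B)     # the slide's example: all-ones row vector
```

Summing the rows of any one class gives the all-ones vector, so the master only adds the received vectors of one complete class; no real-valued coefficients are involved.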

14 / 21

SLIDE 15

Main Idea of Our Binary Scheme

Binary Scheme

◮ Have B as sparse as possible ⟹ nnzr(B) = k · (s + 1)
◮ Work with congruence classes (mod s + 1)

◮ superposition of rows of each class results in $1_{1 \times k}$

◮ Allocate tasks s.t. $\|b_i\|_0 \simeq \|b_j\|_0$ for all i, j ∈ {1, · · · , n}, while satisfying the above two constraints
◮ Formally, construct B that is a solution to

$\min_{B \in \mathbb{N}^{n \times k}} \sum_{i=1}^{n} \bigl|\, \|b_i\|_0 - (s + 1) \cdot k / n \,\bigr|$ s.t. nnzr(B) = k · (s + 1)

◮ Intuition: B is close to being block diagonal
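A short derivation of why k · (s + 1) is the smallest possible number of nonzeros, offered as a sketch. It assumes the decoder must recover every partial gradient under any s stragglers, and writes nnzr(B) for the number of nonzero entries of B.

```latex
% If partition D_j were held by at most s workers, those workers could all
% straggle and g_j would be lost. Hence every column of B needs at least
% s+1 nonzero entries:
\[
  \operatorname{nnzr}(B)
  \;=\; \sum_{j=1}^{k} \bigl|\{\, i : B_{ij} \neq 0 \,\}\bigr|
  \;\ge\; \sum_{j=1}^{k} (s+1)
  \;=\; k\,(s+1).
\]
% When equality holds, the average load per worker is
\[
  \frac{1}{n} \sum_{i=1}^{n} \|b_i\|_0 \;=\; \frac{(s+1)\,k}{n},
\]
% which is the target value in the objective above. For example,
% n = k = 11 and s = 3 give 44 nonzeros and an average load of 4.
```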

15 / 21

SLIDE 16

Construction and Decoding

Binary Scheme

◮ Congruence classes $C_1 = \{[i]\}_{i=0}^{r-1}$ and $C_2 = \{[i]\}_{i=r}^{s}$:

1. r ≡ n mod (s + 1)
2. respectively identically
3. within each C1, C2, cardinalities do not differ by more than one
4. construct B1 and B2

◮ B = aggregation of B1 and B2
◮ Decoding: by the pigeonhole principle, for any f = n − s workers, all workers of at least one congruence class are present
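The pigeonhole claim can be verified exhaustively for a small case. This is a sketch under two assumptions: workers are indexed 0, ..., n − 1, and worker w belongs to class w mod (s + 1). The function name `complete_classes` is illustrative.

```python
from itertools import combinations

def complete_classes(n, s, received):
    """Residues i mod (s+1) whose workers {i, i+(s+1), ...} all responded."""
    received = set(received)
    return [i for i in range(s + 1)
            if all(w in received for w in range(i, n, s + 1))]

n, s = 11, 3

# Try every set of s stragglers and record the fewest complete classes left.
worst = min(
    len(complete_classes(n, s, set(range(n)) - set(stragglers)))
    for stragglers in combinations(range(n), s)
)
print(worst)
```

Since s stragglers can touch at most s of the s + 1 classes, `worst` never drops below 1.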

16 / 21

SLIDE 17

Larger Example: n = k = 165 and s = 15 ⟹ r = 5

Binary Scheme

Do not want a lot of redundancy — close to block diagonal

17 / 21

SLIDE 18

Outline for section 3

Introduction and Motivation
Gradient Coding
Problem Setup
Binary Scheme
Allocation to Heterogeneous Workers

18 / 21

SLIDE 19

Setup a Linear System

Allocation to Heterogeneous workers

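◮ Assume two groups of different machines $T_1$, $T_2$, s.t. $t_i = \mathbb{E}[\text{time for } T_i \text{ to compute } g_j]$ and $t_1 \neq t_2$
◮ Goal: want the same expected completion time for each worker
◮ Let $|J_{T_i}|$ = # of partitions allocated to each of $T_i$'s workers
◮ Let $|T_i| = \tau_i$ and $\tau_1 = \frac{\alpha}{\beta} \cdot \tau_2$

Solve the linear system:

1. $t_1 \cdot |J_{T_1}| = t_2 \cdot |J_{T_2}|$
2. $|J_{T_1}| \cdot \tau_1 + |J_{T_2}| \cdot \tau_2 = (s + 1) \cdot k$
3. $\tau_2 = \frac{\beta}{\alpha} \cdot \tau_1$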
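With the group sizes fixed, the first two equations form a 2 × 2 linear system in the per-worker loads, which can be solved in closed form. A minimal sketch follows; the function name and the numbers are made up for illustration, and the solution is not guaranteed to be an integer, so some rounding step (not shown) would be needed in general.

```python
from fractions import Fraction

def partitions_per_worker(t1, t2, tau1, tau2, s, k):
    """Solve t1*J1 = t2*J2 and J1*tau1 + J2*tau2 = (s+1)*k for (J1, J2)."""
    t1, t2 = Fraction(t1), Fraction(t2)
    # Substitute J1 = (t2/t1) * J2 into the second equation and solve for J2.
    J2 = Fraction((s + 1) * k) / (tau1 * t2 / t1 + tau2)
    J1 = J2 * t2 / t1
    return J1, J2

# Hypothetical numbers: 4 fast workers (t1 = 1), 8 slow workers (t2 = 2).
J1, J2 = partitions_per_worker(t1=1, t2=2, tau1=4, tau2=8, s=3, k=12)
print(J1, J2)   # each fast worker gets twice as many partitions as a slow one
```

Exact rational arithmetic is used so that non-integer solutions are visible rather than hidden by floating-point rounding.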

19 / 21

SLIDE 20

Main Takeaways of Our Scheme

◮ Gave a simple gradient coding scheme
◮ Faster online decoding
◮ Numerically stable in encoding and decoding
◮ Works for any pair s, n
◮ Extended it to accommodate heterogeneous workers also

20 / 21

SLIDE 21

Thank you for your attention!

SLIDE 22

Outline for section 4

Additional Slides
Details of the constructions
Explicit Algorithms

22 / 21

SLIDE 23

Idea Behind Binary Scheme

Details of the constructions

◮ When (s + 1) | n and k = n: B is block diagonal
◮ assign to each worker ℓ = n/(s + 1) partitions in a repeated sense

◮ For (s + 1) ∤ n, each worker in blocks of (s + 1) rows corresponds to a distinct congruence class (c.c.) mod (s + 1)
◮ When any f workers send their computations, at least one congruence class is met in every block (pigeonhole)
◮ ∃ i ∈ Z/(s + 1) s.t. i + j(s + 1) ∈ I, for all j = 0, 1, · · · , ℓ − 1

◮ these received workers “always form a coset”
◮ Decoding: select any such i, and sum the vectors received from the workers of the c.c. i: $a^T = \sum_{j=0}^{\ell-1} e_{i+j(s+1)}$
◮ Want “even” number of assignments (homogeneous servers)

23 / 21

SLIDE 24

Binary Scheme when (s + 1) ∤ n

Details of the constructions

◮ Determine the integer parameters

◮ n = ℓ · (s + 1) + r, 0 ≤ r < s + 1
◮ r = t · ℓ + q, 0 ≤ q < ℓ
◮ n = λ · (ℓ + 1) + r̃, 0 ≤ r̃ < ℓ + 1

◮ Define: $C_1 := \{[i]_{s+1}\}_{i=0}^{r-1}$ and $C_2 := \{[i]_{s+1}\}_{i=r}^{s}$

◮ workers of C1 lie in all (ℓ + 1) blocks, and workers of C2 lie in the first ℓ

◮ C1 load: {s + 1, s} if ℓ + r > s, o.w. {λ + 1, λ}
◮ C2 load: {s + t + 2, s + t + 1} if q > 0, o.w. all have s + t + 1
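The three divisions and the resulting loads can be computed directly. The sketch below uses Python's built-in `divmod`; the function name and the dictionary keys are illustrative, not notation from the slides.

```python
def scheme_parameters(n, s):
    """Integer parameters and per-class loads for n workers and s stragglers (s < n)."""
    l, r = divmod(n, s + 1)          # n = l*(s+1) + r,      0 <= r < s+1
    t, q = divmod(r, l)              # r = t*l + q,          0 <= q < l
    lam, r_tilde = divmod(n, l + 1)  # n = lam*(l+1) + r~,   0 <= r~ < l+1
    c1_loads = (s + 1, s) if l + r > s else (lam + 1, lam)
    c2_loads = (s + t + 2, s + t + 1) if q > 0 else (s + t + 1,)
    return {"l": l, "r": r, "t": t, "q": q, "lam": lam, "r_tilde": r_tilde,
            "c1_loads": c1_loads, "c2_loads": c2_loads}

small = scheme_parameters(11, 3)     # the n = k = 11, s = 3 example
large = scheme_parameters(165, 15)   # the n = k = 165, s = 15 example
print(small)
print(large)
```

For n = 165 and s = 15 this reproduces r = 5 from the larger example, and ℓ + r = s puts the C1 classes in the {λ + 1, λ} case.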

24 / 21

SLIDE 25

Encoding C1

Explicit Algorithms

Algorithm 1: Determining B̃_C1

Input: number of workers n and stragglers s, where s < n, both positive integers
Output: encoding matrix B̃_C1 ∈ {0, 1}^{n×n}    ⊲ for simplicity, we assume n = k
B̃_C1 ← 0_{n×n}
use the division algorithm to get the parameters:
    n = ℓ · (s + 1) + r,    r = t · ℓ + q,    n = λ · (ℓ + 1) + r̃
denote the set of classes C1 = {[i]_{s+1}}_{i=0}^{r−1}
for i ∈ C1 do
    if ℓ + r > s then
        for j = 1 to ℓ + r − s do
            B̃_C1[(j − 1)(s + 1) + i, (j − 1)(s + 1) + 1 : j(s + 1)] = 1_{s+1}
        end
        for j = ℓ + r − s + 1 to ℓ + 1 do
            B̃_C1[(j − 1)(s + 1) + i, (j − 1)s + (ℓ + r − s) + 1 : (j − 1)s + ℓ + r] = 1_s
        end
    else if ℓ + r ≤ s then
        for j = 1 to r̃ do
            B̃_C1[(j − 1)(s + 1) + i, (j − 1)(λ + 1) + 1 : j(λ + 1)] = 1_{λ+1}
        end
        for j = r̃ + 1 to ℓ + 1 do
            B̃_C1[(j − 1)(s + 1) + i, (j − 1)λ + r̃ + 1 : (j − 1)λ + r̃ + λ] = 1_λ
        end
    end
end
return B̃_C1

25 / 21

SLIDE 26

Encoding C2

Explicit Algorithms

Algorithm 2: Determining B̃_C2

Input: number of workers n and stragglers s, where s < n, both positive integers
Output: encoding matrix B̃_C2 ∈ {0, 1}^{n×n}    ⊲ for simplicity, we assume n = k
B̃_C2 ← 0_{n×n}
use the division algorithm to get the parameters:
    n = ℓ · (s + 1) + r,    r = t · ℓ + q,    n = λ · (ℓ + 1) + r̃
denote the set of classes C2 = {[i]_{s+1}}_{i=r}^{s}
for i ∈ C2 do
    if q = 0 then
        for j = 1 to ℓ do
            B̃_C2[(j − 1)(s + 1) + i, (j − 1)(s + t + 1) + 1 : j(s + t + 1)] = 1_{s+t+1}
        end
    else if q > 0 then
        for j = 1 to q do
            B̃_C2[(j − 1)(s + 1) + i, (j − 1)(s + t + 2) + 1 : j(s + t + 2)] = 1_{s+t+2}
        end
        for j = q + 1 to ℓ do
            B̃_C2[(j − 1)(s + 1) + i, (j − 1)(s + t + 1) + q + 1 : j(s + t + 1) + q] = 1_{s+t+1}
        end
    end
end
return B̃_C2

26 / 21
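Algorithms 1 and 2 can be combined into one routine. The following is a sketch rather than the authors' implementation: it assumes n = k, uses 0-based indices (worker w belongs to class w mod (s + 1)), and assigns each class a run of consecutive column blocks whose lengths are the loads stated for C1 and C2. The name `build_B` is hypothetical.

```python
import numpy as np

def build_B(n, s):
    """Binary encoding matrix for n workers, n partitions and s < n stragglers."""
    l, r = divmod(n, s + 1)
    t, q = divmod(r, l)
    lam, r_tilde = divmod(n, l + 1)
    B = np.zeros((n, n), dtype=int)
    for i in range(s + 1):
        if i < r:                       # class in C1: l + 1 workers
            if l + r > s:
                big, small, n_big = s + 1, s, l + r - s
            else:
                big, small, n_big = lam + 1, lam, r_tilde
            loads = [big] * n_big + [small] * (l + 1 - n_big)
        else:                           # class in C2: l workers
            loads = [s + t + 2] * q + [s + t + 1] * (l - q)
        col = 0
        for j, load in enumerate(loads):
            # j-th worker of class i takes the next `load` consecutive partitions.
            B[j * (s + 1) + i, col:col + load] = 1
            col += load
        assert col == n                 # each class covers every partition once
    return B

B = build_B(11, 3)
print(B.sum())                          # k * (s + 1) nonzeros
print(B[[1, 5, 9]].sum(axis=0))         # rows of one class sum to the all-ones vector
```

Because every class covers each partition exactly once, each column of B has exactly s + 1 ones, which is the sparsity the scheme aims for.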

SLIDE 27

Decoding Vector

Explicit Algorithms

Algorithm 3: Determining a_I

Input: received indicator-vector rec_I
Output: decoding vector a_I
if r = 0 then
    for i = 0 to s do
        if (rec_I)_i = 1 then
            l ← i
            if supp(a_l) ⊆ supp(rec_I) then
                a ← a_l
                break
            end
        end
    end
else if r > 0 then
    run the above for-loop for i = r to s and then for i = 0 to r − 1
end
return a_I ← a
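A possible rendering of Algorithm 3, given as a sketch under the assumption that workers are indexed from 0 and that worker w belongs to class w mod (s + 1). As in the pseudocode, the classes i = r, ..., s are tried before i = 0, ..., r − 1. The function name `decoding_vector` is illustrative.

```python
import numpy as np

def decoding_vector(n, s, rec):
    """Return a 0/1 vector selecting one fully received congruence class."""
    rec = np.asarray(rec, dtype=bool)
    r = n % (s + 1)
    # Search order: classes r, ..., s first, then 0, ..., r-1 (all classes if r = 0).
    order = list(range(r, s + 1)) + list(range(r))
    for i in order:
        members = np.arange(i, n, s + 1)      # workers i, i+(s+1), i+2(s+1), ...
        if rec[members].all():                # supp(a_i) is contained in supp(rec)
            a = np.zeros(n, dtype=int)
            a[members] = 1
            return a
    raise ValueError("more than s stragglers: no complete congruence class received")

n, s = 11, 3
rec = np.ones(n, dtype=int)
rec[[0, 5, 10]] = 0                           # three stragglers, in classes 0, 1, 2
a = decoding_vector(n, s, rec)
print(np.flatnonzero(a))                      # workers used for decoding
```

The search costs at most s + 1 membership checks, and the result has only 0/1 entries, so decoding is a plain sum of received vectors.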

27 / 21
