Quantized Decentralized Stochastic Learning over Directed Graphs


SLIDE 1

Quantized Decentralized Stochastic Learning over Directed Graphs

Hossein Taheri¹

Joint work with Aryan Mokhtari², Hamed Hassani³, and Ramtin Pedarsani¹

¹University of California, Santa Barbara  ²University of Texas at Austin  ³University of Pennsylvania

Thirty-seventh International Conference on Machine Learning (ICML), 2020

SLIDES 2-4

Decentralized Optimization

Decentralized stochastic learning involves multiple agents (nodes) that collect data and want to learn an ML model collaboratively. Applications include federated learning, multi-agent robotic systems, sensor networks, etc. In many cases, communication links are asymmetric due to failures and bottlenecks, and communication is done over a directed graph [Tsianos et al. 2012, Nedic et al. 2014, Assran et al. 2020].

SLIDE 5

This Talk

• Link failure: nodes communicate over a directed graph.
• High communication cost: nodes communicate compressed information $Q(x)$, where $Q : \mathbb{R}^d \to \mathbb{R}^d$ is a compression operator.

SLIDES 6-8

Introduction: Push-sum Algorithm

Decentralized optimization over directed graphs with exact communication:

$$x_i(t+1) = \sum_{j=1}^{n} w_{ij}\, x_j(t) - \alpha(t)\, \nabla f_i\big(z_i(t)\big)$$

$$y_i(t+1) = \sum_{j=1}^{n} w_{ij}\, y_j(t)$$

$$z_i(t+1) = \frac{x_i(t+1)}{y_i(t+1)}$$

[Nedic et al. 2014] prove that for convex, Lipschitz objectives and $\alpha(t) = O(1/\sqrt{T})$,

$$f\big(\bar{z}_i(T)\big) - f^\star = O(1/\sqrt{T}), \qquad \text{where } \bar{z}_i(T) = \frac{1}{T} \sum_{t=1}^{T} z_i(t).$$

How can we incorporate quantized message exchange into this setting?
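To make the recursion concrete, here is a minimal NumPy sketch of one synchronous push-sum iteration. The array layout, the helper name `push_sum_step`, and the per-node gradient callables `grads` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def push_sum_step(X, Y, W, grads, alpha):
    """One exact-communication push-sum step (sketch).

    X:     (n, d) array; row i holds x_i(t)
    Y:     (n,)   push-sum weights y_i(t), initialized to ones
    W:     (n, n) column-stochastic mixing matrix (w_ij > 0 iff j sends to i)
    grads: list of n callables; grads[i](z) returns the gradient of f_i at z
    alpha: step size alpha(t)
    """
    n = X.shape[0]
    Z = X / Y[:, None]                                 # z_i(t) = x_i(t) / y_i(t)
    G = np.stack([grads[i](Z[i]) for i in range(n)])
    X_next = W @ X - alpha * G                         # x_i(t+1) = sum_j w_ij x_j(t) - alpha(t) grad f_i(z_i(t))
    Y_next = W @ Y                                     # y_i(t+1) = sum_j w_ij y_j(t)
    return X_next, Y_next
```

The matrix form simulates all $n$ nodes at once; in a real network each node computes only its own row of these updates from its in-neighbors' messages.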

SLIDES 9-11

Proposed Algorithm: Quantized Push-sum

We propose the quantized Push-sum algorithm for stochastic optimization. Each node $i$ maintains, in addition to $x_i$ and $y_i$, a copy $\hat{x}_i$ of its own state that is replicated at all of its out-neighbors. At iteration $t$, node $i$ performs:

$$q_i(t) = Q\big(x_i(t) - \hat{x}_i(t)\big)$$

for all nodes $k \in \mathcal{N}_i^{\text{out}}$ and $j \in \mathcal{N}_i^{\text{in}}$ do
    send $q_i(t)$ and $y_i(t)$ to $k$, and receive $q_j(t)$ and $y_j(t)$ from $j$
    $\hat{x}_j(t+1) = \hat{x}_j(t) + q_j(t)$
end for

$$v_i(t+1) = x_i(t) - \hat{x}_i(t+1) + \sum_{j \in \mathcal{N}_i^{\text{in}}} w_{ij}\, \hat{x}_j(t+1)$$

$$y_i(t+1) = \sum_{j \in \mathcal{N}_i^{\text{in}}} w_{ij}\, y_j(t)$$

$$z_i(t+1) = \frac{v_i(t+1)}{y_i(t+1)}, \qquad x_i(t+1) = v_i(t+1) - \alpha(t+1)\, \nabla F_i\big(z_i(t+1)\big)$$

(Since $W_{ii} > 0$, node $i$ belongs to its own in-neighborhood, so the sums include the $j = i$ term; when $Q$ is the identity, $\hat{x}_j = x_j$ and the updates reduce to exact push-sum.)

• $\hat{x}_j(t)$ is stored in all out-neighbors of node $j$
• $\hat{x}_j(t) \to x_j(t)$, and therefore $q_j(t) \to 0$ (similar to [Koloskova et al. 2018])
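The following is a minimal sketch of one synchronous iteration of these updates, simulating all nodes centrally. The function and variable names (`quantized_push_sum_step`, `X_hat`, etc.) are illustrative, and `Q` is any operator satisfying the contraction assumption introduced below.

```python
import numpy as np

def quantized_push_sum_step(X, X_hat, Y, W, stoch_grads, alpha_next, Q):
    """One synchronous step of the quantized push-sum updates above (sketch).

    X:           (n, d) local iterates x_i(t)
    X_hat:       (n, d) shared copies x_hat_i(t), replicated at out-neighbors
    Y:           (n,)   push-sum weights y_i(t)
    W:           (n, n) column-stochastic mixing matrix with W_ii > 0
    stoch_grads: list of n callables; stoch_grads[i](z) returns a stochastic
                 gradient of f_i at z
    alpha_next:  step size alpha(t+1)
    Q:           compression operator R^d -> R^d, applied per node
    """
    n = X.shape[0]
    Qmsg = np.stack([Q(X[i] - X_hat[i]) for i in range(n)])   # q_i(t)
    X_hat_next = X_hat + Qmsg               # x_hat_j(t+1) = x_hat_j(t) + q_j(t)
    V = X - X_hat_next + W @ X_hat_next     # v_i(t+1); row i of W sums over in-neighbors (incl. i)
    Y_next = W @ Y                          # y_i(t+1)
    Z = V / Y_next[:, None]                 # z_i(t+1) = v_i(t+1) / y_i(t+1)
    G = np.stack([stoch_grads[i](Z[i]) for i in range(n)])
    X_next = V - alpha_next * G             # x_i(t+1)
    return X_next, X_hat_next, Y_next, Z
```

In a real deployment each node stores only the copies $\hat{x}_j$ of itself and its in-neighbors, so only the compressed $q_j(t)$ (plus the scalar $y_j(t)$) crosses each link.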

SLIDES 12-17

Assumptions

Assumptions on graph and connectivity:

• The graph is strongly connected, and $W_{ij} \geq 0$, $W_{ii} > 0$ for all $i, j \in [n]$.
• Note that this results in $\|W^t - \phi \mathbf{1}'\| \leq C \lambda^t$ for all $t \geq 1$, where $\phi \in \mathbb{R}^n$ and $0 < \lambda < 1$.

Assumptions on local objectives:

• Lipschitz local gradients: $\|\nabla f_i(y) - \nabla f_i(x)\| \leq L\, \|y - x\|$ for all $x, y \in \mathbb{R}^d$.
• Bounded stochastic gradients: $\mathbb{E}_{\zeta_i \sim \mathcal{D}_i}\, \|\nabla F_i(x, \zeta_i)\|^2 \leq D^2$ for all $x \in \mathbb{R}^d$.
• Bounded variance: $\mathbb{E}_{\zeta_i \sim \mathcal{D}_i}\, \|\nabla F_i(x, \zeta_i) - \nabla f_i(x)\|^2 \leq \sigma^2$ for all $x \in \mathbb{R}^d$.
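The geometric mixing bound can be sanity-checked numerically for any concrete $W$. Below is a small sketch on a toy directed ring (topology and sizes are illustrative): $\phi$ is the eigenvector of $W$ for eigenvalue 1, normalized so $\mathbf{1}'\phi = 1$, and the spectral-norm error $\|W^t - \phi \mathbf{1}'\|$ should decay like $C\lambda^t$.

```python
import numpy as np

# Toy directed ring with self-loops (column-stochastic), n = 5
n = 5
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5           # node i keeps half its mass
    W[(i + 1) % n, i] = 0.5 # and pushes half to the next node

# phi: eigenvector of W for eigenvalue 1, normalized so that 1' phi = 1
vals, vecs = np.linalg.eig(W)
phi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
phi /= phi.sum()

Wt = np.eye(n)
for t in range(1, 20):
    Wt = W @ Wt
    err = np.linalg.norm(Wt - np.outer(phi, np.ones(n)), 2)
    print(t, err)           # decays geometrically, consistent with C * lam**t
```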

SLIDE 18

Assumptions

Assumption on the quantization function: $Q : \mathbb{R}^d \to \mathbb{R}^d$ satisfies, for all $x \in \mathbb{R}^d$,

$$\mathbb{E}_Q\, \|Q(x) - x\|^2 \leq \omega^2\, \|x\|^2, \qquad (1)$$

where $0 \leq \omega < 1$.
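As a concrete example (a standard operator from the compression literature, not necessarily the one used in the paper's experiments), top-k sparsification satisfies (1) deterministically with $\omega^2 = 1 - k/d$:

```python
import numpy as np

def top_k(x: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest-magnitude entries of x and zero out the rest.

    Deterministically satisfies ||Q(x) - x||^2 <= (1 - k/d) ||x||^2,
    i.e. contraction property (1) with omega^2 = 1 - k/d < 1 for any k >= 1.
    """
    d = x.size
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), d - k)[-k:]   # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out
```

Sending only $k$ index-value pairs instead of $d$ full-precision coordinates is the source of the communication savings.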

SLIDES 19-22

Convergence Results (Convex Objectives)

Define $\gamma := \|W - I\|_2$ and

$$C(\lambda, \gamma) := \frac{1}{6\left(1 + \frac{6C^2}{(1-\lambda)^2}\right)(1 + \gamma^2)}.$$

Theorem 1. Assume the local objectives $f_i$ are convex for all $i \in [n]$. By choosing $\omega \leq C(\lambda, \gamma)$ and $\alpha = \frac{\sqrt{n}}{8L\sqrt{T}}$, for all $T \geq 1$ it holds that

$$\mathbb{E}\left[ f\!\left( \frac{1}{T} \sum_{t=1}^{T} z_i(t+1) \right) \right] - f^\star = O\!\left( \frac{1}{\sqrt{nT}} \right).$$

• The time average of the local parameters $z_i$ converges to the exact solution!
• The convergence rate is the same as for undirected graphs with exact communication (e.g., [Yuan et al. 2016]).
• The error is proportional to $1/\sqrt{n}$, i.e., the rate improves with the number of nodes (linear speedup).

slide-22
SLIDE 22

Convergence Results (Convex objectives)

Define γ := W − I2 and C(λ, γ) :=

1

  • 6(1+

6C2 (1−λ)2 )(1+γ2)

Theorem 1 Assume local objectives fi are convex for all i ∈ [n]. By choosing ω ≤ C(λ, γ) and α =

√n 8L √ T , for all T ≥ 1, it holds that,

E f

  • 1

T

T

  • t=1

zi(t + 1)

  • − f ⋆ = O
  • 1

√ nT

  • Time average of local parameters zi converges to the exact

solution! The convergence rate is the same as the case of undirected graphs with exact communication (e.g. [Yuan et al. 2016]) Error is proportional to 1/√n

22 / 30

slide-23
SLIDE 23

Convergence Results (Non-Convex objectives)

Theorem 2 Let ω ≤ C(λ, γ) and α =

√n L √ T . Then after sufficiently large

number of iterations, (T ≥ 4n), it holds that 1 T

T

  • t=1

E

  • ∇f
  • 1

n

n

  • i=1

xi(t)

  • 2

= O

  • 1

√ nT

  • 23 / 30
slide-24
SLIDE 24

Convergence Results (Non-Convex objectives)

Theorem 2 Let ω ≤ C(λ, γ) and α =

√n L √ T . Then after sufficiently large

number of iterations, (T ≥ 4n), it holds that 1 T

T

  • t=1

E

  • ∇f
  • 1

n

n

  • i=1

xi(t)

  • 2

= O

  • 1

√ nT

  • Average of local parameters xi(t) converges a stationary

point!

24 / 30

slide-25
SLIDE 25

Convergence Results (Non-Convex objectives)

Theorem 2 Let ω ≤ C(λ, γ) and α =

√n L √ T . Then after sufficiently large

number of iterations, (T ≥ 4n), it holds that 1 T

T

  • t=1

E

  • ∇f
  • 1

n

n

  • i=1

xi(t)

  • 2

= O

  • 1

√ nT

  • Average of local parameters xi(t) converges a stationary

point! Again, the convergence rate is the same as the case of undirected graphs with exact communication(e.g. [Lian et al. 2017])

25 / 30

slide-26
SLIDE 26

Convergence Results (Non-Convex objectives)

Theorem 2 Let ω ≤ C(λ, γ) and α =

√n L √ T . Then after sufficiently large

number of iterations, (T ≥ 4n), it holds that 1 T

T

  • t=1

E

  • ∇f
  • 1

n

n

  • i=1

xi(t)

  • 2

= O

  • 1

√ nT

  • Average of local parameters xi(t) converges a stationary

point! Again, the convergence rate is the same as the case of undirected graphs with exact communication(e.g. [Lian et al. 2017]) Error is proportional to 1/√n

26 / 30

SLIDE 27

Numerical Experiments

Decentralized least squares:

$$f(x) = \frac{1}{2nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \big\| x - \zeta_i^j \big\|^2,$$

with dataset size $m = 100$, mini-batch size 1, dimension $d = 256$, and $n = 10$ nodes.

[Figure: convergence of quantized vs. exact push-sum; axis data not recoverable from the transcript.]

Quantization yields a 5x speedup in communication time.
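Tying the pieces together, here is an end-to-end sketch of this experiment, reusing the `quantized_push_sum_step` and `top_k` sketches above. The ring topology, the weights, the smoothness constant, and the compression level are illustrative choices, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d, T = 10, 100, 256, 2000
data = rng.normal(size=(n, m, d))               # synthetic stand-in for zeta_i^j

# Directed ring with self-loops; columns sum to 1 (column-stochastic),
# as push-sum requires: node i keeps half its mass and pushes half forward.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[(i + 1) % n, i] = 0.5

X = np.zeros((n, d))
X_hat = np.zeros((n, d))
Y = np.ones(n)
L_smooth = 1.0                                  # each f_i is 1-smooth for this quadratic
alpha = np.sqrt(n) / (8 * L_smooth * np.sqrt(T))

def make_sgrad(i):
    def g(z):
        j = rng.integers(m)                     # mini-batch size 1
        return z - data[i, j]                   # gradient of 0.5 ||z - zeta_i^j||^2
    return g

grads = [make_sgrad(i) for i in range(n)]
Qk = lambda v: top_k(v, k=d // 16)              # send ~1/16 of the coordinates

for t in range(T):
    X, X_hat, Y, Z = quantized_push_sum_step(X, X_hat, Y, W, grads, alpha, Qk)

# f is minimized at the mean of all data points; check the distance to it
print(np.linalg.norm(X.mean(axis=0) - data.mean(axis=(0, 1))))
```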

SLIDE 28

Numerical Experiments

Neural network with one hidden layer of 10 hidden units; mini-batch size = 10 (left) and 100 (right), $n = 10$.

[Figure (a): MNIST dataset. Figure (b): CIFAR-10 dataset. Plot data not recoverable from the transcript.]

Quantization again yields a 5x speedup in communication time.

SLIDES 29-30

Conclusion

We proposed the quantized push-sum algorithm for collaborative optimization over directed graphs. Despite communicating only compressed messages, the proposed algorithm matches the convergence rates of the vanilla (exact-communication) push-sum protocol.

Interesting future directions: communication-efficient algorithms for collaborative optimization with "asynchrony" or "periodic averaging".