Quantized Decentralized Stochastic Learning over Directed Graphs


SLIDE 1

Quantized Decentralized Stochastic Learning over Directed Graphs

Hossein Taheri¹

Joint work with Aryan Mokhtari², Hamed Hassani³, and Ramtin Pedarsani¹

¹University of California, Santa Barbara  ²University of Texas at Austin  ³University of Pennsylvania

Thirty-seventh International Conference on Machine Learning (ICML), 2020

SLIDES 2-4

Decentralized Optimization

Decentralized stochastic learning involves multiple agents (nodes) that collect data and want to learn an ML model collaboratively. Applications include federated learning, multi-agent robotic systems, sensor networks, etc. In many cases, communication links are asymmetric due to failures and bottlenecks, and communication is done over a directed graph [Tsianos et al. 2012, Nedic et al. 2014, Assran et al. 2020].

SLIDE 5

This Talk

• Link failure: nodes communicate over a directed graph.
• High communication cost: nodes communicate compressed information $Q(x)$, where $Q : \mathbb{R}^d \to \mathbb{R}^d$ is a compression operator.

SLIDES 6-8

Introduction: Push-sum Algorithm

Decentralized optimization over directed graphs with exact communication:

$$x_i(t+1) = \sum_{j=1}^{n} w_{ij}\, x_j(t) - \alpha(t)\, \nabla f_i\big(z_i(t)\big)$$

$$y_i(t+1) = \sum_{j=1}^{n} w_{ij}\, y_j(t)$$

$$z_i(t+1) = \frac{x_i(t+1)}{y_i(t+1)}$$

[Nedic et al. 2014] prove that for convex, Lipschitz objectives and $\alpha(t) = O(1/\sqrt{T})$,

$$f\big(\bar{z}_i(T)\big) - f^\star = O(1/\sqrt{T}), \qquad \text{where } \bar{z}_i(T) = \frac{1}{T} \sum_{t=1}^{T} z_i(t).$$

How can we incorporate quantized message exchange into this setting?
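To make the recursion concrete, here is a minimal NumPy sketch of one synchronous push-sum iteration. The array layout, the helper name `push_sum_step`, and the per-node gradient callables `grads` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def push_sum_step(X, Y, W, grads, alpha):
    """One exact-communication push-sum step (sketch).

    X:     (n, d) array; row i holds x_i(t)
    Y:     (n,)   push-sum weights y_i(t), initialized to ones
    W:     (n, n) column-stochastic mixing matrix (w_ij > 0 iff j sends to i)
    grads: list of n callables; grads[i](z) returns the gradient of f_i at z
    alpha: step size alpha(t)
    """
    n = X.shape[0]
    Z = X / Y[:, None]                                 # z_i(t) = x_i(t) / y_i(t)
    G = np.stack([grads[i](Z[i]) for i in range(n)])
    X_next = W @ X - alpha * G                         # x_i(t+1) = sum_j w_ij x_j(t) - alpha(t) grad f_i(z_i(t))
    Y_next = W @ Y                                     # y_i(t+1) = sum_j w_ij y_j(t)
    return X_next, Y_next
```

The matrix form simulates all $n$ nodes at once; in a real network each node computes only its own row of these updates from its in-neighbors' messages.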

SLIDES 9-11

Proposed Algorithm: Quantized Push-sum

We propose the quantized Push-sum algorithm for stochastic optimization. Each node $i$ maintains, in addition to $x_i$ and $y_i$, a copy $\hat{x}_i$ of its own state that is replicated at all of its out-neighbors. At iteration $t$, node $i$ performs:

$$q_i(t) = Q\big(x_i(t) - \hat{x}_i(t)\big)$$

for all nodes $k \in \mathcal{N}_i^{\text{out}}$ and $j \in \mathcal{N}_i^{\text{in}}$ do
    send $q_i(t)$ and $y_i(t)$ to $k$, and receive $q_j(t)$ and $y_j(t)$ from $j$
    $\hat{x}_j(t+1) = \hat{x}_j(t) + q_j(t)$
end for

$$v_i(t+1) = x_i(t) - \hat{x}_i(t+1) + \sum_{j \in \mathcal{N}_i^{\text{in}}} w_{ij}\, \hat{x}_j(t+1)$$

$$y_i(t+1) = \sum_{j \in \mathcal{N}_i^{\text{in}}} w_{ij}\, y_j(t)$$

$$z_i(t+1) = \frac{v_i(t+1)}{y_i(t+1)}, \qquad x_i(t+1) = v_i(t+1) - \alpha(t+1)\, \nabla F_i\big(z_i(t+1)\big)$$

(Since $W_{ii} > 0$, node $i$ belongs to its own in-neighborhood, so the sums include the $j = i$ term; when $Q$ is the identity, $\hat{x}_j = x_j$ and the updates reduce to exact push-sum.)

• $\hat{x}_j(t)$ is stored in all out-neighbors of node $j$
• $\hat{x}_j(t) \to x_j(t)$, and therefore $q_j(t) \to 0$ (similar to [Koloskova et al. 2018])
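The following is a minimal sketch of one synchronous iteration of these updates, simulating all nodes centrally. The function and variable names (`quantized_push_sum_step`, `X_hat`, etc.) are illustrative, and `Q` is any operator satisfying the contraction assumption introduced below.

```python
import numpy as np

def quantized_push_sum_step(X, X_hat, Y, W, stoch_grads, alpha_next, Q):
    """One synchronous step of the quantized push-sum updates above (sketch).

    X:           (n, d) local iterates x_i(t)
    X_hat:       (n, d) shared copies x_hat_i(t), replicated at out-neighbors
    Y:           (n,)   push-sum weights y_i(t)
    W:           (n, n) column-stochastic mixing matrix with W_ii > 0
    stoch_grads: list of n callables; stoch_grads[i](z) returns a stochastic
                 gradient of f_i at z
    alpha_next:  step size alpha(t+1)
    Q:           compression operator R^d -> R^d, applied per node
    """
    n = X.shape[0]
    Qmsg = np.stack([Q(X[i] - X_hat[i]) for i in range(n)])   # q_i(t)
    X_hat_next = X_hat + Qmsg               # x_hat_j(t+1) = x_hat_j(t) + q_j(t)
    V = X - X_hat_next + W @ X_hat_next     # v_i(t+1); row i of W sums over in-neighbors (incl. i)
    Y_next = W @ Y                          # y_i(t+1)
    Z = V / Y_next[:, None]                 # z_i(t+1) = v_i(t+1) / y_i(t+1)
    G = np.stack([stoch_grads[i](Z[i]) for i in range(n)])
    X_next = V - alpha_next * G             # x_i(t+1)
    return X_next, X_hat_next, Y_next, Z
```

In a real deployment each node stores only the copies $\hat{x}_j$ of itself and its in-neighbors, so only the compressed $q_j(t)$ (plus the scalar $y_j(t)$) crosses each link.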

SLIDES 12-17

Assumptions

Assumptions on graph and connectivity:

• The graph is strongly connected, and $W_{ij} \geq 0$, $W_{ii} > 0$ for all $i, j \in [n]$.
• Note that this results in $\|W^t - \phi \mathbf{1}'\| \leq C \lambda^t$ for all $t \geq 1$, where $\phi \in \mathbb{R}^n$ and $0 < \lambda < 1$.

Assumptions on local objectives:

• Lipschitz local gradients: $\|\nabla f_i(y) - \nabla f_i(x)\| \leq L\, \|y - x\|$ for all $x, y \in \mathbb{R}^d$.
• Bounded stochastic gradients: $\mathbb{E}_{\zeta_i \sim \mathcal{D}_i}\, \|\nabla F_i(x, \zeta_i)\|^2 \leq D^2$ for all $x \in \mathbb{R}^d$.
• Bounded variance: $\mathbb{E}_{\zeta_i \sim \mathcal{D}_i}\, \|\nabla F_i(x, \zeta_i) - \nabla f_i(x)\|^2 \leq \sigma^2$ for all $x \in \mathbb{R}^d$.
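The geometric mixing bound can be sanity-checked numerically for any concrete $W$. Below is a small sketch on a toy directed ring (topology and sizes are illustrative): $\phi$ is the eigenvector of $W$ for eigenvalue 1, normalized so $\mathbf{1}'\phi = 1$, and the spectral-norm error $\|W^t - \phi \mathbf{1}'\|$ should decay like $C\lambda^t$.

```python
import numpy as np

# Toy directed ring with self-loops (column-stochastic), n = 5
n = 5
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5           # node i keeps half its mass
    W[(i + 1) % n, i] = 0.5 # and pushes half to the next node

# phi: eigenvector of W for eigenvalue 1, normalized so that 1' phi = 1
vals, vecs = np.linalg.eig(W)
phi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
phi /= phi.sum()

Wt = np.eye(n)
for t in range(1, 20):
    Wt = W @ Wt
    err = np.linalg.norm(Wt - np.outer(phi, np.ones(n)), 2)
    print(t, err)           # decays geometrically, consistent with C * lam**t
```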

SLIDE 18

Assumptions

Assumption on the quantization function: $Q : \mathbb{R}^d \to \mathbb{R}^d$ satisfies, for all $x \in \mathbb{R}^d$,

$$\mathbb{E}_Q\, \|Q(x) - x\|^2 \leq \omega^2\, \|x\|^2, \qquad (1)$$

where $0 \leq \omega < 1$.
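As a concrete example (a standard operator from the compression literature, not necessarily the one used in the paper's experiments), top-k sparsification satisfies (1) deterministically with $\omega^2 = 1 - k/d$:

```python
import numpy as np

def top_k(x: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest-magnitude entries of x and zero out the rest.

    Deterministically satisfies ||Q(x) - x||^2 <= (1 - k/d) ||x||^2,
    i.e. contraction property (1) with omega^2 = 1 - k/d < 1 for any k >= 1.
    """
    d = x.size
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), d - k)[-k:]   # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out
```

Sending only $k$ index-value pairs instead of $d$ full-precision coordinates is the source of the communication savings.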

SLIDES 19-22

Convergence Results (Convex Objectives)

Define $\gamma := \|W - I\|_2$ and

$$C(\lambda, \gamma) := \frac{1}{6\left(1 + \frac{6C^2}{(1-\lambda)^2}\right)(1 + \gamma^2)}.$$

Theorem 1. Assume the local objectives $f_i$ are convex for all $i \in [n]$. By choosing $\omega \leq C(\lambda, \gamma)$ and $\alpha = \frac{\sqrt{n}}{8L\sqrt{T}}$, for all $T \geq 1$ it holds that

$$\mathbb{E}\left[ f\!\left( \frac{1}{T} \sum_{t=1}^{T} z_i(t+1) \right) \right] - f^\star = O\!\left( \frac{1}{\sqrt{nT}} \right).$$

• The time average of the local parameters $z_i$ converges to the exact solution!
• The convergence rate is the same as for undirected graphs with exact communication (e.g., [Yuan et al. 2016]).
• The error is proportional to $1/\sqrt{n}$, i.e., the rate improves with the number of nodes (linear speedup).

slide-22
SLIDE 22

Convergence Results (Convex objectives)

Define γ := W − I2 and C(λ, γ) :=

1

  • 6(1+

6C2 (1−λ)2 )(1+γ2)

Theorem 1 Assume local objectives fi are convex for all i ∈ [n]. By choosing ω ≤ C(λ, γ) and α =

√n 8L √ T , for all T ≥ 1, it holds that,

E f

  • 1

T

T

  • t=1

zi(t + 1)

  • − f ⋆ = O
  • 1

√ nT

  • Time average of local parameters zi converges to the exact

solution! The convergence rate is the same as the case of undirected graphs with exact communication (e.g. [Yuan et al. 2016]) Error is proportional to 1/√n

22 / 30

slide-23
SLIDE 23

Convergence Results (Non-Convex objectives)

Theorem 2 Let ω ≤ C(λ, γ) and α =

√n L √ T . Then after sufficiently large

number of iterations, (T ≥ 4n), it holds that 1 T

T

  • t=1

E

  • ∇f
  • 1

n

n

  • i=1

xi(t)

  • 2

= O

  • 1

√ nT

  • 23 / 30
slide-24
SLIDE 24

Convergence Results (Non-Convex objectives)

Theorem 2 Let ω ≤ C(λ, γ) and α =

√n L √ T . Then after sufficiently large

number of iterations, (T ≥ 4n), it holds that 1 T

T

  • t=1

E

  • ∇f
  • 1

n

n

  • i=1

xi(t)

  • 2

= O

  • 1

√ nT

  • Average of local parameters xi(t) converges a stationary

point!

24 / 30

slide-25
SLIDE 25

Convergence Results (Non-Convex objectives)

Theorem 2 Let ω ≤ C(λ, γ) and α =

√n L √ T . Then after sufficiently large

number of iterations, (T ≥ 4n), it holds that 1 T

T

  • t=1

E

  • ∇f
  • 1

n

n

  • i=1

xi(t)

  • 2

= O

  • 1

√ nT

  • Average of local parameters xi(t) converges a stationary

point! Again, the convergence rate is the same as the case of undirected graphs with exact communication(e.g. [Lian et al. 2017])

25 / 30

slide-26
SLIDE 26

Convergence Results (Non-Convex objectives)

Theorem 2 Let ω ≤ C(λ, γ) and α =

√n L √ T . Then after sufficiently large

number of iterations, (T ≥ 4n), it holds that 1 T

T

  • t=1

E

  • ∇f
  • 1

n

n

  • i=1

xi(t)

  • 2

= O

  • 1

√ nT

  • Average of local parameters xi(t) converges a stationary

point! Again, the convergence rate is the same as the case of undirected graphs with exact communication(e.g. [Lian et al. 2017]) Error is proportional to 1/√n

26 / 30

SLIDE 27

Numerical Experiments

Decentralized least squares:

$$f(x) = \frac{1}{2nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \big\| x - \zeta_i^j \big\|^2,$$

with dataset size $m = 100$, mini-batch size 1, dimension $d = 256$, and $n = 10$ nodes.

[Figure: convergence of quantized vs. exact push-sum; axis data not recoverable from the transcript.]

Quantization yields a 5x speedup in communication time.
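Tying the pieces together, here is an end-to-end sketch of this experiment, reusing the `quantized_push_sum_step` and `top_k` sketches above. The ring topology, the weights, the smoothness constant, and the compression level are illustrative choices, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d, T = 10, 100, 256, 2000
data = rng.normal(size=(n, m, d))               # synthetic stand-in for zeta_i^j

# Directed ring with self-loops; columns sum to 1 (column-stochastic),
# as push-sum requires: node i keeps half its mass and pushes half forward.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[(i + 1) % n, i] = 0.5

X = np.zeros((n, d))
X_hat = np.zeros((n, d))
Y = np.ones(n)
L_smooth = 1.0                                  # each f_i is 1-smooth for this quadratic
alpha = np.sqrt(n) / (8 * L_smooth * np.sqrt(T))

def make_sgrad(i):
    def g(z):
        j = rng.integers(m)                     # mini-batch size 1
        return z - data[i, j]                   # gradient of 0.5 ||z - zeta_i^j||^2
    return g

grads = [make_sgrad(i) for i in range(n)]
Qk = lambda v: top_k(v, k=d // 16)              # send ~1/16 of the coordinates

for t in range(T):
    X, X_hat, Y, Z = quantized_push_sum_step(X, X_hat, Y, W, grads, alpha, Qk)

# f is minimized at the mean of all data points; check the distance to it
print(np.linalg.norm(X.mean(axis=0) - data.mean(axis=(0, 1))))
```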

SLIDE 28

Numerical Experiments

Neural network with one hidden layer of 10 hidden units; mini-batch size = 10 (left) and 100 (right), $n = 10$.

[Figure (a): MNIST dataset. Figure (b): CIFAR-10 dataset. Plot data not recoverable from the transcript.]

Quantization again yields a 5x speedup in communication time.

SLIDES 29-30

Conclusion

We proposed the quantized push-sum algorithm for collaborative optimization over directed graphs. Despite communicating only compressed messages, the proposed algorithm matches the convergence rates of the vanilla (exact-communication) push-sum protocol.

Interesting future directions: communication-efficient algorithms for collaborative optimization with "asynchrony" or "periodic averaging".