slide-1
SLIDE 1

Communication trade-offs for synchronized distributed SGD with large step size

Aymeric DIEULEVEUT

EPFL, MLO

17 November 2017. Joint work with Kumar Kshitij Patel.

1

slide-2
SLIDE 2

Outline

  • 1. Stochastic gradient descent - supervised machine learning: setting, assumptions and proof techniques
  • 2. Synchronized distributed SGD - from mini-batch averaging to model averaging
  • 3. Optimality of Local-SGD.

2

slide-3
SLIDE 3

Stochastic Gradient Descent

◮ Goal: min_{θ ∈ R^d} F(θ), given unbiased gradient estimates gn.

◮ θ⋆ := argmin_{θ ∈ R^d} F(θ).

[Figure: the optimum θ∗]

3

slide-4
SLIDE 4

Stochastic Gradient Descent

◮ Goal: min_{θ ∈ R^d} F(θ), given unbiased gradient estimates gn.

◮ θ⋆ := argmin_{θ ∈ R^d} F(θ).

◮ Key algorithm: Stochastic Gradient Descent (SGD) (Robbins and Monro, 1951):

    θk = θk−1 − ηk gk(θk−1).

◮ E[gk(θk−1) | Fk−1] = F′(θk−1) for a filtration (Fk)k≥0; θk is Fk-measurable.

[Figure: starting point θ0 and the optimum θ∗]

3

slide-5
SLIDE 5

Stochastic Gradient Descent

◮ Goal: min_{θ ∈ R^d} F(θ), given unbiased gradient estimates gn.

◮ θ⋆ := argmin_{θ ∈ R^d} F(θ).

[Figure: iterates θ0, θ1, …, θn approaching θ∗]

◮ Key algorithm: Stochastic Gradient Descent (SGD) (Robbins and Monro, 1951):

    θk = θk−1 − ηk gk(θk−1).

◮ E[gk(θk−1) | Fk−1] = F′(θk−1) for a filtration (Fk)k≥0; θk is Fk-measurable.

3
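To make the recursion concrete, here is a minimal sketch of SGD on synthetic least-squares data (the data model, constants, and names are illustrative assumptions, not from the talk):

```python
import numpy as np

# Minimal SGD sketch on synthetic least-squares data (illustrative only).
rng = np.random.default_rng(0)
d, n_steps, eta = 5, 10_000, 0.05
theta_star = rng.normal(size=d)              # unknown optimum theta*
theta = np.zeros(d)                          # theta_0

for k in range(n_steps):
    x = rng.normal(size=d)                   # fresh sample -> unbiased gradient estimate
    y = x @ theta_star + 0.1 * rng.normal()  # noisy observation
    grad = (x @ theta - y) * x               # g_k(theta_{k-1}), E[g_k | F_{k-1}] = F'(theta_{k-1})
    theta -= eta * grad                      # theta_k = theta_{k-1} - eta_k g_k(theta_{k-1})

print(np.linalg.norm(theta - theta_star))    # distance to the optimum after n_steps
```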

slide-6
SLIDE 6

Supervised Machine Learning

◮ We define the risk (generalization error) as

    R(θ) := E_ρ[ℓ(Y, ⟨θ, Φ(X)⟩)].

◮ Empirical risk (or training error):

    R̂(θ) = (1/n) Σ_{i=1}^{n} ℓ(yi, ⟨θ, Φ(xi)⟩).

4

slide-7
SLIDE 7

Supervised Machine Learning

◮ We define the risk (generalization error) as

    R(θ) := E_ρ[ℓ(Y, ⟨θ, Φ(X)⟩)].

◮ Empirical risk (or training error):

    R̂(θ) = (1/n) Σ_{i=1}^{n} ℓ(yi, ⟨θ, Φ(xi)⟩).

◮ For example, least-squares regression:

    min_{θ ∈ R^d} (1/(2n)) Σ_{i=1}^{n} (yi − ⟨θ, Φ(xi)⟩)² + µ Ω(θ),

◮ and logistic regression:

    min_{θ ∈ R^d} (1/n) Σ_{i=1}^{n} log(1 + exp(−yi ⟨θ, Φ(xi)⟩)) + µ Ω(θ).

4
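As a concrete illustration, here are stochastic gradients for these two objectives that could play the role of gk in the SGD recursion (a minimal sketch; taking Ω(θ) = ||θ||²/2 is an assumption for the demo, and x stands for the feature map Φ(xi)):

```python
import numpy as np

def lsq_grad(theta, x, y, mu=0.0):
    # Stochastic gradient of 1/2 (y - <theta, x>)^2 + mu/2 ||theta||^2 for one sample (x, y).
    return (x @ theta - y) * x + mu * theta

def logistic_grad(theta, x, y, mu=0.0):
    # Stochastic gradient of log(1 + exp(-y <theta, x>)) + mu/2 ||theta||^2, with label y in {-1, +1}.
    return -y * x / (1.0 + np.exp(y * (x @ theta))) + mu * theta
```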

slide-8
SLIDE 8

Polyak Ruppert averaging

5

slide-9
SLIDE 9

Polyak Ruppert averaging

Introduced by Polyak and Juditsky (1992) and Ruppert (1988):

    θ̄n = (1/(n + 1)) Σ_{k=0}^{n} θk.

◮ Offline averaging reduces the noise effect.

[Figure: iterates θ0, θ1, …, θn and the optimum θ∗]

6

slide-10
SLIDE 10

Polyak Ruppert averaging

Introduced by Polyak and Juditsky (1992) and Ruppert (1988):

    θ̄n = (1/(n + 1)) Σ_{k=0}^{n} θk.

[Figure: iterates θ0, θ1, …, θn and averaged iterates θ̄1, θ̄2, …, θ̄n approaching θ∗]

◮ Offline averaging reduces the noise effect.

◮ Online computing: θ̄n+1 = (1/(n + 1)) θn+1 + (n/(n + 1)) θ̄n.
6
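The online averaging recursion drops into the earlier SGD loop at no extra cost; a minimal sketch (same illustrative least-squares setup as before):

```python
import numpy as np

# SGD with online Polyak-Ruppert averaging (illustrative sketch).
rng = np.random.default_rng(0)
d, n_steps, eta = 5, 10_000, 0.05
theta_star = rng.normal(size=d)
theta = np.zeros(d)
theta_bar = np.zeros(d)                             # running average of the iterates

for k in range(1, n_steps + 1):
    x = rng.normal(size=d)
    y = x @ theta_star + 0.1 * rng.normal()
    theta -= eta * (x @ theta - y) * x              # constant ("large") step size
    theta_bar = theta / k + (k - 1) / k * theta_bar # theta_bar_k = (1/k) theta_k + ((k-1)/k) theta_bar_{k-1}

# The averaged iterate is typically much closer to theta_star than the last iterate.
print(np.linalg.norm(theta - theta_star), np.linalg.norm(theta_bar - theta_star))
```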

slide-11
SLIDE 11

Assumptions

Goal: min_θ F(θ). Recursion: θk = θk−1 − ηk gk(θk−1).

A1 [Strong convexity] The function F is strongly-convex with convexity constant µ > 0.

7

slide-12
SLIDE 12

Assumptions

Goal: min_θ F(θ). Recursion: θk = θk−1 − ηk gk(θk−1).

A1 [Strong convexity] The function F is strongly-convex with convexity constant µ > 0.

A2 [Smoothness and regularity] The function F is three times continuously differentiable with second and third uniformly bounded derivatives: sup_{θ ∈ R^d} ||F′′(θ)|| < L and sup_{θ ∈ R^d} ||F′′′(θ)|| < M. In particular, F is L-smooth.

7

slide-13
SLIDE 13

Assumptions

Goal: min_θ F(θ). Recursion: θk = θk−1 − ηk gk(θk−1).

A1 [Strong convexity] The function F is strongly-convex with convexity constant µ > 0.

A2 [Smoothness and regularity] The function F is three times continuously differentiable with second and third uniformly bounded derivatives: sup_{θ ∈ R^d} ||F′′(θ)|| < L and sup_{θ ∈ R^d} ||F′′′(θ)|| < M. In particular, F is L-smooth.

Or:

Q1 [Quadratic function] There exists a positive definite matrix Σ ∈ R^{d×d} such that F is the quadratic function θ → ||Σ^{1/2}(θ − θ⋆)||²/2.

7
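For reference, A1 and the L-smoothness implied by A2 in their standard inequality form (textbook definitions, added here only for completeness):

```latex
% Strong convexity (A1): for all \theta, \theta' \in \mathbb{R}^d,
F(\theta') \;\ge\; F(\theta) + \langle F'(\theta),\, \theta' - \theta \rangle + \tfrac{\mu}{2}\,\|\theta' - \theta\|^2 .
% L-smoothness (implied by A2):
\|F'(\theta) - F'(\theta')\| \;\le\; L\,\|\theta - \theta'\| .
```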

slide-14
SLIDE 14

Which step size would you use?

Smooth functions. Candidate step sizes: ηk ≡ η0, ηk = 1/√k, ηk = 1/(µk); function classes: convex, strongly convex, quadratic. (Table of which step size suits which class.)

8

slide-15
SLIDE 15

Classical bound: Lyapunov approach

E[||θk+1 − θ⋆||² | Fk] ≤ E[||θk − θ⋆||²] − 2ηk ⟨F′(θk), θk − θ⋆⟩ + ηk² ||gk(θk)||²

                       ≤ E[||θk − θ⋆||²] − 2ηk (1 − ηk L) ⟨F′(θk), θk − θ⋆⟩ + ηk² ||gk(θ⋆)||²

ηk (F(θk) − F(θ⋆)) ≤ (1 − ηk µ) E[||θk − θ⋆||²] − E[||θk+1 − θ⋆||² | Fk] + ηk² ||gk(θ⋆)||²

9

slide-16
SLIDE 16

Classical bound: Lyapunov approach

E[||θk+1 − θ⋆||² | Fk] ≤ E[||θk − θ⋆||²] − 2ηk ⟨F′(θk), θk − θ⋆⟩ + ηk² ||gk(θk)||²

                       ≤ E[||θk − θ⋆||²] − 2ηk (1 − ηk L) ⟨F′(θk), θk − θ⋆⟩ + ηk² ||gk(θ⋆)||²

ηk (F(θk) − F(θ⋆)) ≤ (1 − ηk µ) E[||θk − θ⋆||²] − E[||θk+1 − θ⋆||² | Fk] + ηk² ||gk(θ⋆)||²

Conclusion: with ηk = 1/(µk), telescopic sum + Jensen:

    E[F(θ̄k) − F(θ⋆)] ≤ O(1/(µk)).

9
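To fill in the telescoping step, a sketch of the standard argument (writing δk := E||θk − θ⋆||² and σ² := E||gk(θ⋆)||²; the exact constants depend on the averaging scheme):

```latex
% With \eta_k = 1/(\mu k), the last inequality above reads
\tfrac{1}{\mu k}\,\mathbb{E}\bigl[F(\theta_k) - F(\theta^\star)\bigr]
   \;\le\; \tfrac{k-1}{k}\,\delta_k - \delta_{k+1} + \tfrac{\sigma^2}{\mu^2 k^2}.
% Multiply by \mu k and sum over k = 1,\dots,n: the \delta-terms telescope,
\sum_{k=1}^{n} \mathbb{E}\bigl[F(\theta_k) - F(\theta^\star)\bigr]
   \;\le\; \mu\bigl(0\cdot\delta_1 - n\,\delta_{n+1}\bigr) + \sum_{k=1}^{n}\tfrac{\sigma^2}{\mu k}
   \;\le\; \tfrac{\sigma^2}{\mu}\,(1 + \log n).
% Divide by n and apply Jensen's inequality to the convex F:
\mathbb{E}\bigl[F(\bar\theta_n) - F(\theta^\star)\bigr] \;\le\; \tfrac{\sigma^2\,(1 + \log n)}{\mu\, n}.
```

The residual log factor can be removed with a suitable non-uniform averaging (see Lacoste-Julien et al., 2012), recovering the O(1/(µk)) rate stated on the slide.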

slide-17
SLIDE 17

Trivial case: decaying step sizes are not that great !

Consider least squares: yi = θ⋆⊤ xi + εi, with εi i.i.d. ∼ N(0, σ²).

10

slide-18
SLIDE 18

Trivial case: decaying step sizes are not that great !

Consider least squares: yi = θ⋆⊤ xi + εi, with εi i.i.d. ∼ N(0, σ²). Start with θ0 = θ⋆. Then:

    θ̄k − θ⋆ = (1/k) Σ_{i=1}^{k} M_i^k ηi εi,

for suitable matrices M_i^k. Even with a large constant step size ηi ≡ η, the CLT is enough to control that!

10

slide-19
SLIDE 19

Trivial case: decaying step sizes are not that great !

Consider least squares: yi = θ⋆⊤ xi + εi, with εi i.i.d. ∼ N(0, σ²). Start with θ0 = θ⋆. Then:

    θ̄k − θ⋆ = (1/k) Σ_{i=1}^{k} M_i^k ηi εi,

for suitable matrices M_i^k. Even with a large constant step size ηi ≡ η, the CLT is enough to control that!

Tight control is much easier on the stochastic process θk − θ⋆ than through the “Lyapunov approach”.

10

slide-20
SLIDE 20

Other proof: introduce decomposition

Original proof of averaging in Polyak and Juditsky (1992):

    ηk F′′(θ⋆)(θk−1 − θ⋆) = θk−1 − θk − ηk [gk(θk−1) − F′(θk−1)] + ηk [F′(θk−1) − F′′(θ⋆)(θk−1 − θ⋆)].

11

slide-21
SLIDE 21

Other proof: introduce decomposition

Original proof of averaging in Polyak and Juditsky (1992):

    ηk F′′(θ⋆)(θk−1 − θ⋆) = θk−1 − θk − ηk [gk(θk−1) − F′(θk−1)] + ηk [F′(θk−1) − F′′(θ⋆)(θk−1 − θ⋆)].

Thus, for ηk ≡ η:

    F′′(θ⋆)(θ̄K − θ⋆) = (θK − θ0)/(ηK) − (1/K) Σ_{k=1}^{K} [gk(θk−1) − F′(θk−1)] + (1/K) Σ_{k=1}^{K} [F′(θk−1) − F′′(θ⋆)(θk−1 − θ⋆)].

11

slide-22
SLIDE 22

Other proof: introduce decomposition

Original proof of averaging in Polyak and Juditsky (1992):

    ηk F′′(θ⋆)(θk−1 − θ⋆) = θk−1 − θk − ηk [gk(θk−1) − F′(θk−1)] + ηk [F′(θk−1) − F′′(θ⋆)(θk−1 − θ⋆)].

Thus, for ηk ≡ η:

    F′′(θ⋆)(θ̄K − θ⋆) = (θK − θ0)/(ηK) − (1/K) Σ_{k=1}^{K} [gk(θk−1) − F′(θk−1)] + (1/K) Σ_{k=1}^{K} [F′(θk−1) − F′′(θ⋆)(θk−1 − θ⋆)].

Initial condition - Noise - Non-quadratic residual

11

slide-23
SLIDE 23

Other proof: introduce decomposition

Original proof of averaging in Polyak and Juditsky (1992):

    ηk F′′(θ⋆)(θk−1 − θ⋆) = θk−1 − θk − ηk [gk(θk−1) − F′(θk−1)] + ηk [F′(θk−1) − F′′(θ⋆)(θk−1 − θ⋆)].

Thus, for ηk ≡ η:

    F′′(θ⋆)(θ̄K − θ⋆) = (θK − θ0)/(ηK) − (1/K) Σ_{k=1}^{K} [gk(θk−1) − F′(θk−1)] + (1/K) Σ_{k=1}^{K} [F′(θk−1) − F′′(θ⋆)(θk−1 − θ⋆)].

Initial condition - Noise - Non-quadratic residual

Tight control of ||F′′(θ⋆)(θ̄K − θ⋆)||.

11

slide-24
SLIDE 24

Other proof: introduce decomposition

Original proof of averaging in Polyak and Juditsky (1992):

    ηk F′′(θ⋆)(θk−1 − θ⋆) = θk−1 − θk − ηk [gk(θk−1) − F′(θk−1)] + ηk [F′(θk−1) − F′′(θ⋆)(θk−1 − θ⋆)].

Thus, for ηk ≡ η:

    F′′(θ⋆)(θ̄K − θ⋆) = (θK − θ0)/(ηK) − (1/K) Σ_{k=1}^{K} [gk(θk−1) − F′(θk−1)] + (1/K) Σ_{k=1}^{K} [F′(θk−1) − F′′(θ⋆)(θk−1 − θ⋆)].

Initial condition - Noise - Non-quadratic residual

Tight control of ||F′′(θ⋆)(θ̄K − θ⋆)||.

Correct control of the noise for smooth and strongly convex functions.

All step sizes ηn = C n^(−α) with α ∈ (1/2, 1) lead to O(n^(−1)).

LMS algorithm: constant step size → statistical optimality.

11

slide-25
SLIDE 25

Problem: dependence on µ

Possible to recover convergence in function values:

    F(θ̄K) − F(θ⋆) ≤ (L/2) ||θ̄K − θ⋆||² ≤ (L/(2µ²)) ||F′′(θ⋆)(θ̄K − θ⋆)||².

12

slide-26
SLIDE 26

Problem: dependence on µ

Possible to recover convergence in function values:

    F(θ̄K) − F(θ⋆) ≤ (L/2) ||θ̄K − θ⋆||² ≤ (L/(2µ²)) ||F′′(θ⋆)(θ̄K − θ⋆)||².

However:

◮ OK for least-squares regression (with some more work; Défossez and Bach, 2015; Dieuleveut et al., 2016; Jain et al., 2016).

◮ Possible to recover tight convergence with self-concordance (Bach, 2013).

12

slide-27
SLIDE 27

Synchronized distributed optimization

  • 1. P machines
  • 2. C: the number of communication steps (C phases)
  • 3. for t ∈ [C], worker p ∈ [P] performs Nt local steps

13

slide-28
SLIDE 28

Synchronized distributed optimization

  • 1. P machines
  • 2. C: the number of communication steps (C phases)
  • 3. for t ∈ [C], worker p ∈ [P] performs Nt local steps

For any p ∈ [P], t ∈ [C], k ∈ [Nt]:

◮ θ^t_{p,k} is the model proposed by worker p, at phase t, after k local iterations.

◮ θ^1_{p,0} = θ0.

◮ θ^t_{p,k} = θ^t_{p,k−1} − η^t_k g^t_{p,k}(θ^t_{p,k−1}).

13

slide-29
SLIDE 29

Link with classical algorithms.

Algo.    | Workers | Com. | Phases       | T (total local steps)
Local    | P       | C    | (N1, …, NC)  | P Σ_{t=1}^{C} Nt
Serial   | 1       |      | (N)          | N
P-MBA    | P       | C    | (1, …, 1)    | PC
OSA      | P       | 1    | (N1)         | N1 P

14

slide-30
SLIDE 30

Link with classical algorithms.

Algo.    | Workers | Com. | Phases       | T (total local steps)
Local    | P       | C    | (N1, …, NC)  | P Σ_{t=1}^{C} Nt
Serial   | 1       |      | (N)          | N
P-MBA    | P       | C    | (1, …, 1)    | PC
OSA      | P       | 1    | (N1)         | N1 P

One-Shot Averaging (OSA) – Mini-Batch Averaging (MBA) – Local SGD

14

slide-31
SLIDE 31

Aggregation steps:

    θ̂^t = (1/P) Σ_{p=1}^{P} θ^t_{p,Nt}.

At phase t + 1, every worker p ∈ [P] restarts from the averaged model: θ^{t+1}_{p,0} := θ̂^t.

Goal: risk of the Polyak-Ruppert averaged iterate:

    θ̄^C = (1 / (P Σ_{t=1}^{C} Nt)) Σ_{t=1}^{C} Σ_{p=1}^{P} Σ_{k=1}^{Nt} θ^t_{p,k}.

15
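Putting the pieces together, a minimal sketch of synchronized Local-SGD with model averaging on synthetic least-squares data (parameter values, the data model, and all names are illustrative assumptions, not the authors' code):

```python
import numpy as np

# Synchronized Local-SGD with model averaging (illustrative sketch).
rng = np.random.default_rng(0)
d, P, C, N, eta = 5, 4, 50, 20, 0.05   # dimension, workers, phases, local steps per phase, step size
theta_star = rng.normal(size=d)

def stochastic_grad(theta):
    # One stochastic gradient g^t_{p,k}(theta) for least squares: fresh sample, unbiased.
    x = rng.normal(size=d)
    y = x @ theta_star + 0.1 * rng.normal()
    return (x @ theta - y) * x

theta_hat = np.zeros(d)                      # averaged model, starts at theta_0
pr_sum, pr_count = np.zeros(d), 0            # accumulators for the Polyak-Ruppert average

for t in range(C):                           # C communication phases
    local = [theta_hat.copy() for _ in range(P)]   # each worker restarts from the averaged model
    for k in range(N):                       # N local steps per phase
        for p in range(P):
            local[p] -= eta * stochastic_grad(local[p])
            pr_sum += local[p]
            pr_count += 1
    theta_hat = np.mean(local, axis=0)       # aggregation: theta_hat^t = (1/P) sum_p theta^t_{p,N}

theta_bar = pr_sum / pr_count                # Polyak-Ruppert average over all local iterates
print(np.linalg.norm(theta_hat - theta_star), np.linalg.norm(theta_bar - theta_star))
```

Setting N = 1 recovers parallel mini-batch averaging (P-MBA), and C = 1 recovers one-shot averaging (OSA), matching the table of special cases above.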

slide-32
SLIDE 32

Assumptions

A3 [Oracle on the gradient] There is a filtration (H^t_k)_{(t,k) ∈ [C]×[Nt]} such that for any (t, k) ∈ [C] × [Nt] and θ ∈ R^d, g^t_{p,k+1}(θ) is an H^t_{k+1}-measurable random variable and E[g^t_{p,k+1}(θ) | H^t_k] = F′(θ).

A4 [Uniformly bounded variance] E[||g^t_{p,k}(θ^t_{p,k}) − F′(θ^t_{p,k})||²] ≤ σ²_∞.

A5 [Cocoercivity of the random gradients] For any t ∈ [C], k ∈ [Nt], p ∈ [P], g^t_{p,k} is almost surely L-co-coercive.

A6 [Finite variance at the optimal point] There exists σ ≥ 0 such that for any t ∈ [C], k ∈ [Nt], p ∈ [P], E[||g^t_{p,k}(θ⋆)||⁴] ≤ σ⁴.

We assume A4 OR A5 + A6.

16

slide-33
SLIDE 33

Error decomposition

    η^t_k F′′(θ⋆)(θ^t_{p,k−1} − θ⋆) = θ^t_{p,k−1} − θ^t_{p,k} − η^t_k [g^t_{p,k}(θ^t_{p,k−1}) − F′(θ^t_{p,k−1})] + η^t_k [F′(θ^t_{p,k−1}) − F′′(θ⋆)(θ^t_{p,k−1} − θ⋆)].

Thus:

    F′′(θ⋆)(θ̄^C − θ⋆) = (1 / (P Σ_{t=1}^{C} Nt)) Σ_{t=1}^{C} Σ_{p=1}^{P} Σ_{k=1}^{Nt} [ (θ^t_{p,k−1} − θ^t_{p,k}) / η^t_k − (g^t_{p,k}(θ^t_{p,k−1}) − F′(θ^t_{p,k−1})) + (F′(θ^t_{p,k−1}) − F′′(θ⋆)(θ^t_{p,k−1} − θ⋆)) ].

17

slide-34
SLIDE 34

Error decomposition

    η^t_k F′′(θ⋆)(θ^t_{p,k−1} − θ⋆) = θ^t_{p,k−1} − θ^t_{p,k} − η^t_k [g^t_{p,k}(θ^t_{p,k−1}) − F′(θ^t_{p,k−1})] + η^t_k [F′(θ^t_{p,k−1}) − F′′(θ⋆)(θ^t_{p,k−1} − θ⋆)].

Thus:

    F′′(θ⋆)(θ̄^C − θ⋆) = (1 / (P Σ_{t=1}^{C} Nt)) Σ_{t=1}^{C} Σ_{p=1}^{P} Σ_{k=1}^{Nt} [ (θ^t_{p,k−1} − θ^t_{p,k}) / η^t_k − (g^t_{p,k}(θ^t_{p,k−1}) − F′(θ^t_{p,k−1})) + (F′(θ^t_{p,k−1}) − F′′(θ⋆)(θ^t_{p,k−1} − θ⋆)) ].

Noise: additive + (multiplicative ∝ ||θ^t_{p,k} − θ⋆||²)

Residual: ∝ ||θ^t_{p,k} − θ⋆||²

17

slide-35
SLIDE 35

Results MBA - OSA

Assume A1, A2, A3, A5, A6, and η^t_k ≡ η for any (t, k) ∈ [C] × [Nt].

Proposition (Mini-batch Averaging)

For any t ∈ [C],

    E[||θ̂^t − θ⋆||²] ≤ (1 − ηµ)^t ||θ0 − θ⋆||² + (2σ²η/P) · (1 − (1 − ηµ)^t)/µ,

    E[||F′′(θ⋆)(θ̄^C − θ⋆)||²] ≲ (||θ0 − θ⋆||²/(η²C²)) · Qbias + (σ²/T) · (1 + Q1,var(C)/P + Q2,var(C)/P²).

18

slide-36
SLIDE 36

Results MBA - OSA

Assume A1, A2, A3, A5, A6, and η^t_k ≡ η for any (t, k) ∈ [C] × [Nt].

Proposition (Mini-batch Averaging)

For any t ∈ [C],

    E[||θ̂^t − θ⋆||²] ≤ (1 − ηµ)^t ||θ0 − θ⋆||² + (2σ²η/P) · (1 − (1 − ηµ)^t)/µ,

    E[||F′′(θ⋆)(θ̄^C − θ⋆)||²] ≲ (||θ0 − θ⋆||²/(η²C²)) · Qbias + (σ²/T) · (1 + Q1,var(C)/P + Q2,var(C)/P²).

Proposition (One-shot Averaging)

For any p ∈ [P], t = 1, k ∈ [N],

    E[||θ^1_{p,k} − θ⋆||²] ≤ (1 − ηµ)^k ||θ0 − θ⋆||² + 2σ²η · (1 − (1 − ηµ)^k)/µ,

    E[||F′′(θ⋆)(θ̄^C − θ⋆)||²] ≲ (||θ0 − θ⋆||²/(η²N²)) · Qbias + (σ²/T) · (1 + Q1,var(N) + Q2,var(N)).

18
slide-37
SLIDE 37

With Qbias = 1 + (M²η/µ) ||θ0 − θ⋆||² + L²η/(µP), Q1,var(X) = L²η/µ + P/(Xηµ), Q2,var(X) = M²XPη²σ²/µ².

19

slide-38
SLIDE 38

With Qbias = 1 + (M²η/µ) ||θ0 − θ⋆||² + L²η/(µP), Q1,var(X) = L²η/µ + P/(Xηµ), Q2,var(X) = M²XPη²σ²/µ².

◮ Asymptotically equivalent for P constant.

◮ Non-asymptotic result (vs. Godichon and Saadane (2017)).

◮ Proposition 1 corrects Bach and Moulines (2011), with the remark of Needell (2014) (see also Dieuleveut and Durmus (2017)).

◮ “The noise is the noise and SGD doesn't care” (for asynchronous SGD, Duchi et al. (2015)).

◮ Extension to the online setting is possible.

19

slide-39
SLIDE 39

Bridging the gap: convergence of Local-SGD: simple case

Assume Q1, A3, A4. For p ∈ [P], t ∈ [C], k ∈ [Nt] (writing N^{t−1}_1 for Σ_{s=1}^{t−1} Ns),

    E[||θ̂^{t−1} − θ⋆||²] ≤ (1 − ηµ)^{N^{t−1}_1} ||θ0 − θ⋆||² + (σ²_∞ η / P) · (1 − (1 − ηµ)^{N^{t−1}_1})/µ,

    E[||θ^t_{p,k} − θ⋆||²] ≤ (1 − ηµ)^{N^{t−1}_1 + k} ||θ0 − θ⋆||² + σ²_∞ η [ (1 − (1 − ηµ)^{N^{t−1}_1})/(µP)  (long-term reduced variance)  +  (1 − (1 − ηµ)^k)/µ  (local iteration variance) ].

Corollary: If for all t ∈ [C], Nt ≤ 1/(µηP), then the second-order moment of θ^t_{p,k} admits the same upper bound as the mini-batch iterate θ̂^{N^{t−1}_1 + k}_{MB}, up to a factor of 2. As a consequence, Local-SGD performs optimally.

slide-40
SLIDE 40

Bridging the gap: convergence of Local-SGD: simple case

Assume Q1, A3, A4. For p ∈ [P], t ∈ [C], k ∈ [Nt] (writing N^{t−1}_1 for Σ_{s=1}^{t−1} Ns),

    E[||θ̂^{t−1} − θ⋆||²] ≤ (1 − ηµ)^{N^{t−1}_1} ||θ0 − θ⋆||² + (σ²_∞ η / P) · (1 − (1 − ηµ)^{N^{t−1}_1})/µ,

    E[||θ^t_{p,k} − θ⋆||²] ≤ (1 − ηµ)^{N^{t−1}_1 + k} ||θ0 − θ⋆||² + σ²_∞ η [ (1 − (1 − ηµ)^{N^{t−1}_1})/(µP)  (long-term reduced variance)  +  (1 − (1 − ηµ)^k)/µ  (local iteration variance) ].

Corollary: If for all t ∈ [C], Nt ≤ 1/(µηP), then the second-order moment of θ^t_{p,k} admits the same upper bound as the mini-batch iterate θ̂^{N^{t−1}_1 + k}_{MB}, up to a factor of 2. As a consequence, Local-SGD performs optimally.

20
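To get a feel for the corollary's condition, a tiny helper (illustrative, not from the talk) that computes the largest admissible number of local steps per phase:

```python
import math

def max_local_steps(mu: float, eta: float, P: int) -> int:
    """Largest N_t allowed by the condition N_t <= 1/(mu * eta * P)."""
    return max(1, math.floor(1.0 / (mu * eta * P)))

# Example: mu = 0.1, eta = 0.01, P = 10 workers -> up to 100 local steps between communications.
print(max_local_steps(0.1, 0.01, 10))
```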

slide-41
SLIDE 41

Example

With a constant number of local steps Nt = N and learning rate η = c/√(NC), in order to obtain an optimal O(σ²/T) parallel convergence rate, Local-SGD can communicate O(√(NC)/(Pµ)) times less than mini-batch averaging.

21
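For a rough sense of scale, with illustrative values N = 100, C = 100, P = 10 and µ = 0.1 (numbers assumed for the example, not from the talk):

```latex
\frac{\sqrt{NC}}{P\mu} \;=\; \frac{\sqrt{100 \times 100}}{10 \times 0.1} \;=\; \frac{100}{1} \;=\; 100,
```

i.e. on the order of a hundred times fewer communication rounds than mini-batch averaging, up to constants.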

slide-42
SLIDE 42

Quadratic + additive noise ↔ too simple and unrealistic

◮ Least-squares regression: quadratic + multiplicative noise (Q1, A3, A5, A6).

◮ Logistic regression: non-quadratic + uniformly bounded variance (A1, A2, A3, A4).

Key lemmas: control how the restart point of each phase differs from its mini-batch equivalent.

22

slide-43
SLIDE 43

Quadratic + additive noise ↔ too simple and unrealistic

◮ Least-squares regression: quadratic + multiplicative noise (Q1, A3, A5, A6).

◮ Logistic regression: non-quadratic + uniformly bounded variance (A1, A2, A3, A4).

Key lemmas: control how the restart point of each phase differs from its mini-batch equivalent.

Theorem

Under either of the following sets of assumptions, the convergence of the Polyak-Ruppert iterate θ̄^C is as good as in the mini-batch case, up to a constant:

  • 1. Assume Q1, A3, A5, A6, and for any t ∈ [C], Nt ≤ 1/(µηP) and µη²N^t_1 = O(1).

22

slide-44
SLIDE 44

Quadratic + additive noise ↔ too simple and unrealistic

◮ Least-squares regression: quadratic + multiplicative noise (Q1, A3, A5, A6).

◮ Logistic regression: non-quadratic + uniformly bounded variance (A1, A2, A3, A4).

Key lemmas: control how the restart point of each phase differs from its mini-batch equivalent.

Theorem

Under either of the following sets of assumptions, the convergence of the Polyak-Ruppert iterate θ̄^C is as good as in the mini-batch case, up to a constant:

  • 1. Assume Q1, A3, A5, A6, and for any t ∈ [C], Nt ≤ 1/(µηP) and µη²N^t_1 = O(1).

  • 2. Assume A1, A2, A3, A4, and for any t ∈ [C], Nt ≤ inf{ 1/(ηPM E[||θ̂^t − θ⋆||]), 1/(µηP) }.

22

slide-45
SLIDE 45

Conclusion

◮ Non-asymptotic analysis of Local-SGD,

◮ with “large” step sizes.

◮ Better understanding of communication trade-offs → lower bounds on communication frequency.

◮ Similar results for the online case (a bit faster, and much more painful for the eyes).

23

slide-46
SLIDE 46

Conclusion

◮ Non-asymptotic analysis of Local-SGD,

◮ with “large” step sizes.

◮ Better understanding of communication trade-offs → lower bounds on communication frequency.

◮ Similar results for the online case (a bit faster, and much more painful for the eyes).

Directions:

◮ Improve to optimal rates in terms of µ with self-concordance.

◮ Proving that those bounds are tight (dangerous to compare upper bounds!).

23

slide-47
SLIDE 47

Agarwal, A., Negahban, S., and Wainwright, M. J. (2012). Fast global convergence of gradient methods for high-dimensional statistical recovery. Ann. Statist., 40(5):2452–2482.

Bach, F. and Moulines, E. (2011). Non-asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS'11, pages 451–459, USA. Curran Associates Inc.

Défossez, A. and Bach, F. (2015). Averaged least-mean-squares: bias-variance trade-offs and optimal sampling distributions. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).

Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. (2012). Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165–202.

Dieuleveut, A., Flammarion, N., and Bach, F. (2016). Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression. ArXiv e-prints.

Duchi, J. C., Chaturapruek, S., and Ré, C. (2015). Asynchronous stochastic convex optimization. ArXiv e-prints.

Godichon, A. B. and Saadane, S. (2017). On the rates of convergence of Parallelized Averaged Stochastic Gradient Algorithms. ArXiv e-prints.

Jain, P., Kakade, S. M., Kidambi, R., Netrapalli, P., and Sidford, A. (2016). Parallelizing Stochastic Approximation Through Mini-Batching and Tail-Averaging. ArXiv e-prints.

Lacoste-Julien, S., Schmidt, M., and Bach, F. (2012). A simpler approach to obtaining an O(1/t) rate for the stochastic projected subgradient method. ArXiv e-prints 1212.2002.

23

slide-48
SLIDE 48

Li, M., Zhang, T., Chen, Y., and Smola, A. J. (2014). Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 661–670. ACM.

Lin, T., Stich, S. U., and Jaggi, M. (2018). Don't Use Large Mini-Batches, Use Local SGD. ArXiv e-prints.

Nemirovsky, A. S. and Yudin, D. B. (1983). Problem complexity and method efficiency in optimization. A Wiley-Interscience Publication. John Wiley & Sons, Inc., New York. Translated from the Russian and with a preface by E. R. Dawson, Wiley-Interscience Series in Discrete Mathematics.

Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855.

Rakhlin, A., Shamir, O., Sridharan, K., et al. (2012). Making gradient descent optimal for strongly convex stochastic optimization. In ICML. Citeseer.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407.

Ruppert, D. (1988). Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering.

Stich, S. U. (2018). Local SGD Converges Fast and Communicates Little. ArXiv e-prints.

Takáč, M., Bijral, A., Richtárik, P., and Srebro, N. (2013). Mini-batch primal and dual methods for SVMs. In Proceedings of the 30th International Conference on Machine Learning, Volume 28, pages III-1022. JMLR.org.

23

slide-49
SLIDE 49

Zhang, J., De Sa, C., Mitliagkas, I., and Ré, C. (2016). Parallel SGD: When does averaging help? ArXiv e-prints.

Zhang, Y., Wainwright, M. J., and Duchi, J. C. (2012). Communication-efficient algorithms for statistical optimization. In Advances in Neural Information Processing Systems, pages 1502–1510.

Zinkevich, M., Weimer, M., Li, L., and Smola, A. J. (2010). Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595–2603.

23