SLIDE 1

Surrogate Losses for Online Learning of Stepsizes in Stochastic Non-Convex Optimization

Zhenxun Zhuang¹, Ashok Cutkosky², Francesco Orabona¹,³

¹Department of Computer Science, Boston University  ²Google  ³Department of Electrical & Computer Engineering, Boston University

SLIDE 2

Convex vs. Non-Convex Functions

[Figure: two panels, "A Convex Function" and "A Non-Convex Function"]

Stationary points: $\nabla f(x) = 0$

SLIDE 3

Gradient Descent vs. Stochastic Gradient Descent

Gradient Descent: $x_{t+1} = x_t - \eta_t \nabla f(x_t)$

SGD: $x_{t+1} = x_t - \eta_t g(x_t, \xi_t)$, with $\mathbb{E}_t[g(x_t, \xi_t)] = \nabla f(x_t)$
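For concreteness, a minimal Python sketch of the two update rules; `grad_f` and `grad_estimate` are hypothetical placeholders for the exact gradient and an unbiased stochastic gradient oracle:

```python
def gd_step(x, eta, grad_f):
    """One gradient descent step: uses the exact gradient of f."""
    return x - eta * grad_f(x)

def sgd_step(x, eta, grad_estimate, rng):
    """One SGD step: uses an unbiased estimate g(x, xi) of the gradient,
    e.g. the gradient computed on a random mini-batch."""
    return x - eta * grad_estimate(x, rng)
```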

SLIDE 4

Curse of Constant Stepsize

  • Ghadimi & Lan (2013): running SGD on M-smooth functions with $\eta \le \frac{1}{M}$, and assuming $\mathbb{E}_t\big[\|g(x_t, \xi_t) - \nabla f(x_t)\|^2\big] \le \sigma^2$, yields, for a uniformly randomly chosen iterate $x_i$,

$$\mathbb{E}\big[\|\nabla f(x_i)\|^2\big] \le O\!\left(\frac{f(x_1) - f^\star}{\eta T} + \eta \sigma^2\right).$$

  • Ward et al. (2018) and Li & Orabona (2019) eliminated the need to know $f^\star$ and $\sigma$ to get the optimal rate, by using AdaGrad global stepsizes.

SLIDE 5

Transform Non-Convexity to Convexity by Surrogate Losses

When the objective function is M-smooth and we draw two independent stochastic gradients in each round of SGD, we have (assuming for now that $\eta_t$ depends only on past gradients):

$$\begin{aligned}
\mathbb{E}[f(x_{t+1}) - f(x_t)] &\le \mathbb{E}\left[\langle \nabla f(x_t), x_{t+1} - x_t \rangle + \frac{M}{2}\|x_{t+1} - x_t\|^2\right] \\
&= \mathbb{E}\left[\langle \nabla f(x_t), -\eta_t g(x_t, \xi_t) \rangle + \frac{M}{2}\eta_t^2 \|g(x_t, \xi_t)\|^2\right] \\
&= \mathbb{E}\left[\langle -\eta_t g(x_t, \xi_t), g(x_t, \xi'_t) \rangle + \frac{M\eta_t^2}{2}\|g(x_t, \xi_t)\|^2\right],
\end{aligned}$$

where the last equality uses the independence of $\xi_t$ and $\xi'_t$ together with unbiasedness, $\mathbb{E}_t[g(x_t, \xi'_t)] = \nabla f(x_t)$.

SLIDE 6

Transform Non-Convexity to Convexity by Surrogate Losses

We define the surrogate loss for f at round t as

$$\ell_t(\eta) \triangleq \langle -\eta\, g(x_t, \xi_t), g(x_t, \xi'_t) \rangle + \frac{M\eta^2}{2}\|g(x_t, \xi_t)\|^2.$$

The inequality of the last slide becomes $\mathbb{E}[f(x_{t+1}) - f(x_t)] \le \mathbb{E}[\ell_t(\eta_t)]$, which, after summing from t = 1 to T and using $f^\star \le f(x_{T+1})$, gives us:

$$f^\star - f(x_1) \le \underbrace{\sum_{t=1}^{T} \mathbb{E}[\ell_t(\eta_t) - \ell_t(\eta)]}_{\text{Regret of } \eta_t \text{ w.r.t. optimal } \eta} + \underbrace{\sum_{t=1}^{T} \mathbb{E}[\ell_t(\eta)]}_{\text{Cumulative loss of optimal } \eta}.$$
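Each $\ell_t$ is a quadratic in $\eta$ with nonnegative leading coefficient, hence convex even though f is not, so the regret term can be controlled by any online convex optimization algorithm. A minimal NumPy sketch of the surrogate (variable names are illustrative placeholders, not from the paper):

```python
import numpy as np

def surrogate_loss(eta, g, g_prime, M):
    """Convex surrogate l_t(eta) built from two independent stochastic
    gradients g = g(x_t, xi_t) and g_prime = g(x_t, xi'_t) of an
    M-smooth objective: -eta*<g, g'> + (M*eta^2/2)*||g||^2."""
    return -eta * np.dot(g, g_prime) + 0.5 * M * eta**2 * np.dot(g, g)
```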

SLIDE 7

SGD with Online Learning

Algorithm 1 Stochastic Gradient Descent with Online Learning (SGDOL)

1: Input: $x_1 \in X$, M, an online learning algorithm A
2: for t = 1, 2, ..., T do
3:   Compute $\eta_t$ by running A on $\ell_i(\eta) = \langle -\eta\, g(x_i, \xi_i), g(x_i, \xi'_i) \rangle + \frac{M\eta^2}{2}\|g(x_i, \xi_i)\|^2$, $i = 1, \ldots, t-1$
4:   Receive two independent unbiased estimates of $\nabla f(x_t)$: $g(x_t, \xi_t)$, $g(x_t, \xi'_t)$
5:   Update $x_{t+1} = x_t - \eta_t\, g(x_t, \xi_t)$
6: end for
7: Output: $x_k$ chosen uniformly at random from $x_1, \ldots, x_T$
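A minimal NumPy sketch of SGDOL, instantiating A as follow-the-regularized-leader on the quadratic surrogates, whose cumulative minimizer is available in closed form. The oracle `grad_oracle` and the regularization constant `alpha` are illustrative assumptions; the paper's exact online learner and its tuning may differ.

```python
import numpy as np

def sgdol(x1, grad_oracle, M, T, alpha=10.0, rng=None):
    """Sketch of SGDOL (Algorithm 1). grad_oracle(x, rng) returns one
    unbiased stochastic gradient at x; it is called twice per round to
    obtain the two independent estimates. The online learner A is FTRL
    on the quadratic surrogates; alpha is an assumed regularizer that
    keeps early stepsizes small and the denominator positive."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x = np.asarray(x1, dtype=float)
    sum_inner = 0.0   # running sum of <g_i, g'_i>
    sum_sq = alpha    # regularized running sum of ||g_i||^2
    iterates = []
    for _ in range(T):
        iterates.append(x.copy())                 # store x_t
        eta = max(0.0, sum_inner / (M * sum_sq))  # step 3: FTRL stepsize
        g = grad_oracle(x, rng)                   # step 4: two independent
        g_prime = grad_oracle(x, rng)             #         unbiased estimates
        x = x - eta * g                           # step 5: SGD update
        sum_inner += float(g @ g_prime)           # feed l_t back to A
        sum_sq += float(g @ g)
    return iterates[rng.integers(len(iterates))]  # step 7: random iterate
```

Because each surrogate is a quadratic in $\eta$, the FTRL stepsize is just a ratio of accumulated scalars, so the per-round overhead over plain SGD is one extra gradient evaluation and two scalar updates.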

SLIDE 8

Main Theorem

Theorem 1: Under some assumptions, and with an appropriate choice of the online learning algorithm in Algorithm 1, for a smooth function and a uniformly randomly picked $x_k$ from $x_1, \ldots, x_T$, we have:

$$\mathbb{E}\big[\|\nabla f(x_k)\|^2\big] \le \tilde{O}\!\left(\frac{1}{T} + \frac{\sigma}{\sqrt{T}}\right),$$

where $\tilde{O}$ hides logarithmic factors.

SLIDE 9

Classification Problem

Objective Function:

$$\frac{1}{m}\sum_{i=1}^{m} \phi(a_i^\top x - y_i), \qquad \phi(\theta) = \frac{\theta^2}{1+\theta^2},$$

on the adult (a9a) training dataset.
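A minimal NumPy sketch of this non-convex objective and a mini-batch gradient oracle for it (loading of a9a is omitted; the array shapes and the batch size are assumptions):

```python
import numpy as np

def objective(x, A, y):
    """(1/m) * sum_i phi(a_i^T x - y_i) with phi(t) = t^2 / (1 + t^2).
    A: (m, d) feature matrix, y: (m,) labels, x: (d,) parameters."""
    theta = A @ x - y
    return np.mean(theta**2 / (1.0 + theta**2))

def stochastic_gradient(x, rng, A, y, batch_size=32):
    """Unbiased stochastic gradient on a uniformly sampled mini-batch,
    using phi'(t) = 2t / (1 + t^2)**2."""
    idx = rng.integers(len(y), size=batch_size)
    A_b, y_b = A[idx], y[idx]
    theta = A_b @ x - y_b
    return A_b.T @ (2.0 * theta / (1.0 + theta**2)**2) / batch_size
```

Binding `A` and `y` (e.g. via `functools.partial`) turns `stochastic_gradient` into a `grad_oracle(x, rng)` callable that plugs directly into the SGDOL sketch above.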
