SLIDE 1

An Accelerated Variance Reducing Stochastic Method with Douglas-Rachford Splitting

Jingchang Liu, November 12, 2018

University of Science and Technology of China

SLIDE 2

Table of Contents

  • Background
  • Moreau Envelope and Douglas-Rachford (DR) Splitting
  • Our methods
  • Theories
  • Experiments
  • Conclusions
  • Q & A

SLIDE 3

Background

SLIDE 4

Problem Formulation

  • Regularized ERM: min_{x∈R^d} f(x) + h(x) := (1/n) Σ_{i=1}^n f_i(x) + h(x).
  • f_i : R^d → R: empirical loss of the i-th sample, convex.
  • h: regularization term, convex but possibly non-smooth.
  • Examples: LASSO, sparse SVM, ℓ1-, ℓ2-Logistic Regression.

Definitions

  • Proximal operator: prox_γf(x) = argmin_{y∈R^d} f(y) + (1/2γ)‖y − x‖².
  • Gradient mapping of f at x: (x − prox_γf(x))/γ.
  • Subdifferential: ∂f(x) = {g | gᵀ(y − x) ≤ f(y) − f(x), ∀y ∈ dom f}.
  • µ-strongly convex: f(y) ≥ f(x) + ⟨g, y − x⟩ + (µ/2)‖y − x‖² for g ∈ ∂f(x).
  • L-smooth: f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖².
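To make these definitions concrete (an illustration added here, not from the slides), below is a minimal Python sketch assuming h(x) = λ‖x‖₁, whose proximal operator is soft thresholding; the gradient mapping then follows directly from the proximal operator.

```python
import numpy as np

def prox_l1(x, gamma, lam=0.1):
    """Proximal operator of h(x) = lam * ||x||_1 with parameter gamma:
    argmin_y lam*||y||_1 + (1/(2*gamma)) * ||y - x||^2  (soft thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma * lam, 0.0)

def gradient_mapping(prox, x, gamma):
    """Gradient mapping (x - prox(x, gamma)) / gamma of the corresponding function."""
    return (x - prox(x, gamma)) / gamma

x = np.array([1.5, -0.03, 0.4])
print(prox_l1(x, gamma=1.0))               # [1.4, 0.0, 0.3]
print(gradient_mapping(prox_l1, x, 1.0))   # [0.1, -0.03, 0.1]
```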

SLIDE 5

Related Works

Existing Algorithms

Update x ← prox_γh(x − γv), where the direction v can be obtained from:

  • GD: v = ∇f(x); more computation is needed in each iteration.
  • SGD: v = ∇f_i(x); a small stepsize leads to slow convergence.
  • Variance reduction (VR): v = ∇f_i(x) − ∇f_i(x̄) + ∇f(x̄), as in SVRG, SAGA, SDCA (see the sketch below).

Accelerated Techniques

  • Ill-conditioned problems: L/µ, the condition number, is large.
  • Methods: Acc-SDCA, Catalyst, Mig, Point-SAGA.
  • Drawback: more parameters need to be tuned.
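As an added illustration (not from the slides), a sketch of one such update with the variance-reduced direction; grad_f_i, prox_h, and the snapshot quantities x_bar and full_grad_bar = ∇f(x̄) are assumptions supplied by the caller.

```python
def vr_prox_step(x, x_bar, full_grad_bar, grad_f_i, i, gamma, prox_h):
    """One update x <- prox_{gamma h}(x - gamma * v) with the variance-reduced
    direction v = grad f_i(x) - grad f_i(x_bar) + grad f(x_bar)."""
    v = grad_f_i(x, i) - grad_f_i(x_bar, i) + full_grad_bar
    return prox_h(x - gamma * v, gamma)
```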

SLIDE 6

Convergence Rate

  • VR stochastic methods: O((n + L/µ) log(1/ε)).
  • Acc-SDCA, Mig, Point-SAGA: O((n + √(nL/µ)) log(1/ε)).
  • When L/µ ≫ n, the accelerated techniques make the convergence much faster.

Aim

Design a simpler accelerated VR stochastic method which can achieve the fastest convergence rate.

SLIDE 7

Moreau Envelope and Douglas-Rachford (DR) Splitting

SLIDE 8

Moreau Envelope

Formulation

f^γ(x) = inf_y { f(y) + (1/2γ)‖x − y‖² }.

Properties

  • x* minimizes f(x) iff x* minimizes f^γ(x).
  • f^γ is continuously differentiable even when f is non-differentiable, with ∇f^γ(x) = (x − prox_γf(x))/γ. Moreover, f^γ is 1/γ-smooth.
  • If f is µ-strongly convex, then f^γ is µ/(µγ + 1)-strongly convex.
  • The condition number of f^γ is (µγ + 1)/(µγ), which may be better.

Proximal Point Algorithm (PPA)

x^{k+1} = prox_γf(x^k) = x^k − γ∇f^γ(x^k).
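An added illustration (not from the slides) of PPA driven by the Moreau envelope gradient: the hypothetical example takes f(x) = 0.5‖x − b‖², whose proximal operator has the closed form prox_γf(x) = (x + γb)/(1 + γ).

```python
import numpy as np

b = np.array([1.0, -2.0, 3.0])                      # minimizer of f(x) = 0.5*||x - b||^2
prox_f = lambda x, gamma: (x + gamma * b) / (1.0 + gamma)

x, gamma = np.zeros(3), 1.0
for _ in range(50):
    grad_envelope = (x - prox_f(x, gamma)) / gamma  # gradient of the Moreau envelope f^gamma at x
    x = x - gamma * grad_envelope                   # identical to x = prox_f(x, gamma)
print(x)                                            # converges to b
```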

SLIDE 9

Point-SAGA

Formulation

Used when h is absent: min_{x∈R^d} f(x) := (1/n) Σ_{i=1}^n f_i(x).

Iteration

z_j^k = x^k + γ (g_j^k − (1/n) Σ_{i=1}^n g_i^k),
x^{k+1} = prox_γf_j(z_j^k),
g_j^{k+1} = (z_j^k − x^{k+1})/γ.

Equivalence

x^{k+1} = x^k − γ (g_j^{k+1} − g_j^k + (1/n) Σ_{i=1}^n g_i^k),

where g_j^{k+1} is the gradient mapping of f_j at z_j^k.
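An added sketch of the Point-SAGA iteration above (not the authors' code), assuming a prox oracle for each f_j; the toy prox below uses f_j(x) = 0.5‖x − a_j‖² only so the example runs.

```python
import numpy as np

def point_saga(prox_fj, n, d, gamma, iters, seed=0):
    """Point-SAGA: z_j = x + gamma*(g_j - mean(g)); x <- prox_{gamma f_j}(z_j);
    g_j <- (z_j - x)/gamma, for a uniformly sampled index j."""
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    g = np.zeros((n, d))                    # table of stored gradient mappings g_i
    g_mean = g.mean(axis=0)
    for _ in range(iters):
        j = rng.integers(n)
        z_j = x + gamma * (g[j] - g_mean)
        x = prox_fj(z_j, gamma, j)
        g_new = (z_j - x) / gamma           # gradient mapping of f_j at z_j
        g_mean += (g_new - g[j]) / n        # keep the running average current
        g[j] = g_new
    return x

# Hypothetical data: f_j(x) = 0.5*||x - a[j]||^2, so prox_{gamma f_j}(z) = (z + gamma*a[j])/(1 + gamma).
a = np.random.default_rng(1).normal(size=(5, 3))
prox_fj = lambda z, gamma, j: (z + gamma * a[j]) / (1.0 + gamma)
print(point_saga(prox_fj, n=5, d=3, gamma=0.5, iters=2000))  # approaches the mean of the a_j
```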

SLIDE 10

Point-SAGA: Convergence rate

  • Strongly convex and smooth: O((n + √(nL/µ)) log(1/ε)).
  • Strongly convex and non-smooth: O(1/ε).

SLIDE 11

Douglas-Rachford (DR) Splitting

Formulation

min_{x∈R^d} f(x) + h(x).

Iteration

y^{k+1} = −x^k + y^k + prox_γf(2x^k − y^k),
x^{k+1} = prox_γh(y^{k+1}).

Convergence

  • F(y) = y + prox_γh(2 prox_γf(y) − y) − prox_γf(y).
  • y is a fixed point of F if and only if x = prox_γf(y) satisfies 0 ∈ ∂f(x) + ∂h(x):

    y = F(y) ⇔ 0 ∈ ∂f(prox_γf(y)) + ∂h(prox_γf(y)).
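An added illustration (not from the slides) of the DR iteration, reusing the hypothetical quadratic f(x) = 0.5‖x − b‖² and h(x) = λ‖x‖₁ from the earlier examples.

```python
import numpy as np

b, lam, gamma = np.array([1.0, -0.05, 2.0]), 0.1, 1.0
prox_f = lambda v: (v + gamma * b) / (1.0 + gamma)                        # prox of 0.5*||. - b||^2
prox_h = lambda v: np.sign(v) * np.maximum(np.abs(v) - gamma * lam, 0.0)  # prox of lam*||.||_1

y = np.zeros(3)
for _ in range(200):
    x = prox_h(y)                        # x^k = prox_{gamma h}(y^k)
    y = -x + y + prox_f(2 * x - y)       # y^{k+1} = -x^k + y^k + prox_{gamma f}(2 x^k - y^k)
print(prox_h(y))                         # approx. argmin_x 0.5*||x - b||^2 + lam*||x||_1
```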

SLIDE 12

Our methods

SLIDE 13

Algorithm

SLIDE 14

Iterations

Main iterations

y^{k+1} = x^k − γ (g_j^{k+1} − g_j^k + (1/n) Σ_{i=1}^n g_i^k),
x^{k+1} = prox_γh(y^{k+1}),

where

g_j^{k+1} = (1/γ) [ (z_j^k + x^k − y^k) − prox_γf_j(z_j^k + x^k − y^k) ],

the gradient mapping of f_j at z_j^k + x^k − y^k.

Number of parameters:

  Prox2-SAGA   Point-SAGA   Katyusha   Mig   Acc-SDCA   Catalyst
  1            1            3          2     2          several
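An added sketch of one Prox2-SAGA update as displayed above (not the authors' implementation); prox_fj and prox_h are assumed prox oracles, and the bookkeeping of the stored points z_j and the table g follows the algorithm box on the earlier slide, which did not survive extraction here.

```python
def prox2_saga_step(x, y, z_j, g, j, gamma, prox_fj, prox_h):
    """One update using the displayed formulas, given the stored point z_j and
    the numpy table g of gradient mappings; returns (x_new, y_new, g_j_new)."""
    w = z_j + x - y                                        # point at which f_j's prox is taken
    g_j_new = (w - prox_fj(w, gamma, j)) / gamma           # gradient mapping of f_j at z_j + x - y
    y_new = x - gamma * (g_j_new - g[j] + g.mean(axis=0))  # y^{k+1}
    x_new = prox_h(y_new, gamma)                           # x^{k+1} = prox_{gamma h}(y^{k+1})
    return x_new, y_new, g_j_new
```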

SLIDE 15

Connections to other algorithms

Point-SAGA

When h = 0, we have x^k = y^k for Prox2-SAGA, and

z_j^k = x^k + γ (g_j^k − (1/n) Σ_{i=1}^n g_i^k),
x^{k+1} = prox_γf_j(z_j^k),
g_j^{k+1} = (z_j^k − x^{k+1})/γ.

DR splitting

When n = 1, since g_j^k = (1/n) Σ_{i=1}^n g_i^k in Prox2-SAGA,

y^{k+1} = −x^k + y^k + prox_γf(2x^k − y^k),
x^{k+1} = prox_γh(y^{k+1}).

SLIDE 16

Theories

SLIDE 17

Effectiveness

Proposition

Suppose that (y^∞, {g_i^∞}_{i=1,...,n}) is a fixed point of the Prox2-SAGA iteration. Then x^∞ = prox_γh(y^∞) is a minimizer of f + h.

Proof.

Since y^∞ = −x^∞ + y^∞ + prox_γf_i(z_i^∞ + x^∞ − y^∞), we have x^∞ = prox_γf_i(z_i^∞ + x^∞ − y^∞), which implies

(z_i^∞ − y^∞)/γ ∈ ∂f_i(x^∞), i = 1, . . . , n.  (1)

Meanwhile, because x^∞ = prox_γh(y^∞), we have

(y^∞ − x^∞)/γ ∈ ∂h(x^∞).  (2)

Observing that (1/n) Σ_{i=1}^n (z_i^∞ − y^∞) + (y^∞ − x^∞) = (1/n) Σ_{i=1}^n z_i^∞ − x^∞ = 0, from (1) and (2) we have 0 ∈ ∂f(x^∞) + ∂h(x^∞).

SLIDE 18

Convergence Rate

Non-strongly convex case

Suppose that each f_i is convex and L-smooth, and h is convex. Denote ḡ_j^k = (1/k) Σ_{t=1}^k g_j^t. Then for Prox2-SAGA with stepsize γ ≤ 1/L, at any time k > 0 it holds that

E‖ḡ_j^k − g_j^*‖² ≤ (1/k) [ Σ_{i=1}^n ‖g_i^0 − g_i^*‖² + (1/γ)‖y^0 − y^*‖² ].

Strongly convex case

Suppose that each f_i is µ-strongly convex and L-smooth, and h is convex. Then for Prox2-SAGA with stepsize γ = min{ 1/(µn), (√(9L² + 3µL) − 3L)/(2µL) }, for any time k > 0 it holds that

E‖x^k − x^*‖² ≤ (1 − µγ/(2µγ + 2))^k · ((µγ − 2)/(2 − nµγ)) · [ Σ_{i=1}^n ‖γ(g_i^0 − g_i^*)‖² + ‖y^0 − y^*‖² ].

SLIDE 19

Remarks

  • When the stepsize is

    γ = min{ 1/(µn), (√(9L² + 3µL) − 3L)/(2µL) },

    then O((n + L/µ) log(1/ε)) steps are required to achieve E‖x^k − x^*‖² ≤ ε.

  • When the f_i are ill-conditioned, a larger stepsize

    γ = min{ 1/(µn), (6L + √(36L² − 6(n − 2)µL))/(2(n − 2)µL) }

    is possible, under which the required number of steps is O((n + √(nL/µ)) log(1/ε)).
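An added numeric check of the two stepsize formulas (the values of µ, L, and n below are hypothetical, chosen so that L/µ ≫ n):

```python
import math

mu, L, n = 1e-6, 1.0, 1000          # hypothetical constants with L/mu >> n

gamma_basic = min(1.0 / (mu * n),
                  (math.sqrt(9 * L**2 + 3 * mu * L) - 3 * L) / (2 * mu * L))
gamma_large = min(1.0 / (mu * n),
                  (6 * L + math.sqrt(36 * L**2 - 6 * (n - 2) * mu * L)) / (2 * (n - 2) * mu * L))
print(gamma_basic, gamma_large)     # here gamma_basic is about 0.25, gamma_large is 1/(mu*n) = 1000
```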

SLIDE 20

Experiments

SLIDE 21

Experiments

Figure 2: Comparison of several algorithms with ℓ1ℓ2-Logistic Regression.

SLIDE 22

Experiments

Figure 3: Comparison of several algorithms with ℓ1ℓ2-Logistic Regression.

SLIDE 23

Experiments

[Figure: objective gap vs. epoch on svmguide3, rcv1, covtype, and ijcnn1, comparing Prox2-SAGA, Prox-SAGA, Prox-SDCA, and Prox-SGD.]

Figure 4: Comparison of several algorithms with sparse SVMs.

SLIDE 24

Conclusions

SLIDE 25
  • Prox2-SAGA combines Point-SAGA and DR splitting.
  • Point-SAGA provides the fast convergence rate of Prox2-SAGA.
  • DR splitting provides its effectiveness.

SLIDE 26

Q & A