SLIDE 1

An Accelerated Variance Reducing Stochastic Method with Douglas-Rachford Splitting

Jingchang Liu, November 12, 2018

University of Science and Technology of China

SLIDE 2

Table of Contents

  • Background
  • Moreau Envelope and Douglas-Rachford (DR) Splitting
  • Our methods
  • Theories
  • Experiments
  • Conclusions
  • Q & A

SLIDE 3

Background

SLIDE 4

Problem Formulation

  • Regularized ERM: min_{x∈R^d} f(x) + h(x) := (1/n) Σ_{i=1}^n f_i(x) + h(x).
  • f_i : R^d → R: empirical loss of the i-th sample, convex.
  • h: regularization term, convex but possibly non-smooth.
  • Examples: LASSO, sparse SVM, ℓ1-, ℓ2-Logistic Regression.

Definitions

  • Proximal operator: prox_γf(x) = argmin_{y∈R^d} f(y) + (1/2γ)‖y − x‖².
  • Gradient mapping of f at x: (x − prox_γf(x))/γ.
  • Subdifferential: ∂f(x) = {g | gᵀ(y − x) ≤ f(y) − f(x), ∀y ∈ dom f}.
  • µ-strongly convex: f(y) ≥ f(x) + ⟨g, y − x⟩ + (µ/2)‖y − x‖² for g ∈ ∂f(x).
  • L-smooth: f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖².
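To make these definitions concrete (an illustration added here, not from the slides), below is a minimal Python sketch assuming h(x) = λ‖x‖₁, whose proximal operator is soft thresholding; the gradient mapping then follows directly from the proximal operator.

```python
import numpy as np

def prox_l1(x, gamma, lam=0.1):
    """Proximal operator of h(x) = lam * ||x||_1 with parameter gamma:
    argmin_y lam*||y||_1 + (1/(2*gamma)) * ||y - x||^2  (soft thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma * lam, 0.0)

def gradient_mapping(prox, x, gamma):
    """Gradient mapping (x - prox(x, gamma)) / gamma of the corresponding function."""
    return (x - prox(x, gamma)) / gamma

x = np.array([1.5, -0.03, 0.4])
print(prox_l1(x, gamma=1.0))               # [1.4, 0.0, 0.3]
print(gradient_mapping(prox_l1, x, 1.0))   # [0.1, -0.03, 0.1]
```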

SLIDE 5

Related Works

Existing Algorithms

Update x ← prox_γh(x − γv), where the direction v can be obtained from:

  • GD: v = ∇f(x); more computation is needed in each iteration.
  • SGD: v = ∇f_i(x); a small stepsize leads to slow convergence.
  • Variance reduction (VR): v = ∇f_i(x) − ∇f_i(x̄) + ∇f(x̄), as in SVRG, SAGA, SDCA (see the sketch below).

Accelerated Techniques

  • Ill-conditioned problems: L/µ, the condition number, is large.
  • Methods: Acc-SDCA, Catalyst, Mig, Point-SAGA.
  • Drawback: more parameters need to be tuned.
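As an added illustration (not from the slides), a sketch of one such update with the variance-reduced direction; grad_f_i, prox_h, and the snapshot quantities x_bar and full_grad_bar = ∇f(x̄) are assumptions supplied by the caller.

```python
def vr_prox_step(x, x_bar, full_grad_bar, grad_f_i, i, gamma, prox_h):
    """One update x <- prox_{gamma h}(x - gamma * v) with the variance-reduced
    direction v = grad f_i(x) - grad f_i(x_bar) + grad f(x_bar)."""
    v = grad_f_i(x, i) - grad_f_i(x_bar, i) + full_grad_bar
    return prox_h(x - gamma * v, gamma)
```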

SLIDE 6

Convergence Rate

  • VR stochastic methods: O((n + L/µ) log(1/ε)).
  • Acc-SDCA, Mig, Point-SAGA: O((n + √(nL/µ)) log(1/ε)).
  • When L/µ ≫ n, the accelerated techniques make the convergence much faster.

Aim

Design a simpler accelerated VR stochastic method which can achieve the fastest convergence rate.

SLIDE 7

Moreau Envelope and Douglas-Rachford (DR) Splitting

SLIDE 8

Moreau Envelope

Formulation

f^γ(x) = inf_y { f(y) + (1/2γ)‖x − y‖² }.

Properties

  • x* minimizes f(x) iff x* minimizes f^γ(x).
  • f^γ is continuously differentiable even when f is non-differentiable, with ∇f^γ(x) = (x − prox_γf(x))/γ. Moreover, f^γ is 1/γ-smooth.
  • If f is µ-strongly convex, then f^γ is µ/(µγ + 1)-strongly convex.
  • The condition number of f^γ is (µγ + 1)/(µγ), which may be better.

Proximal Point Algorithm (PPA)

x^{k+1} = prox_γf(x^k) = x^k − γ∇f^γ(x^k).
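An added illustration (not from the slides) of PPA driven by the Moreau envelope gradient: the hypothetical example takes f(x) = 0.5‖x − b‖², whose proximal operator has the closed form prox_γf(x) = (x + γb)/(1 + γ).

```python
import numpy as np

b = np.array([1.0, -2.0, 3.0])                      # minimizer of f(x) = 0.5*||x - b||^2
prox_f = lambda x, gamma: (x + gamma * b) / (1.0 + gamma)

x, gamma = np.zeros(3), 1.0
for _ in range(50):
    grad_envelope = (x - prox_f(x, gamma)) / gamma  # gradient of the Moreau envelope f^gamma at x
    x = x - gamma * grad_envelope                   # identical to x = prox_f(x, gamma)
print(x)                                            # converges to b
```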

SLIDE 9

Point-SAGA

Formulation

Used when h is absent: min_{x∈R^d} f(x) := (1/n) Σ_{i=1}^n f_i(x).

Iteration

z_j^k = x^k + γ (g_j^k − (1/n) Σ_{i=1}^n g_i^k),
x^{k+1} = prox_γf_j(z_j^k),
g_j^{k+1} = (z_j^k − x^{k+1})/γ.

Equivalence

x^{k+1} = x^k − γ (g_j^{k+1} − g_j^k + (1/n) Σ_{i=1}^n g_i^k),

where g_j^{k+1} is the gradient mapping of f_j at z_j^k.
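An added sketch of the Point-SAGA iteration above (not the authors' code), assuming a prox oracle for each f_j; the toy prox below uses f_j(x) = 0.5‖x − a_j‖² only so the example runs.

```python
import numpy as np

def point_saga(prox_fj, n, d, gamma, iters, seed=0):
    """Point-SAGA: z_j = x + gamma*(g_j - mean(g)); x <- prox_{gamma f_j}(z_j);
    g_j <- (z_j - x)/gamma, for a uniformly sampled index j."""
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    g = np.zeros((n, d))                    # table of stored gradient mappings g_i
    g_mean = g.mean(axis=0)
    for _ in range(iters):
        j = rng.integers(n)
        z_j = x + gamma * (g[j] - g_mean)
        x = prox_fj(z_j, gamma, j)
        g_new = (z_j - x) / gamma           # gradient mapping of f_j at z_j
        g_mean += (g_new - g[j]) / n        # keep the running average current
        g[j] = g_new
    return x

# Hypothetical data: f_j(x) = 0.5*||x - a[j]||^2, so prox_{gamma f_j}(z) = (z + gamma*a[j])/(1 + gamma).
a = np.random.default_rng(1).normal(size=(5, 3))
prox_fj = lambda z, gamma, j: (z + gamma * a[j]) / (1.0 + gamma)
print(point_saga(prox_fj, n=5, d=3, gamma=0.5, iters=2000))  # approaches the mean of the a_j
```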

SLIDE 10

Point-SAGA: Convergence rate

  • Strongly convex and smooth: O((n + √(nL/µ)) log(1/ε)).
  • Strongly convex and non-smooth: O(1/ε).

SLIDE 11

Douglas-Rachford (DR) Splitting

Formulation

min_{x∈R^d} f(x) + h(x).

Iteration

y^{k+1} = −x^k + y^k + prox_γf(2x^k − y^k),
x^{k+1} = prox_γh(y^{k+1}).

Convergence

  • F(y) = y + prox_γh(2 prox_γf(y) − y) − prox_γf(y).
  • y is a fixed point of F if and only if x = prox_γf(y) satisfies 0 ∈ ∂f(x) + ∂h(x):

    y = F(y) ⇔ 0 ∈ ∂f(prox_γf(y)) + ∂h(prox_γf(y)).
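An added illustration (not from the slides) of the DR iteration, reusing the hypothetical quadratic f(x) = 0.5‖x − b‖² and h(x) = λ‖x‖₁ from the earlier examples.

```python
import numpy as np

b, lam, gamma = np.array([1.0, -0.05, 2.0]), 0.1, 1.0
prox_f = lambda v: (v + gamma * b) / (1.0 + gamma)                        # prox of 0.5*||. - b||^2
prox_h = lambda v: np.sign(v) * np.maximum(np.abs(v) - gamma * lam, 0.0)  # prox of lam*||.||_1

y = np.zeros(3)
for _ in range(200):
    x = prox_h(y)                        # x^k = prox_{gamma h}(y^k)
    y = -x + y + prox_f(2 * x - y)       # y^{k+1} = -x^k + y^k + prox_{gamma f}(2 x^k - y^k)
print(prox_h(y))                         # approx. argmin_x 0.5*||x - b||^2 + lam*||x||_1
```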

SLIDE 12

Our methods

SLIDE 13

Algorithm

SLIDE 14

Iterations

Main iterations

y^{k+1} = x^k − γ (g_j^{k+1} − g_j^k + (1/n) Σ_{i=1}^n g_i^k),
x^{k+1} = prox_γh(y^{k+1}),

where

g_j^{k+1} = (1/γ) [ (z_j^k + x^k − y^k) − prox_γf_j(z_j^k + x^k − y^k) ],

the gradient mapping of f_j at z_j^k + x^k − y^k.

Number of parameters:

  Prox2-SAGA   Point-SAGA   Katyusha   Mig   Acc-SDCA   Catalyst
  1            1            3          2     2          several
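An added sketch of one Prox2-SAGA update as displayed above (not the authors' implementation); prox_fj and prox_h are assumed prox oracles, and the bookkeeping of the stored points z_j and the table g follows the algorithm box on the earlier slide, which did not survive extraction here.

```python
def prox2_saga_step(x, y, z_j, g, j, gamma, prox_fj, prox_h):
    """One update using the displayed formulas, given the stored point z_j and
    the numpy table g of gradient mappings; returns (x_new, y_new, g_j_new)."""
    w = z_j + x - y                                        # point at which f_j's prox is taken
    g_j_new = (w - prox_fj(w, gamma, j)) / gamma           # gradient mapping of f_j at z_j + x - y
    y_new = x - gamma * (g_j_new - g[j] + g.mean(axis=0))  # y^{k+1}
    x_new = prox_h(y_new, gamma)                           # x^{k+1} = prox_{gamma h}(y^{k+1})
    return x_new, y_new, g_j_new
```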

SLIDE 15

Connections to other algorithms

Point-SAGA

When h = 0, we have x^k = y^k for Prox2-SAGA, and

z_j^k = x^k + γ (g_j^k − (1/n) Σ_{i=1}^n g_i^k),
x^{k+1} = prox_γf_j(z_j^k),
g_j^{k+1} = (z_j^k − x^{k+1})/γ.

DR splitting

When n = 1, since g_j^k = (1/n) Σ_{i=1}^n g_i^k in Prox2-SAGA,

y^{k+1} = −x^k + y^k + prox_γf(2x^k − y^k),
x^{k+1} = prox_γh(y^{k+1}).

SLIDE 16

Theories

SLIDE 17

Effectiveness

Proposition

Suppose that (y^∞, {g_i^∞}_{i=1,...,n}) is a fixed point of the Prox2-SAGA iteration. Then x^∞ = prox_γh(y^∞) is a minimizer of f + h.

Proof.

Since y^∞ = −x^∞ + y^∞ + prox_γf_i(z_i^∞ + x^∞ − y^∞), we have x^∞ = prox_γf_i(z_i^∞ + x^∞ − y^∞), which implies

(z_i^∞ − y^∞)/γ ∈ ∂f_i(x^∞), i = 1, . . . , n.  (1)

Meanwhile, because x^∞ = prox_γh(y^∞), we have

(y^∞ − x^∞)/γ ∈ ∂h(x^∞).  (2)

Observing that (1/n) Σ_{i=1}^n (z_i^∞ − y^∞) + (y^∞ − x^∞) = (1/n) Σ_{i=1}^n z_i^∞ − x^∞ = 0, from (1) and (2) we have 0 ∈ ∂f(x^∞) + ∂h(x^∞).

SLIDE 18

Convergence Rate

Non-strongly convex case

Suppose that each f_i is convex and L-smooth, and h is convex. Denote ḡ_j^k = (1/k) Σ_{t=1}^k g_j^t. Then for Prox2-SAGA with stepsize γ ≤ 1/L, at any time k > 0 it holds that

E‖ḡ_j^k − g_j^*‖² ≤ (1/k) [ Σ_{i=1}^n ‖g_i^0 − g_i^*‖² + (1/γ)‖y^0 − y^*‖² ].

Strongly convex case

Suppose that each f_i is µ-strongly convex and L-smooth, and h is convex. Then for Prox2-SAGA with stepsize γ = min{ 1/(µn), (√(9L² + 3µL) − 3L)/(2µL) }, for any time k > 0 it holds that

E‖x^k − x^*‖² ≤ (1 − µγ/(2µγ + 2))^k · ((µγ − 2)/(2 − nµγ)) · [ Σ_{i=1}^n ‖γ(g_i^0 − g_i^*)‖² + ‖y^0 − y^*‖² ].

SLIDE 19

Remarks

  • When the stepsize is

    γ = min{ 1/(µn), (√(9L² + 3µL) − 3L)/(2µL) },

    then O((n + L/µ) log(1/ε)) steps are required to achieve E‖x^k − x^*‖² ≤ ε.

  • When the f_i are ill-conditioned, a larger stepsize

    γ = min{ 1/(µn), (6L + √(36L² − 6(n − 2)µL))/(2(n − 2)µL) }

    is possible, under which the required number of steps is O((n + √(nL/µ)) log(1/ε)).
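An added numeric check of the two stepsize formulas (the values of µ, L, and n below are hypothetical, chosen so that L/µ ≫ n):

```python
import math

mu, L, n = 1e-6, 1.0, 1000          # hypothetical constants with L/mu >> n

gamma_basic = min(1.0 / (mu * n),
                  (math.sqrt(9 * L**2 + 3 * mu * L) - 3 * L) / (2 * mu * L))
gamma_large = min(1.0 / (mu * n),
                  (6 * L + math.sqrt(36 * L**2 - 6 * (n - 2) * mu * L)) / (2 * (n - 2) * mu * L))
print(gamma_basic, gamma_large)     # here gamma_basic is about 0.25, gamma_large is 1/(mu*n) = 1000
```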

SLIDE 20

Experiments

SLIDE 21

Experiments

Figure 2: Comparison of several algorithms with ℓ1ℓ2-Logistic Regression.

SLIDE 22

Experiments

Figure 3: Comparison of several algorithms with ℓ1ℓ2-Logistic Regression.

SLIDE 23

Experiments

[Figure: objective gap vs. epoch on svmguide3, rcv1, covtype, and ijcnn1, comparing Prox2-SAGA, Prox-SAGA, Prox-SDCA, and Prox-SGD.]

Figure 4: Comparison of several algorithms with sparse SVMs.

SLIDE 24

Conclusions

SLIDE 25
  • Prox2-SAGA combines Point-SAGA and DR splitting.
  • Point-SAGA provides the fast convergence rate of Prox2-SAGA.
  • DR splitting provides its effectiveness.

SLIDE 26

Q & A