

SLIDE 1

Selective Linearization Method for Statistical Learning Problems

Yu Du (yu.du@ucdenver.edu)

Joint work: Andrzej Ruszczynski

DIMACS Workshop on ADMM and Proximal Splitting Methods in Optimization June 2018

SLIDE 2

Agenda

1. Introduction to multi-block convex non-smooth optimization: motivating examples; problem formulation

2. Review of related existing methods: proximal point and operator splitting methods; bundle method; alternating linearization method (ALIN)

3. Selective linearization (SLIN) for multi-block convex optimization: the SLIN method; global convergence; convergence rate

4. Numerical illustration: three-block fused lasso; overlapping group lasso; regularized support vector machine problem

5. Conclusions and ongoing work

SLIDE 3

Motivating examples for multi-block structured regularization

\min_{x \in \mathbb{R}^n} F(x) = f(x) + \sum_{i=1}^{N} h_i(B_i x)

Figure: [Zhou et al., 2015]. Figure: [Demiralp et al., 2013].

SLIDE 4

Motivating examples for multi-block structured regularization

\min_{S \in \mathcal{S}} \|P_\Omega(S) - P_\Omega(A)\|_F^2 + \gamma \|RS\|_1 + \tau \|S\|_*

Figure: [Zhou et al., 2015]

\min_{w} \frac{1}{2K\lambda} \|b - Aw\|_2^2 + \sum_{j=1}^{K} d_j \|w_{T_j}\|_2

Figure: [Demiralp et al., 2013]

SLIDE 5

Problem formulation for multi-block convex optimization

Problem formulation

\min_{x \in \mathbb{R}^n} F(x) = f_1(x) + f_2(x) + \dots + f_N(x)

f_1, f_2, \dots, f_N : \mathbb{R}^n \to \mathbb{R} are convex functions. We introduced the selective linearization (SLIN) algorithm for multi-block non-smooth convex optimization. Global convergence is guaranteed, and an almost O(1/k) convergence rate holds when only 1 out of the N functions is strongly convex, where k is the iteration number (Du, Lin, and Ruszczynski, 2017).

SLIDE 6

Review of Related Existing Methods

The selective linearization algorithm builds on the ideas of the proximal point algorithm (Rockafellar, 1976), operator splitting methods (Douglas and Rachford, 1956), (Lions and Mercier, 1979), and later (Eckstein and Bertsekas, 1992) and (Bauschke and Combettes, 2011), bundle methods (Kiwiel, 1985), (Ruszczynski, 2006), and the alternating linearization method (Kiwiel, Rosa, and Ruszczynski, 1999).

Proximal point method

\min_{x \in \mathbb{R}^n} F(x)

where F : \mathbb{R}^n \to \mathbb{R} is a convex function. To solve this problem, construct the proximal step

\mathrm{prox}_F(x^k) = \arg\min_x \Bigl\{ F(x) + \frac{ρ}{2} \|x - x^k\|^2 \Bigr\}

and set x^{k+1} = \mathrm{prox}_F(x^k), k = 1, 2, \dots. The iteration is known to converge to the minimum of F(·) (Rockafellar, 1976).
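As a small illustration (not from the slides), the iteration can be sketched in a few lines of Python, with cvxpy standing in for an exact solver of the proximal subproblem; the test function F, ρ, and the tolerance are choices made for this example.

```python
import numpy as np
import cvxpy as cp

# Proximal point iterations for the convex, nonsmooth F(x) = ||x||_1 + 0.5*||x - c||^2.
c = np.array([1.0, -2.0, 0.3])
rho, x_k = 1.0, np.zeros(3)
for k in range(50):
    z = cp.Variable(3)
    F = cp.norm1(z) + 0.5 * cp.sum_squares(z - c)
    # Proximal step: x_{k+1} = argmin_z  F(z) + (rho/2) * ||z - x_k||^2
    cp.Problem(cp.Minimize(F + 0.5 * rho * cp.sum_squares(z - x_k))).solve()
    if np.linalg.norm(z.value - x_k) < 1e-8:
        break
    x_k = z.value
print(x_k)  # approaches the minimizer of F (soft-thresholding of c at level 1)
```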

SLIDE 7

Bundle method

The main idea of the bundle method (Kiwiel, 1985), (Ruszczynski, 2006) is to replace the problem \min_{x \in \mathbb{R}^n} F(x) with a sequence of approximate problems of the form (cutting-plane approximation):

\min_{x \in \mathbb{R}^n} \tilde{F}_k(x) + \frac{ρ}{2} \|x - x^k\|^2

Here \tilde{F}_k(·) is a piecewise linear convex lower approximation of the function F(·):

\tilde{F}_k(x) = \max_{j \in J_k} \bigl\{ F(z_j) + \langle g_j, x - z_j \rangle \bigr\}

with earlier generated trial points z_j and subgradients g_j ∈ ∂F(z_j), j ∈ J_k, where J_k ⊆ {1, . . . , k}. The solution of the proximal step is subject to a sufficient improvement test, which decides whether the proximal center is moved to the current solution or not.

SLIDE 8

Bundle method (cont)

\tilde{F}_k(x) = \max_{j \in J_k} \bigl\{ F(z_j) + \langle g_j, x - z_j \rangle \bigr\}
SLIDE 9

Bundle method (cont)

Bundle method with multiple cuts

Step 1: Initialization. Set k = 1, J_1 = {1}, z_1 = x^1, and select g_1 ∈ ∂F(z_1). Choose a parameter β ∈ (0, 1) and a stopping precision ε > 0.
Step 2: z_{k+1} ← \arg\min_x \{ \tilde{F}_k(x) + \frac{ρ}{2}\|x − x^k\|^2 \}.
Step 3: If F(x^k) − \tilde{F}_k(z_{k+1}) ≤ ε, stop; otherwise continue.
Step 4: Update test. If F(z_{k+1}) ≤ F(x^k) − β \bigl( F(x^k) − \tilde{F}_k(z_{k+1}) \bigr), then set x^{k+1} = z_{k+1} (descent step); otherwise set x^{k+1} = x^k (null step).
Step 5: Select a set J_{k+1} so that J_k ∪ {k+1} ⊇ J_{k+1} ⊇ {k+1} ∪ \{ j ∈ J_k : F(z_j) + \langle g_j, z_{k+1} − z_j \rangle = \tilde{F}_k(z_{k+1}) \}.
Increase k by 1 and go to Step 2.
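A compact Python sketch of this scheme (an illustration, not the authors' code): all cuts are kept, which is one admissible choice of J_{k+1}, and cvxpy is used to solve the proximal subproblem exactly; the test objective, ρ, and β are assumptions of the example.

```python
import numpy as np
import cvxpy as cp

def bundle_multiple_cuts(F, subgrad, x1, rho=1.0, beta=0.5, eps=1e-6, max_iter=200):
    """Bundle method with multiple cuts (sketch): keep every cut generated so far."""
    x = x1.astype(float).copy()
    Z, G = [x.copy()], [subgrad(x)]            # trial points z_j and subgradients g_j
    for _ in range(max_iter):
        xv = cp.Variable(x.size)
        cuts = [F(z) + g @ (xv - z) for z, g in zip(Z, G)]
        model = cuts[0] if len(cuts) == 1 else cp.maximum(*cuts)
        # Step 2: minimize the cutting-plane model plus the proximal term.
        cp.Problem(cp.Minimize(model + 0.5 * rho * cp.sum_squares(xv - x))).solve()
        z_new = xv.value
        model_at_z = max(F(z) + g @ (z_new - z) for z, g in zip(Z, G))
        if F(x) - model_at_z <= eps:           # Step 3: stopping test
            return x
        # Step 4: descent test; move the proximal center only on sufficient improvement.
        if F(z_new) <= F(x) - beta * (F(x) - model_at_z):
            x = z_new
        Z.append(z_new)                        # Step 5 (simplified): add the new cut
        G.append(subgrad(z_new))
    return x

# Tiny example: F(x) = ||x||_1 + 0.5*||x - c||^2 with a valid subgradient oracle.
c = np.array([2.0, -0.3, 1.5])
F = lambda x: np.abs(x).sum() + 0.5 * np.sum((x - c) ** 2)
subgrad = lambda x: np.sign(x) + (x - c)       # sign(0) := 0 is a valid choice
print(bundle_multiple_cuts(F, subgrad, np.zeros(3)))
```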

SLIDE 10

Bundle method (cont)

Bundle method with cut aggregation

Step 1: Initialization. Set k = 1, J_1 = {1}, z_1 = x^1, and select g_1 ∈ ∂F(z_1). Choose a parameter β ∈ (0, 1) and a stopping precision ε > 0.
Step 2: z_{k+1} ← \arg\min_x \{ \tilde{F}_k(x) + \frac{ρ}{2}\|x − x^k\|^2 \}, where
\tilde{F}_k(x) = \max \bigl\{ \bar{F}_k(x), \; F(z_k) + \langle g_k, x − z_k \rangle \bigr\}.
Step 3: If F(x^k) − \tilde{F}_k(z_{k+1}) ≤ ε, stop; otherwise continue.
Step 4: If F(z_{k+1}) ≤ F(x^k) − β \bigl( F(x^k) − \tilde{F}_k(z_{k+1}) \bigr), then set x^{k+1} = z_{k+1} (descent step); otherwise set x^{k+1} = x^k (null step).
Step 5: Define \bar{F}_{k+1}(x) = θ_k \bar{F}_k(x) + (1 − θ_k) \bigl( F(z_k) + \langle g_k, x − z_k \rangle \bigr), where θ_k ∈ [0, 1] is such that the gradient of \bar{F}_{k+1}(·) equals the subgradient of \tilde{F}_k(·) at z_{k+1} that satisfies the optimality conditions of the subproblem.
Increase k by 1 and go to Step 2.

SLIDE 11

Bundle method (cont)

Convergence of the bundle method (in both versions) for convex functions is well known (Kiwiel, 1985), (Ruszczynski, 2006).

Theorem. Suppose Argmin F ≠ ∅ and ε = 0. Then a point x^* ∈ Argmin F exists such that

\lim_{k→∞} x^k = \lim_{k→∞} z_k = x^*.

Convergence rate. (Du and Ruszczynski, 2017) proved that the bundle method for nonsmooth optimization achieves solution accuracy ε in at most O\bigl(\frac{\ln(1/ε)}{ε}\bigr) iterations if the function is strongly convex. The result holds for both the multiple-cuts and the cut-aggregation versions of the method.

SLIDE 12

Operator splitting for two-block convex optimization

Operator splitting

\min_{x \in \mathbb{R}^n} f(x) + h(x)

The solution x̂ satisfies 0 ∈ ∂f(x̂) + ∂h(x̂), where we can view the two subdifferentials as two maximal monotone operators M_1 and M_2 on \mathbb{R}^n: 0 ∈ (M_1 + M_2)(x̂).

Standard ADMM

\min_{x \in \mathbb{R}^n,\, y \in \mathbb{R}^m} f(x) + h(y) \quad \text{s.t.} \quad Mx − y = 0

ADMM for this problem takes the following form, for some scalar parameter c > 0:

x^{k+1} ∈ \arg\min_{x \in \mathbb{R}^n} \{ f(x) + h(y^k) + \langle λ^k, Mx − y^k \rangle + \frac{c}{2}\|Mx − y^k\|^2 \}
y^{k+1} ∈ \arg\min_{y \in \mathbb{R}^m} \{ f(x^{k+1}) + h(y) + \langle λ^k, Mx^{k+1} − y \rangle + \frac{c}{2}\|Mx^{k+1} − y\|^2 \}
λ^{k+1} = λ^k + c(Mx^{k+1} − y^{k+1}).
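The slide writes ADMM with an explicit multiplier λ; the sketch below (an illustration, not from the slides) uses the equivalent scaled form u = λ/c and the special case M = I, applied to a small lasso instance. The data, parameters, and the Cholesky-based x-update are choices made for this example.

```python
import numpy as np

# ADMM for  min 0.5*||Ax - b||^2 + lam*||y||_1  s.t.  x - y = 0  (M = I, scaled multiplier u).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((30, 10)), rng.standard_normal(30)
lam, c = 0.5, 1.0
x = y = u = np.zeros(10)
AtA, Atb = A.T @ A, A.T @ b
L = np.linalg.cholesky(AtA + c * np.eye(10))    # factor once, reuse in every x-update
for _ in range(200):
    x = np.linalg.solve(L.T, np.linalg.solve(L, Atb + c * (y - u)))    # x-update
    y = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / c, 0.0)      # y-update (soft-threshold)
    u = u + x - y                                                      # multiplier update
print(y)
```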

SLIDE 13

Alternating linearization method for two-block convex optimization

The alternating linearization method (ALIN) (Kiwiel, Rosa, and Ruszczynski, 1999) adapted ideas of the operator splitting and bundle methods and proved global convergence. ALIN (Lin, Pham, and Ruszczynski, 2014) has been successfully applied to two-block structured regularization problems in statistical learning.

Algorithm: Alternating Linearization (ALIN)

\min_{x \in \mathbb{R}^n} f(x) + h(x)

1: repeat
2:   x̃_h ← \arg\min \{ \tilde{f}(x) + h(x) + \frac{1}{2}\|x − x̂\|_D^2 \}
3:   g_h ← −g_f − D(x̃_h − x̂)
4:   if (update test for x̃_h) then x̂ ← x̃_h end if
5:   x̃_f ← \arg\min \{ f(x) + \tilde{h}(x) + \frac{1}{2}\|x − x̂\|_D^2 \}
6:   g_f ← −g_h − D(x̃_f − x̂)
7:   if (update test for x̃_f) then x̂ ← x̃_f end if
8: until (stopping test)

Here \tilde{f}(x) and \tilde{h}(x) are linear approximations of f(x) and h(x), and D is a positive definite diagonal matrix.

SLIDE 14

Selective Linearization (SLIN) algorithm

Objective: \min_x F(x) = \sum_{i=1}^{N} f_i(x)

f_1, f_2, \dots, f_N : \mathbb{R}^n \to \mathbb{R} are convex functions; they can be nonsmooth.

Every iteration, we choose an index j according to a selection rule (the largest gap between the function value and its linear approximation) and solve the f_j-subproblem:

\min_x \; f_j(x) + \sum_{i \ne j} \tilde{f}_i^k(x) + \frac{1}{2}\|x − x^k\|_D^2

Each \tilde{f}_i^k(x) is a first-order linearization of f_i(x):

\tilde{f}_i^k(x) = f_i(z_i^k) + \langle g_i^k, x − z_i^k \rangle.

x^k is the proximal center; it is updated over the iterations. After the f_j-subproblem is solved, f_j is linearized using its subgradient at the current solution z_j^k, in preparation for the next subproblem. We denote the approximating function \tilde{F}_k(x) = f_j(x) + \sum_{i \ne j} \tilde{f}_i^k(x).

SLIDE 15

Selective Linearization (SLIN) algorithm

At the end of each iteration:

Stopping rule: check whether the prescribed precision level ε is achieved,
F(x^k) − \tilde{F}_k(z_j^k) ≤ ε.

Update rule: check whether there is a sufficient improvement in F(·); if so, move the proximal center to z_j^k:
F(z_j^k) ≤ F(x^k) − β \bigl( F(x^k) − \tilde{F}_k(z_j^k) \bigr).

Derive the subgradient g_{f_j}^k of f_j from the optimality condition:
g_{f_j}^k = − \sum_{i \ne j} g_i^k − D(z_j^k − x^k).

Find the next index j′ (≠ j) of the function to treat exactly at the next iteration:
j′ = \arg\max_{i \ne j} \bigl\{ f_i(z_j^k) − \tilde{f}_i^k(z_j^k) \bigr\}.
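To make the loop on the last two slides concrete, here is a generic Python sketch (an illustration only, not the authors' implementation). The interfaces fs, subgrads, and solve_sub, and the use of a subgradient oracle just to initialize the linearizations, are assumptions of this sketch; with N = 2 it reduces to an ALIN-style alternation.

```python
import numpy as np

def slin(fs, subgrads, solve_sub, x1, D, beta=0.5, eps=1e-6, max_iter=500):
    """Sketch of the SLIN iteration.

    fs[i](x)                -- value of f_i at x
    subgrads[i](x)          -- one subgradient of f_i at x (used only for initialization)
    solve_sub[i](slope, xc) -- exact minimizer of f_i(x) + <slope, x> + 0.5*||x - xc||_D^2
    D                       -- vector holding the diagonal of the scaling matrix
    """
    N = len(fs)
    xk = x1.astype(float).copy()
    z = [xk.copy() for _ in range(N)]           # linearization points z_i
    g = [subgrads[i](xk) for i in range(N)]     # subgradients g_i at the z_i
    fz = [fs[i](xk) for i in range(N)]          # stored values f_i(z_i)
    j = 0                                       # block treated exactly first
    for _ in range(max_iter):
        slope = sum(g[i] for i in range(N) if i != j)
        zj = solve_sub[j](slope, xk)            # f_j-subproblem
        # Model value F~_k(zj) = f_j(zj) + sum_{i != j} [f_i(z_i) + <g_i, zj - z_i>]
        model = fs[j](zj) + sum(fz[i] + g[i] @ (zj - z[i]) for i in range(N) if i != j)
        Fx = sum(f(xk) for f in fs)
        if Fx - model <= eps:                   # stopping rule
            return xk
        g[j] = -slope - D * (zj - xk)           # subgradient of f_j at zj (optimality condition)
        z[j], fz[j] = zj.copy(), fs[j](zj)
        if sum(f(zj) for f in fs) <= Fx - beta * (Fx - model):
            xk = zj.copy()                      # descent step: move the proximal center
        # Selection rule: largest gap between f_i and its linearization at zj.
        j = max((i for i in range(N) if i != j),
                key=lambda i: fs[i](zj) - (fz[i] + g[i] @ (zj - z[i])))
    return xk
```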

SLIDE 16

Global Convergence

We adapted ideas from the convergence proof of bundle methods to establish the global convergence of the SLIN method. We rely on the property that the subgradients of a convex function are locally bounded.

Global convergence results. Suppose Argmin F ≠ ∅ and ε = 0. Then a point x^* ∈ Argmin F exists such that

\lim_{k→∞} z_j^k = \lim_{k→∞} x^k = x^*;
\lim_{k→∞} η_k = \lim_{k→∞} F(x^k) = F(x^*),

where η_k is the optimal objective value of the subproblem at iteration k:

η_k = \min_x \; f_j(x) + \sum_{i \ne j} \tilde{f}_i^k(x) + \frac{1}{2}\|x − x^k\|_D^2.

SLIDE 17

Global Convergence (cont)

We first proved that the optimal objective values η_k of consecutive subproblems are increasing, and that the increase is bounded below in terms of the quantity v_k = F(x^k) − \tilde{F}_k(z_{j_k}^k).

Lemma. If a null step is made at iteration k, then

η_{k+1} ≥ η_k + \frac{1 − β}{2(N − 1)} \bar{μ}_k v_k,

where

\bar{μ}_k = \min\Bigl\{ 1, \; \frac{(1 − β) v_k}{(N − 1)\,\| s_{j_{k+1}}^k − g_{j_{k+1}}^k \|_{D^{-1}}^2} \Bigr\},

with an arbitrary s_{j_{k+1}}^k ∈ ∂f_{j_{k+1}}(z_{j_k}^k).

Assumption. A constant M exists such that \| s_{j_{k+1}}^k − g_{j_{k+1}}^k \|_{D^{-1}}^2 < M, and ε ≤ (N − 1)M.

SLIDE 18

Global Convergence (cont)

We also need to estimate the size of the steps made by the method.

Lemma. At every iteration k,

F_D(x^k) − η_k ≥ \frac{1}{2} \| z_{j_k}^k − \mathrm{prox}_F(x^k) \|_D^2,

where F_D(y) = \min_x \{ F(x) + \frac{1}{2}\|x − y\|_D^2 \} and \mathrm{prox}_F(y) = \arg\min_x \{ F(x) + \frac{1}{2}\|x − y\|_D^2 \}.

We prove convergence in two cases: finitely many and infinitely many descent steps.

Theorem. Suppose ε = 0, the set K = {1} ∪ {k > 1 : x^k ≠ x^{k−1}} is finite, and \inf F > −∞. Let m ∈ K be the largest index such that x^m ≠ x^{m−1}. Then x^m ∈ Argmin F.

Theorem. Suppose Argmin F ≠ ∅. If the set K = {k : x^{k+1} ≠ x^k} is infinite, then \lim_{k→∞} x^k = x^*, for some x^* ∈ Argmin F.

SLIDE 19

Convergence rate

Convergence rate: almost O(1/k) if F(x) is strongly convex with parameter α (a sufficient condition is that one f_i(x) is strongly convex). The proof proceeds in two steps: bound the number of proximal centers generated for precision level ε, and bound the number of iterations for any fixed proximal center.

Assumption. The function F(·) has a unique minimum point x^* and a constant α > 0 exists such that

F(x) − F(x^*) ≥ α \|x − x^*\|^2

for all x ∈ \mathbb{R}^n with F(x) ≤ F(x^1).

SLIDE 20

Convergence rate (cont)

Lemma. At every iteration k we have F(x^k) − F(x^*) ≤ \frac{F(x^k) − η_k}{\min(α, 1)}.

By virtue of this lemma, the strong convexity assumption, and the update rule, we proved a linear rate of convergence between descent steps.

Lemma. At every descent step k we have

F(x^{k+1}) − F(x^*) ≤ (1 − \bar{α}β) \bigl( F(x^k) − F(x^*) \bigr),

where \bar{α} = \min(α, 1).

By virtue of the update rule and the stopping rule, the proximal center is updated only if F(x^k) − F(x^*) ≥ βε, and we obtain the following bound on the number L of descent steps:

L ≤ 1 + \frac{\ln(βε) − \ln\bigl( F(x^1) − F(x^*) \bigr)}{\ln(1 − \bar{α}β)}.
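How this bound follows (a sketch combining the two facts above, not spelled out on the slide): after L − 1 descent steps the linear-decrease lemma gives F(x^k) − F(x^*) ≤ (1 − \bar{α}β)^{L−1} \bigl( F(x^1) − F(x^*) \bigr) for the current proximal center x^k, while a further descent step is possible only if F(x^k) − F(x^*) ≥ βε. Hence

βε ≤ (1 − \bar{α}β)^{L−1} \bigl( F(x^1) − F(x^*) \bigr),

and taking logarithms and dividing by \ln(1 − \bar{α}β) < 0, which reverses the inequality, gives the stated bound on L.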

SLIDE 21

Convergence rate (cont)

First, we prove a linear rate of convergence between descent steps.

SLIDE 22

Convergence rate (cont)

We now pass to the second issue: estimating the number of null steps between two consecutive descent steps. We base it on the analysis of the gap F(x^k) − η_k.

Lemma. If a null step occurs at iteration k, then

F(x^k) − η_{k+1} ≤ γ \bigl( F(x^k) − η_k \bigr),

where γ = 1 − \frac{(1 − β)^2 ε}{2M}.

SLIDE 23

Convergence rate (cont)

Let x^{(ℓ−1)}, x^{(ℓ)}, x^{(ℓ+1)} be three consecutive proximal centers, ℓ ≥ 2. We want to bound the number of iterations made with proximal center x^{(ℓ)}. To this end, we bound two quantities: F(x^{(ℓ)}) − η_{k(ℓ)}, where k(ℓ) is the first step with proximal center x^{(ℓ)}, and F(x^{(ℓ)}) − η_{k′(ℓ)}, where k′(ℓ) is the last step with proximal center x^{(ℓ)}.

Lemma. If a descent step is made at iteration k(ℓ) − 1, then

F(x^{(ℓ)}) − η_{k(ℓ)} ≤ \frac{3}{2β} \bigl( F(x^{(ℓ−1)}) − F(x^{(ℓ)}) \bigr).

From the earlier lemma F(x^k) − F(x^*) ≤ \frac{F(x^k) − η_k}{\min(α, 1)}, we obtain the following inequality at every (including the last) null step with proximal center x^{(ℓ)}:

F(x^{(ℓ)}) − η_{k′(ℓ)} ≥ \bar{α} \bigl( F(x^{(ℓ)}) − F(x^*) \bigr) ≥ \bar{α} \bigl( F(x^{(ℓ)}) − F(x^{(ℓ+1)}) \bigr).

SLIDE 24

Convergence rate (cont)

Therefore, the number n_ℓ of null steps with proximal center x^{(ℓ)}, if it is positive, satisfies the inequality:

\frac{3}{2β} \bigl( F(x^{(ℓ−1)}) − F(x^{(ℓ)}) \bigr) γ^{n_ℓ − 1} ≥ \bar{α} \bigl( F(x^{(ℓ)}) − F(x^{(ℓ+1)}) \bigr).

Then we obtain the following upper bound on the number of null steps for ℓ ≥ 2:

n_ℓ ≤ 1 + \frac{1}{\ln(γ)} \ln\Bigl( \frac{2β\bar{α}}{3} \cdot \frac{F(x^{(ℓ)}) − F(x^{(ℓ+1)})}{F(x^{(ℓ−1)}) − F(x^{(ℓ)})} \Bigr).

For the first series, we obtain the bound

n_1 ≤ 1 + \frac{1}{\ln(γ)} \ln\Bigl( \bar{α} \, \frac{F(x^{(1)}) − F(x^{(2)})}{F(x^{(1)}) − η_1} \Bigr).

For the last series, we obtain

n_L ≤ 1 + \frac{1}{\ln(γ)} \ln\Bigl( \frac{β}{3} \cdot \frac{ε}{F(x^{(L−1)}) − F(x^{(L)})} \Bigr).

SLIDE 25

Convergence rate (cont)

We aggregate the total number of null steps (the ratios telescope, giving good cancellation):

\sum_{ℓ=1}^{L} n_ℓ ≤ \frac{L − 1}{\ln(γ)} \Bigl( \ln(\bar{α}) + \ln\frac{2β\bar{α}}{3} + \ln\frac{β}{3} + \frac{1}{L − 1} \ln\frac{ε}{F(x^1) − η_1} \Bigr) + L.

Recall the previous bound on the number of descent steps:

L ≤ 1 + \frac{\ln(βε) − \ln\bigl( F(x^1) − F(x^*) \bigr)}{\ln(1 − \bar{α}β)}.

As a result, we obtain the final bound on the total number of descent and null steps, with C = \frac{(1 − β)^2}{2M(N − 1)^2}:

L + \sum_{ℓ=1}^{L} n_ℓ ≤ \frac{1}{εC \ln(1 − \bar{α}β)} \ln\frac{F(x^1) − F(x^*)}{βε} \Bigl( \ln(\bar{α}) + \ln\frac{2β\bar{α}}{3} + \ln\frac{β}{3} \Bigr) + \frac{1}{εC} \ln\frac{F(x^1) − η_1}{ε} + 2\,\frac{\ln(βε) − \ln\bigl( F(x^1) − F(x^*) \bigr)}{\ln(1 − \bar{α}β)} + 2.

SLIDE 26

Convergence rate (cont)

Therefore, in order to achieve precision ε, the number of steps needed is of order

L + \sum_{ℓ=1}^{L} n_ℓ \sim O\Bigl( \frac{1}{ε} \ln\frac{1}{ε} \Bigr).

This is almost equivalent to saying that, given the number of iterations k, the precision of the solution is approximately O(1/k) (Du, Lin, and Ruszczynski, 2017).

SLIDE 27

Experiment: Three-block fused lasso problem

\min_x \; \frac{1}{2}\|Ax − b\|^2 + λ_1 \|x\|_1 + λ_2 \sum_j \|x_{j+1} − x_j\|_1

with f_1(x) = \frac{1}{2}\|Ax − b\|^2, f_2(x) = λ_1\|x\|_1, and f_3(x) = λ_2 \sum_j \|x_{j+1} − x_j\|_1, where the matrix A has dimension n × p, g_{f_i} denotes a (sub)gradient of f_i evaluated at z_i, and D = diag(AᵀA).

SLIDE 28

Subproblems

f_1-subproblem

\min_x \; \frac{1}{2}\|b − Ax\|_2^2 + g_{f_2}^T x + g_{f_3}^T x + \frac{1}{2}\|x − x̂\|_D^2.

This is an unconstrained quadratic optimization problem and can be solved efficiently by the preconditioned conjugate gradient method.

f_2-subproblem

\min_x \; g_{f_1}^T x + λ_1\|x\|_1 + g_{f_3}^T x + \frac{1}{2}\|x − x̂\|_D^2.

Closed-form solution: (x_{f_2})_i = \mathrm{sgn}(τ_i) \max\bigl(0, |τ_i| − \frac{λ_1}{d_i}\bigr) for each component i, where τ_i = x̂_i − \frac{(g_{f_1})_i + (g_{f_3})_i}{d_i}.
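A one-function Python sketch of this closed form (the function and argument names are illustrative); d holds the diagonal of D = diag(AᵀA).

```python
import numpy as np

def f2_subproblem(x_hat, g_f1, g_f3, d, lam1):
    """Componentwise soft-thresholding solution of the f2-subproblem."""
    tau = x_hat - (g_f1 + g_f3) / d
    return np.sign(tau) * np.maximum(np.abs(tau) - lam1 / d, 0.0)
```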

SLIDE 29

Subproblems (cont)

f_3-subproblem

\min_x \; g_{f_1}^T x + g_{f_2}^T x + λ_2 \sum_j \|x_{j+1} − x_j\|_1 + \frac{1}{2}\|x − x̂\|_D^2.

According to (Lin, Pham, and Ruszczynski, 2014), this problem can be treated by duality: it is equivalent to the following box-constrained quadratic programming problem,

\max_{μ} \; −\frac{1}{2} μ^T R D^{-1} R^T μ + μ^T R (x̂ − D^{-1} g_{f_1} − D^{-1} g_{f_2}), \quad \text{subject to } \|μ\|_∞ ≤ λ_2,

where R is the first-difference matrix, so that \|Rx\|_1 = \sum_j \|x_{j+1} − x_j\|_1. It can be solved by a box-constrained conjugate gradient method or a block coordinate descent method.
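The paper solves this dual with a box-constrained conjugate gradient or coordinate descent method; as a simpler stand-in (explicitly not the authors' solver), here is a projected-gradient sketch on the same box-constrained dual, with the primal point recovered from the (approximate) dual at the end. The difference matrix R, the step size, and the iteration count are assumptions of this sketch.

```python
import numpy as np

def f3_subproblem_pg(x_hat, g_f1, g_f2, d, lam2, iters=500):
    """Projected gradient on  max_mu -0.5*mu^T R D^{-1} R^T mu + mu^T R (x_hat - D^{-1}(g_f1+g_f2)),
    subject to ||mu||_inf <= lam2;  d holds the diagonal of D."""
    p = x_hat.size
    R = np.diff(np.eye(p), axis=0)               # (p-1) x p first differences: (Rx)_j = x_{j+1} - x_j
    c = R @ (x_hat - (g_f1 + g_f2) / d)          # linear term of the dual
    Q = R @ (R / d).T                            # R D^{-1} R^T
    mu = np.zeros(p - 1)
    step = 1.0 / np.linalg.norm(Q, 2)            # 1/L step for the smooth concave dual
    for _ in range(iters):
        mu = np.clip(mu + step * (c - Q @ mu), -lam2, lam2)   # ascent step + projection onto the box
    return x_hat - (g_f1 + g_f2 + R.T @ mu) / d  # recover the primal point
```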

SLIDE 30

Comparison with other related methods

Methods compared (labels): SLIN, Alternating Linearization, DRc, DRc over-relax, DRc under-relax, DRρ. Features compared: selection rule, update rule, proximal parameter D, relaxation parameter δ.

Table: Main features comparison.

*over-relax: δ = 1.5; under-relax: δ = 0.5; DRρ: ρ = constant values.

SLIDE 31

Comparison with other related methods (cont)

Figure: Comparison of methods (ln(F(x) − F(x*)) versus iterations) for SLIN, Cyclical Linearization, DRC, DRC over-relax, DRC under-relax, DR, and Vu splitting on problems with n = 10000 and p = 1000.

Figure: Comparison of the same methods on problems with n = 3000 and p = 4000.

SLIDE 32

Comparison with other related methods (cont)

Figure: Running time of SLIN, Cyclical Linearization, DRC, DRC over-relax, DRC under-relax, DR, and Vu splitting as the sample size changes, with p = 1000.

Figure: Running time of the same methods as the variable dimension changes, with n = 3000.

SLIDE 33

Experiment: Overlapping group lasso

Overlapping group lasso

\min_x \; \frac{1}{2Kλ}\|b − Ax\|_2^2 + \sum_{j=1}^{K} d_j \|x_{G_j}\|_2

with f_0(x) = \frac{1}{2Kλ}\|b − Ax\|_2^2 and f_j(x) = d_j \|x_{G_j}\|_2, j = 1, . . . , K, where the matrix A has dimension n × p.

G_j ⊆ {1, . . . , p} is the index set of a group (subset) of variables, and x_{G_j} denotes a copy of x with all variables not contained in the group G_j set to 0. There are K groups. We adopt the uniform weight d_j = 1/K, set λ = K/5, and use D = \frac{1}{Kλ}\,\mathrm{diag}(AᵀA).

SLIDE 34

Subproblems

f0-subproblem

\min_x \; \frac{1}{2Kλ}\|b − Ax\|_2^2 + \sum_{j=1}^{K} g_{f_j}^T x + \frac{1}{2}\|x − x̂\|_D^2.

This is an unconstrained quadratic optimization problem. Its optimal solution can be obtained by solving the following linear system of equations with the preconditioned conjugate gradient method:

\Bigl( \frac{1}{Kλ} AᵀA + D \Bigr) x = \frac{1}{Kλ} Aᵀb − \sum_{j=1}^{K} g_{f_j} + D x̂.
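A minimal SciPy sketch of this solve with (unpreconditioned) conjugate gradients; the function name, the packaging of \sum_j g_{f_j} as g_sum, and the tolerance are assumptions of the sketch, and the preconditioning mentioned above is omitted.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def f0_subproblem(A, b, x_hat, g_sum, d, K, lam):
    """Solve ((1/(K*lam)) A^T A + D) x = (1/(K*lam)) A^T b - sum_j g_fj + D x_hat,
    where d holds the diagonal of D and g_sum = sum_j g_fj."""
    p = A.shape[1]
    matvec = lambda v: (A.T @ (A @ v)) / (K * lam) + d * v
    rhs = (A.T @ b) / (K * lam) - g_sum + d * x_hat
    x, info = cg(LinearOperator((p, p), matvec=matvec), rhs, atol=1e-10)
    return x
```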

f_j-subproblem

\min_x \; d_j \|x_{G_j}\|_2 + \langle s, x \rangle + \frac{1}{2}\|x − x̂\|_D^2,

where s = g_{f_0} + \sum_{j′ \ne j} g_{f_{j′}}.

SLIDE 35

Solving fj-subproblems

f_j-subproblem. The decision variables outside of the current group G_j have a closed-form solution:

x_{−G_j} = x̂_{−G_j} − D_{−G_j}^{-1} s_{−G_j}.

The variables in the current group G_j are found from the first-order conditions of the f_j-subproblem: if x_{G_j} ≠ 0,

\frac{d_j x_{G_j}}{\|x_{G_j}\|_2} + s_{G_j} + D_{G_j}(x_{G_j} − x̂_{G_j}) = 0.

We denote \frac{d_j}{\|x_{G_j}\|_2} = γ. This leads to the following equation for γ:

\sum_{i \in G_j} \Bigl( \frac{D_{ii} x̂_i − s_i}{1 + D_{ii}/γ} \Bigr)^2 = d_j^2.

Letting γ → ∞ on the left-hand side, we obtain the condition for the existence of a solution:

\sum_{i \in G_j} \bigl( D_{ii} x̂_i − s_i \bigr)^2 > d_j^2.

If this inequality is satisfied, γ can be found by bisection; otherwise the optimal solution is x_{G_j} = 0.
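A compact Python sketch of this computation under the formulas above (the function name, argument layout, and bracketing strategy for the bisection are assumptions).

```python
import numpy as np

def group_subproblem(x_hat, s, d, group, d_j, tol=1e-10):
    """Solve the f_j-subproblem for one group: closed form outside G_j, bisection on gamma inside."""
    x = x_hat - s / d                           # closed form for the variables outside G_j
    idx = np.asarray(group)
    r = d[idx] * x_hat[idx] - s[idx]            # D_ii * xhat_i - s_i for i in G_j
    if np.sum(r ** 2) <= d_j ** 2:              # existence condition fails: the group is zeroed out
        x[idx] = 0.0
        return x
    phi = lambda gam: np.sum((r / (1.0 + d[idx] / gam)) ** 2) - d_j ** 2   # increasing in gamma
    lo, hi = tol, 1.0
    while phi(hi) < 0.0:                        # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:                        # bisection on phi(gamma) = 0
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if phi(mid) < 0.0 else (lo, mid)
    gam = 0.5 * (lo + hi)
    x[idx] = r / (gam + d[idx])                 # x_i = (D_ii*xhat_i - s_i) / (gamma + D_ii) on G_j
    return x
```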

SLIDE 36

Overlapping group lasso: tree-structured overlapping groups

\min_w \; \frac{1}{2Kλ}\|b − Aw\|_2^2 + \sum_{j=1}^{K} d_j \|w_{T_j}\|_2

Table: Comparison of SLIN and FISTA on the tree-structured overlapping group lasso problem.

SLIDE 37

Overlapping group lasso: fixed-order groups

We define a sequence of groups of 100 adjacent inputs with an overlap of 10 variables between two successive groups, so that G = {{1, . . . , 100}, {91, . . . , 190}, . . . , {p − 99, . . . , p}} and p = 90|G| + 10; the group index sets can be generated as in the sketch below.
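For concreteness (a small sketch with 0-based indices, not from the slides):

```python
# Blocks of 100 adjacent variables with an overlap of 10, so p = 90*K + 10.
K = 100
groups = [list(range(90 * j, 90 * j + 100)) for j in range(K)]
p = 90 * K + 10
assert groups[0][:3] == [0, 1, 2] and groups[-1][-1] == p - 1
```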

Comparison of SLIN with other algorithms when K = 100 and n = 1000:

Figure: Objective value versus iterations for SLIN, PDMM, PA-APG, S-APG, and sADMM.

SLIDE 38

Overlapping group lasso: fixed-order groups (cont)

Figure: Running time of SLIN, PDMM, sADMM, PA-APG, and S-APG as the group number changes, with n = 1000.

Figure: Running time of the same methods as the sample size changes, with K = 100.

SLIDE 39

Overlapping group lasso: randomly generated groups

\min_w \; \frac{1}{2Kλ}\|b − Aw\|_2^2 + \sum_{j=1}^{K} d_j \|w_{G_j}\|_2

Table: Comparison of SLIN and PDMM in solving an overlapping group lasso problem with randomly generated groups.

SLIDE 40

Experiment: Regularized support vector machine problem

Consider the regularized SVM problem in which the loss function is a sum of nonsmooth functions:

\min_{x_0, x} \; \frac{1}{n} \sum_{j=1}^{n} \bigl( 1 − b_j(x_0 + A_j^T x) \bigr)_+ + λ_1 \|x\|_1 + \frac{λ_2}{2} \|x\|_2^2

where λ_1, λ_2 > 0 are regularization parameters. Due to the properties of the elastic net regularization, the optimal solution exhibits both sparsity and a grouping effect.

The f-subproblem (Option 1). The f-subproblem has the form:

\min_{x, v} \; \frac{1}{n} \sum_{j=1}^{n} v_j + g_{h_1}^T x + g_{h_2}^T x + \frac{1}{2}\|x − x^k\|_D^2
\text{s.t.} \quad v_j ≥ 1 − b_j(x_0 + A_j^T x), \quad v_j ≥ 0, \quad j = 1, . . . , n.

This is a constrained quadratic programming problem, and its optimal solution can be obtained by an active-set method. At the solution, v_j = (1 − b_j(x_0 + A_j^T x))_+.

SLIDE 41

The f_j-subproblem (Option 2). Solving for each block j, the f_j-subproblem has the form:

\min_{x, v_j} \; \frac{1}{n} v_j + \frac{1}{n} \sum_{i \ne j}^{n+1} l_i^T x + \frac{1}{2}\|x − x^k\|_D^2
\text{s.t.} \quad v_j ≥ 1 − b_j(x_0 + A_j^T x), \quad v_j ≥ 0,

where \sum_{i \ne j}^{n+1} l_i = \sum_{i \ne j} g_{f_i} + g_{h_1} + g_{h_2}, with g_{f_i} denoting a subgradient of the function f_i. To solve the above problem, we consider three cases.

SLIDE 42

Case 1: v_j = 0. We solve in closed form for x and check whether the condition 0 ≥ 1 − b_j(x_0 + A_j^T x) is satisfied; otherwise we go to Case 2.

Case 2: v_j = 1 − b_j(x_0 + A_j^T x). We solve the following unconstrained problem:

\min_x \; \frac{1}{n} \bigl( 1 − b_j(x_0 + A_j^T x) \bigr) + \frac{1}{n} \sum_{i \ne j}^{n+1} l_i^T x + \frac{1}{2}\|x − x^k\|_D^2.

We solve for x in closed form and compute v_j. If v_j > 0, we have found the solution; otherwise we go to Case 3.

Case 3: v_j = 1 − b_j(x_0 + A_j^T x) = 0. We solve the following constrained problem:

\min_x \; \frac{1}{n} \sum_{i \ne j}^{n+1} l_i^T x + \frac{1}{2}\|x − x^k\|_D^2
\text{s.t.} \quad 0 = 1 − b_j(x_0 + A_j^T x).

We calculate x in closed form as a function of the Lagrange multiplier μ of the linear constraint, and then choose μ so that the constraint is satisfied.

SLIDE 43

Experiment: Regularized support vector machine problem

n, p               Method        ρ = 0    ρ = 0.8
n = 50, p = 300    SLIN          0.22     0.13
                   ENSVM-ADMM    0.51     0.41
n = 100, p = 500   SLIN          0.65     0.43
                   ENSVM-ADMM    1.29     0.81
n = 200, p = 1000  SLIN          1.15     1.34
                   ENSVM-ADMM    3.71     3.85

Table: Time comparison of SLIN and ADMM on support vector machine problems.

SLIDE 44

Conclusions

We introduced the selective linearization (SLIN) algorithm for multi-block non-smooth convex optimization. It is an operator-splitting type method that is globally convergent for an arbitrary number of operators, without artificial duplication of variables.

Under a strong convexity condition, the SLIN algorithm is proved to converge at a rate of almost O(1/k). This is a major contribution even in the case of two blocks: the technique we developed can also be used to derive the rate of convergence of the classical bundle method and of the ALIN method, for which no convergence rate estimates had been available.

We have carried out extensive comparison experiments on statistical learning problems such as fused lasso regularization and overlapping group lasso problems. The experimental results demonstrate the efficacy and accuracy of the method.

SLIDE 45

Ongoing work

Study the convergence rate of the SLIN algorithm for general convex, but not strongly convex, objective functions.

Extend SLIN to solve non-convex optimization problems.

Design the algorithm for multi-block convex problems with linear operators and analyze its convergence for general constrained problems:

\min_{x \in X} F(x) = f_1(x) + \sum_{i=2}^{N} f_i(Mx)

Design a stochastic version of the SLIN method for solving large-scale problems, including recommendation matrix completion, medical imaging, dictionary learning, and some deep learning problems, to better understand its practical performance.

SLIDE 46

Related papers

  • Y. Du, X. Lin, A. Ruszczynski, A Selective Linearization Method for Multiblock Convex Optimization, SIAM Journal on Optimization 27 (2017), 1102–1117. https://doi.org/10.1137/15M103217X

  • Y. Du, A. Ruszczynski, Rate of convergence of the bundle method, Journal of Optimization Theory and Applications 173 (2017), 908–922. https://doi.org/10.1007/s1095

SLIDE 47

References

  • A. Ruszczyński, Nonlinear Optimization, Princeton University Press, Princeton, NJ, 2006.

  • K. Kiwiel, C. Rosa, and A. Ruszczyński, Proximal decomposition via alternating linearization, SIAM Journal on Optimization, 9:153–172, 1999.

  • X. Lin, M. Pham, and A. Ruszczyński, Alternating linearization for structured regularization problems, Journal of Machine Learning Research, 15 (2014), 3447–3481.

  • R. Jenatton, J. Mairal, G. Obozinski, and F. Bach, Proximal methods for hierarchical sparse coding, Journal of Machine Learning Research, 12 (2011), 2297–2334.

  • H. Bauschke and P. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC, Springer, New York, 2011.

  • J. Eckstein and D. P. Bertsekas, On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators, Mathematical Programming, 55(3, Ser. A):293–318, 1992.

SLIDE 48

References

  • S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning, 3(1):1–122, 2010.

  • J. Huang, S. Zhang, and D. Metaxas, Efficient MR image reconstruction for compressed MR imaging, Medical Image Analysis (MedIA), 15(5):670–679, 2011.

  • C. Demiralp, E. Hayden, J. Hammerbacher, and J. Heer, Exploring high-dimensional RNA sequences from in vitro selection, IEEE Biological Data Visualization (BioVis), 2013.

  • J. Zhou, P. Gong, Z. Wang, and J. Ye, Mining Structured Sparsity Beyond Convexity, ICDM Tutorial, 2015.

  • R. Tibshirani and P. Wang, Spatial smoothing and hot spot detection for CGH data using the fused lasso, Biostatistics, 2007.

  • T. Kwartler, Intro to text mining using TM, OpenNLP and topic models, Open Data Science Conference, 2015.

SLIDE 49

References

  • C. Chen, B. He, Y. Ye, and X. Yuan, The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent, Mathematical Programming, 2015.

  • N. Chatzipanagiotis, D. Dentcheva, and M. M. Zavlanos, An augmented Lagrangian method for distributed optimization, Mathematical Programming, Ser. A, 152 (2015), No. 1, 405–434.

  • P. L. Combettes and J. Eckstein, Asynchronous block-iterative primal-dual decomposition methods for monotone inclusions, Mathematical Programming, published online 2016-07-05.

  • J. Eckstein, A simplified form of block-iterative operator splitting and an asynchronous algorithm resembling the multi-block ADMM, Optimization Online working paper 2016-7-5533, July 2016.

  • K. Kiwiel, Methods of Descent for Nondifferentiable Optimization, volume 1133 of Lecture Notes in Mathematics, Springer-Verlag, Berlin, 1985.

  • H. Wang, A. Banerjee, and Z.-Q. Luo, Parallel direction method of multipliers, Neural Information Processing Systems (NIPS), 2014.

SLIDE 50

References

  • T. Lin, S. Ma, and S. Zhang, On the convergence rate of multi-block ADMM, submitted.

  • M. Hong and Z. Luo, On the linear convergence of the alternating direction method of multipliers, Mathematical Programming, 2016.

  • W. Deng, M. Lai, Z. Peng, and W. Yin, Parallel multi-block ADMM with o(1/k) convergence, UCLA CAM 13-64, 2013.

  • B. C. Vu, A splitting algorithm for dual monotone inclusions involving cocoercive operators, Advances in Computational Mathematics, 38(3):667–681, 2013.

  • Y. Yu, Better approximation and faster algorithm using the proximal average, Neural Information Processing Systems (NIPS), 2012.

  • X. Chen, Q. Lin, S. Kim, J. G. Carbonell, and E. P. Xing, Smoothing proximal gradient method for general structured sparse regression, The Annals of Applied Statistics, 6:719–752, 2012.

  • G. Ye, Y. Chen, and X. Xie, Efficient variable selection in support vector machines via the alternating direction method of multipliers, Proceedings of Machine Learning Research, 15:832–840, 2011.

  • D. Goldfarb, S. Ma, and K. Scheinberg, Fast alternating linearization methods for minimizing the sum of two convex functions, Mathematical Programming, 141:349, 2013.
