Selective Linearization Method for Statistical Learning Problems
Yu Du (yu.du@ucdenver.edu), joint work with Andrzej Ruszczyński
DIMACS Workshop on ADMM and Proximal Splitting Methods in Optimization, June 2018
Agenda
1. Introduction to multi-block convex non-smooth optimization: motivating examples; problem formulation
2. Review of related existing methods: proximal point and operator splitting methods; bundle method; alternating linearization method (ALIN)
3. Selective linearization (SLIN) for multi-block convex optimization: the SLIN method; global convergence; convergence rate
4. Numerical illustration: three-block fused lasso; overlapping group lasso; regularized support vector machine problem
5. Conclusions and ongoing work
Motivating examples for multi-block structured regularization

$$\min_{x\in\mathbb{R}^n}\ F(x) = f(x) + \sum_{i=1}^{N} h_i(B_i x)$$

Figure: [Zhou et al., 2015]. Figure: [Demiralp et al., 2013].
Motivating examples for multi-block structured regularization

$$\min_{S\in\mathcal{S}}\ \|P_\Omega(S) - P_\Omega(A)\|_F^2 + \gamma\|RS\|_1 + \tau\|S\|_*$$

Figure: [Zhou et al., 2015].

$$\min_{w}\ \frac{1}{2K\lambda}\|b - Aw\|_2^2 + \sum_{j=1}^{K} d_j\|w_{T_j}\|_2$$

Figure: [Demiralp et al., 2013].
Problem formulation for multi-block convex optimization

$$\min_{x\in\mathbb{R}^n}\ F(x) = f_1(x) + f_2(x) + \dots + f_N(x)$$

where $f_1, f_2, \dots, f_N : \mathbb{R}^n \to \mathbb{R}$ are convex functions.

We introduce the selective linearization (SLIN) algorithm for this multi-block non-smooth convex optimization problem. Global convergence is guaranteed, and an almost O(1/k) convergence rate holds when only 1 out of the N functions is strongly convex, where k is the iteration number (Du, Lin, and Ruszczyński, 2017).
Review of Related Existing Methods
The selective linearization algorithm draws on the proximal point algorithm (Rockafellar, 1976); operator splitting methods (Douglas and Rachford, 1956), (Lions and Mercier, 1979), developed further by (Eckstein and Bertsekas, 1992) and (Bauschke and Combettes, 2011); bundle methods (Kiwiel, 1985), (Ruszczyński, 2006); and the alternating linearization method (Kiwiel, Rosa, and Ruszczyński, 1999).

Proximal point method

$$\min_{x\in\mathbb{R}^n}\ F(x)$$

where $F : \mathbb{R}^n \to \mathbb{R}$ is a convex function. To solve this problem, construct the proximal step

$$\operatorname{prox}_F(x^k) = \operatorname*{argmin}_x \Big\{ F(x) + \frac{\rho}{2}\|x - x^k\|^2 \Big\}$$

and set $x^{k+1} = \operatorname{prox}_F(x^k)$, $k = 1, 2, \dots$. The iteration is known to converge to a minimizer of F(·) (Rockafellar, 1976).
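For illustration, a minimal sketch of the proximal point iteration on a toy nonsmooth convex function; the choice of F, the value of ρ, and the inner solver (SciPy's Nelder-Mead) are assumptions made here for demonstration, not part of the original method description.

```python
# Hedged sketch of the proximal point iteration x^{k+1} = prox_F(x^k);
# F, rho, and the inner solver are illustrative choices, not from the talk.
import numpy as np
from scipy.optimize import minimize

def F(x):
    # example nonsmooth convex function: l1 norm plus a quadratic
    return np.abs(x).sum() + 0.5 * np.dot(x, x)

def prox_F(xk, rho=1.0):
    # prox_F(x^k) = argmin_x { F(x) + (rho/2) ||x - x^k||^2 }
    obj = lambda x: F(x) + 0.5 * rho * np.dot(x - xk, x - xk)
    return minimize(obj, xk, method="Nelder-Mead").x

xk = np.array([5.0, -3.0])
for k in range(50):
    xk = prox_F(xk)
print("approximate minimizer:", xk)   # the unique minimizer of this F is 0
```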
Bundle method
The main idea of the bundle method (Kiwiel, 1985), (Ruszczyński, 2006) is to replace the problem $\min_{x\in\mathbb{R}^n} F(x)$ with a sequence of approximate problems of the form (cutting-plane approximation):

$$\min_{x\in\mathbb{R}^n}\ \tilde F_k(x) + \frac{\rho}{2}\|x - x^k\|^2$$

Here $\tilde F_k(\cdot)$ is a piecewise linear convex lower approximation of the function F(·):

$$\tilde F_k(x) = \max_{j\in J_k}\big\{ F(z^j) + \langle g^j, x - z^j\rangle \big\}$$

with earlier generated solutions $z^j$ and subgradients $g^j \in \partial F(z^j)$, $j \in J_k$, where $J_k \subseteq \{1, \dots, k\}$. The solution of the proximal step is subject to a sufficient improvement test, which decides whether the proximal center is moved to the current solution or not.
Bundle method (cont.)

$$\tilde F_k(x) = \max_{j\in J_k}\big\{ F(z^j) + \langle g^j, x - z^j\rangle \big\}$$
Bundle method (cont.)

Bundle method with multiple cuts
Step 1: Initialization: set $k = 1$, $J_1 = \{1\}$, $z^1 = x^1$, and select $g^1 \in \partial F(z^1)$. Choose a parameter $\beta \in (0, 1)$ and a stopping precision $\varepsilon > 0$.
Step 2: $z^{k+1} \leftarrow \operatorname{argmin}\big\{ \tilde F_k(x) + \frac{\rho}{2}\|x - x^k\|^2 \big\}$.
Step 3: If $F(x^k) - \tilde F_k(z^{k+1}) \le \varepsilon$, stop; otherwise continue.
Step 4: Update test: if $F(z^{k+1}) \le F(x^k) - \beta\big(F(x^k) - \tilde F_k(z^{k+1})\big)$, then set $x^{k+1} = z^{k+1}$ (descent step); otherwise set $x^{k+1} = x^k$ (null step).
Step 5: Select a set $J_{k+1}$ so that $J_k \cup \{k+1\} \supseteq J_{k+1} \supseteq \{k+1\} \cup \big\{ j \in J_k : F(z^j) + \langle g^j, z^{k+1} - z^j\rangle = \tilde F_k(z^{k+1}) \big\}$.
Increase k by 1 and go to Step 2.
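A minimal sketch of the multiple-cuts variant above, on a toy objective and using SciPy's SLSQP on the epigraph form of the bundle subproblem; the test function, parameters, and solver choice are illustrative assumptions, not the talk's implementation.

```python
# Hedged sketch of the proximal bundle method with multiple cuts on a toy nonsmooth
# objective; the test function, parameters, and the SLSQP epigraph solver are assumptions.
import numpy as np
from scipy.optimize import minimize

def F(x):
    return np.max(np.abs(x)) + 0.5 * np.dot(x, x)          # example: ||x||_inf + quadratic

def subgrad(x):
    g = x.copy()                                            # gradient of the quadratic part
    i = int(np.argmax(np.abs(x)))
    g[i] += np.sign(x[i]) if x[i] != 0 else 1.0             # one subgradient of ||x||_inf
    return g

def bundle_step(xk, cuts, rho):
    # min_{x,t} t + (rho/2)||x - xk||^2  s.t.  t >= F(z_j) + <g_j, x - z_j> for all cuts
    n = xk.size
    cons = [{"type": "ineq",
             "fun": lambda w, Fz=Fz, g=g, z=z: w[n] - (Fz + g @ (w[:n] - z))}
            for (z, Fz, g) in cuts]
    obj = lambda w: w[n] + 0.5 * rho * np.dot(w[:n] - xk, w[:n] - xk)
    w0 = np.concatenate([xk, [F(xk)]])
    sol = minimize(obj, w0, method="SLSQP", constraints=cons).x
    return sol[:n], sol[n]                                  # candidate z^{k+1}, model value

rho, beta = 1.0, 0.5
xk = np.array([3.0, -2.0])
cuts = [(xk.copy(), F(xk), subgrad(xk))]                    # the bundle J_k
for k in range(30):
    z, model_val = bundle_step(xk, cuts, rho)
    if F(xk) - model_val <= 1e-6:                           # Step 3: stopping test
        break
    if F(z) <= F(xk) - beta * (F(xk) - model_val):          # Step 4: update test
        xk = z                                              # descent step
    cuts.append((z, F(z), subgrad(z)))                      # Step 5 (simplified): keep all cuts
print("approximate minimizer:", xk)
```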
Bundle method (cont.)

Bundle method with cut aggregation
Step 1: Initialization: set $k = 1$, $z^1 = x^1$, and select $g^1 \in \partial F(z^1)$. Choose a parameter $\beta \in (0, 1)$ and a stopping precision $\varepsilon > 0$.
Step 2: $z^{k+1} \leftarrow \operatorname{argmin}\big\{ \tilde F_k(x) + \frac{\rho}{2}\|x - x^k\|^2 \big\}$, where
$$\tilde F_k(x) = \max\big\{ \bar F_k(x),\ F(z^k) + \langle g^k, x - z^k\rangle \big\}.$$
Step 3: If $F(x^k) - \tilde F_k(z^{k+1}) \le \varepsilon$, stop; otherwise continue.
Step 4: If $F(z^{k+1}) \le F(x^k) - \beta\big(F(x^k) - \tilde F_k(z^{k+1})\big)$, then set $x^{k+1} = z^{k+1}$ (descent step); otherwise set $x^{k+1} = x^k$ (null step).
Step 5: Define $\bar F_{k+1}(x) = \theta_k \bar F_k(x) + (1 - \theta_k)\big(F(z^k) + \langle g^k, x - z^k\rangle\big)$, where $\theta_k \in [0, 1]$ is such that the gradient of $\bar F_{k+1}(\cdot)$ equals the subgradient of $\tilde F_k(\cdot)$ at $z^{k+1}$ that satisfies the optimality conditions of the subproblem.
Increase k by 1 and go to Step 2.
Bundle method (cont.)

Convergence of the bundle method (in both versions) for convex functions is well known (Kiwiel, 1985), (Ruszczyński, 2006).

Theorem. Suppose $\operatorname{Argmin} F \ne \emptyset$ and $\varepsilon = 0$. Then a point $x^* \in \operatorname{Argmin} F$ exists such that
$$\lim_{k\to\infty} x^k = \lim_{k\to\infty} z^k = x^*.$$

Convergence rate: (Du and Ruszczyński, 2017) proved that the bundle method for nonsmooth optimization achieves solution accuracy $\epsilon$ in at most $O\big(\ln(1/\epsilon)/\epsilon\big)$ iterations if the function is strongly convex. The result holds for both the multiple-cuts and the cut-aggregation versions of the method.
Operator splitting for two-block convex optimization

Operator splitting

$$\min_{x\in\mathbb{R}^n}\ f(x) + h(x)$$

The solution $\hat x$ satisfies $0 \in \partial f(\hat x) + \partial h(\hat x)$, where the two subdifferentials can be viewed as two maximal monotone operators $M_1$ and $M_2$ on $\mathbb{R}^n$: $0 \in (M_1 + M_2)(\hat x)$.

Standard ADMM

$$\min_{x\in\mathbb{R}^n,\ y\in\mathbb{R}^m}\ f(x) + h(y) \quad \text{s.t.}\quad Mx - y = 0$$

ADMM for this problem takes the following form, for some scalar parameter $c > 0$:
$$x^{k+1} \in \operatorname*{argmin}_{x\in\mathbb{R}^n}\Big\{ f(x) + h(y^k) + \langle \lambda^k, Mx - y^k\rangle + \frac{c}{2}\|Mx - y^k\|^2 \Big\}$$
$$y^{k+1} \in \operatorname*{argmin}_{y\in\mathbb{R}^m}\Big\{ f(x^{k+1}) + h(y) + \langle \lambda^k, Mx^{k+1} - y\rangle + \frac{c}{2}\|Mx^{k+1} - y\|^2 \Big\}$$
$$\lambda^{k+1} = \lambda^k + c(Mx^{k+1} - y^{k+1}).$$
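A minimal sketch of the standard ADMM iteration above, specialized (as an assumption for illustration) to the lasso, i.e. $f(x) = \tfrac12\|Ax-b\|^2$, $h(y) = \lambda\|y\|_1$, $M = I$, with randomly generated data.

```python
# Hedged sketch of standard ADMM specialized to the lasso; data and parameters are made up.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20)); b = rng.standard_normal(50)
lam, c = 0.1, 1.0
n = A.shape[1]
AtA, Atb = A.T @ A, A.T @ b

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.zeros(n); y = np.zeros(n); dual = np.zeros(n)        # dual plays the role of lambda^k
for k in range(200):
    # x-update: minimize f(x) + <dual, x - y^k> + (c/2)||x - y^k||^2  (a linear system)
    x = np.linalg.solve(AtA + c * np.eye(n), Atb - dual + c * y)
    # y-update: minimize h(y) - <dual, y> + (c/2)||x^{k+1} - y||^2  (soft-thresholding)
    y = soft_threshold(x + dual / c, lam / c)
    # multiplier update
    dual = dual + c * (x - y)
print("primal residual:", np.linalg.norm(x - y))
```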
Alternating linearization method for two-block convex optimization

The alternating linearization method (ALIN) (Kiwiel, Rosa, and Ruszczyński, 1999) adapted ideas of the operator splitting methods and bundle methods and proved global convergence. ALIN (Lin, Pham, and Ruszczyński, 2014) has been successfully applied to two-block structured regularization problems in statistical learning.

Algorithm: Alternating Linearization (ALIN) for $\min_{x\in\mathbb{R}^n} f(x) + h(x)$
1: repeat
2:   $\tilde x_h \leftarrow \operatorname{argmin}\big\{ \tilde f(x) + h(x) + \frac{1}{2}\|x - \hat x\|_D^2 \big\}$
3:   $g_h \leftarrow -g_f - D(\tilde x_h - \hat x)$
4:   if (update test for $\tilde x_h$) then $\hat x \leftarrow \tilde x_h$ end if
5:   $\tilde x_f \leftarrow \operatorname{argmin}\big\{ f(x) + \tilde h(x) + \frac{1}{2}\|x - \hat x\|_D^2 \big\}$
6:   $g_f \leftarrow -g_h - D(\tilde x_f - \hat x)$
7:   if (update test for $\tilde x_f$) then $\hat x \leftarrow \tilde x_f$ end if
8: until (stopping test)

Here $\tilde f(x)$ and $\tilde h(x)$ are linear approximations of $f(x)$ and $h(x)$, and D is a positive definite diagonal matrix.
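A minimal sketch of the ALIN loop above, assuming $f(x) = \tfrac12\|Ax-b\|^2$, $h(x) = \lambda\|x\|_1$ and $D = \rho I$ so that both subproblems have closed forms; the simplified acceptance test and all data are illustrative, not the paper's implementation.

```python
# Hedged ALIN sketch for two blocks: f = least squares, h = l1, D = rho*I; all data made up.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 15)); b = rng.standard_normal(40)
lam, rho, n = 0.1, 1.0, A.shape[1]
AtA, Atb = A.T @ A, A.T @ b

f = lambda x: 0.5 * np.dot(A @ x - b, A @ x - b)
h = lambda x: lam * np.abs(x).sum()
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x_hat = np.zeros(n)
g_f = A.T @ (A @ x_hat - b)                    # start with the gradient of f at x_hat
for k in range(100):
    # h-step: minimize <g_f, x> + h(x) + (rho/2)||x - x_hat||^2  (soft-thresholding)
    x_h = soft(x_hat - g_f / rho, lam / rho)
    g_h = -g_f - rho * (x_h - x_hat)           # subgradient of h from the optimality condition
    if f(x_h) + h(x_h) < f(x_hat) + h(x_hat):  # simplified update test (assumption)
        x_hat = x_h
    # f-step: minimize f(x) + <g_h, x> + (rho/2)||x - x_hat||^2  (linear system)
    x_f = np.linalg.solve(AtA + rho * np.eye(n), Atb - g_h + rho * x_hat)
    g_f = -g_h - rho * (x_f - x_hat)           # gradient of f from the optimality condition
    if f(x_f) + h(x_f) < f(x_hat) + h(x_hat):  # simplified update test (assumption)
        x_hat = x_f
print("objective:", f(x_hat) + h(x_hat))
```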
Selective Linearization (SLIN) algorithm
Objective:
$$\min_x\ F(x) = \sum_{i=1}^{N} f_i(x)$$
where $f_1, f_2, \dots, f_N : \mathbb{R}^n \to \mathbb{R}$ are convex functions; they may be nonsmooth.

At every iteration we choose an index j according to a selection rule (the largest gap between the function value and its linear approximation) and solve the $f_j$-subproblem:
$$\min_x\ f_j(x) + \sum_{i\ne j} \tilde f_i^k(x) + \frac{1}{2}\|x - x^k\|_D^2$$
Each $\tilde f_i^k(x)$ is a first-order linearization of $f_i(x)$, and $x^k$ is the proximal center, which is updated over the iterations. After the $f_j$-subproblem is solved, $f_j$ is linearized using its subgradient at the current solution $z_j^k$, in preparation for the next subproblem:
$$\tilde f_i^k(x) = f_i(z_i^k) + \langle g_i^k, x - z_i^k\rangle.$$
We denote the approximating function by $\tilde F_k(x) = f_j(x) + \sum_{i\ne j} \tilde f_i^k(x)$.
Selective Linearization (SLIN) algorithm
At the end of each iteration:

Stopping rule: check whether a certain precision level $\varepsilon$ is achieved:
$$F(x^k) - \tilde F_k(z_j^k) \le \varepsilon.$$
Update rule: check whether there is a sufficient improvement in F(·); if so, switch the proximal center to $z_j^k$:
$$F(z_j^k) \le F(x^k) - \beta\big(F(x^k) - \tilde F_k(z_j^k)\big).$$
Derive the subgradient $g_{f_j}^k$ of $f_j$ from the optimality condition:
$$g_{f_j}^k = -\sum_{i\ne j} g_i^k - D(z_j^k - x^k).$$
Find the next index $j'$ ($\ne j$) of the function to be treated exactly at the next iteration:
$$j' = \operatorname*{argmax}_{i\ne j}\big\{ f_i(z_j^k) - \tilde f_i^k(z_j^k) \big\}.$$
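A minimal sketch of one SLIN loop with the rules above, assuming $D = \rho I$ so that each subproblem reduces to a prox step of the treated block; the three example blocks, the data, and the parameters are illustrative assumptions, not the experiments reported later.

```python
# Hedged SLIN sketch with D = rho*I: the f_j-subproblem
#   min_x f_j(x) + <sum_{i!=j} g_i, x> + (rho/2)||x - x^k||^2
# is the prox of f_j/rho at x^k - s/rho; blocks, data, and parameters are made up.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 10)); b = rng.standard_normal(30)
lam, rho, n = 0.1, 5.0, A.shape[1]

f_vals = [lambda x: 0.5 * np.dot(A @ x - b, A @ x - b),   # f_1: least squares
          lambda x: lam * np.abs(x).sum(),                # f_2: l1 penalty
          lambda x: 0.5 * np.dot(x, x)]                   # f_3: smooth quadratic

def solve_subproblem(j, s, xk):
    v = xk - s / rho
    if j == 0:   # least squares block: solve (A^T A + rho I) x = A^T b + rho v
        return np.linalg.solve(A.T @ A + rho * np.eye(n), A.T @ b + rho * v)
    if j == 1:   # l1 block: soft-thresholding
        return np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
    return rho * v / (1.0 + rho)   # quadratic block: (1 + rho) x = rho v

xk = np.zeros(n)
z = [xk.copy(), xk.copy(), xk.copy()]                     # linearization points z_i
g = [A.T @ (A @ xk - b), np.zeros(n), xk.copy()]          # subgradients g_i at z_i
beta, eps, j = 0.5, 1e-8, 0
for k in range(200):
    s = sum(g[i] for i in range(3) if i != j)
    zj = solve_subproblem(j, s, xk)                       # f_j treated exactly
    lin = {i: f_vals[i](z[i]) + g[i] @ (zj - z[i]) for i in range(3) if i != j}
    model = f_vals[j](zj) + sum(lin.values())             # tilde F_k(z_j^k)
    Fx = sum(fi(xk) for fi in f_vals)
    if Fx - model <= eps:                                 # stopping rule
        break
    g[j] = -s - rho * (zj - xk)                           # subgradient of f_j from optimality
    z[j] = zj
    if sum(fi(zj) for fi in f_vals) <= Fx - beta * (Fx - model):
        xk = zj                                           # descent step: move the proximal center
    gaps = {i: f_vals[i](zj) - lin[i] for i in range(3) if i != j}
    j = max(gaps, key=gaps.get)                           # selection rule: largest linearization gap
print("objective:", sum(fi(xk) for fi in f_vals))
```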
Global Convergence
We adapt the proof ideas of bundle methods to establish global convergence of the SLIN method, relying on the assumption that the subgradients of a convex function are locally bounded.

Global convergence result. Suppose $\operatorname{Argmin} F \ne \emptyset$ and $\varepsilon = 0$. Then a point $x^* \in \operatorname{Argmin} F$ exists such that
$$\lim_{k\to\infty} z_j^k = \lim_{k\to\infty} x^k = x^*, \qquad \lim_{k\to\infty} \eta_k = \lim_{k\to\infty} F(x^k) = F(x^*),$$
where $\eta_k$ is the optimal objective value of the subproblem at iteration k:
$$\eta_k = \min_x\ f_j(x) + \sum_{i\ne j} \tilde f_i^k(x) + \frac{1}{2}\|x - x^k\|_D^2.$$
Global Convergence (cont)
We first prove that the optimal objective values $\eta_k$ of consecutive subproblems are increasing, and that the gap is bounded below by the quantity $v_k = F(x^k) - \tilde F_k(z_{j_k}^k)$.

Lemma. If a null step is made at iteration k, then
$$\eta_{k+1} \ge \eta_k + \frac{1-\beta}{2(N-1)}\,\bar\mu_k v_k,$$
where
$$\bar\mu_k = \min\Big\{ 1,\ \frac{(1-\beta)v_k}{(N-1)\|s_{j_{k+1}}^k - g_{j_{k+1}}^k\|_{D^{-1}}^2} \Big\},$$
with an arbitrary $s_{j_{k+1}}^k \in \partial f_{j_{k+1}}(z_{j_k}^k)$.

Assumption. A constant M exists such that $\|s_{j_{k+1}}^k - g_{j_{k+1}}^k\|_{D^{-1}}^2 < M$, and $\varepsilon \le (N-1)M$.
Global Convergence (cont)
We also need to estimate the size of the steps made by the method.

Lemma. At every iteration k,
$$F_D(x^k) - \eta_k \ge \frac{1}{2}\big\|z_{j_k}^k - \operatorname{prox}_F(x^k)\big\|_D^2,$$
where
$$F_D(y) = \min_x\Big\{ F(x) + \frac{1}{2}\|x - y\|_D^2 \Big\}, \qquad \operatorname{prox}_F(y) = \operatorname*{argmin}_x\Big\{ F(x) + \frac{1}{2}\|x - y\|_D^2 \Big\}.$$

We prove convergence in two cases: finitely many and infinitely many descent steps.

Theorem. Suppose $\varepsilon = 0$, the set $K = \{1\} \cup \{k > 1 : x^k \ne x^{k-1}\}$ is finite, and $\inf F > -\infty$. Let $m \in K$ be the largest index such that $x^m \ne x^{m-1}$. Then $x^m \in \operatorname{Argmin} F$.

Theorem. Suppose $\operatorname{Argmin} F \ne \emptyset$. If the set $K = \{k : x^{k+1} \ne x^k\}$ is infinite, then $\lim_{k\to\infty} x^k = x^*$ for some $x^* \in \operatorname{Argmin} F$.
Convergence rate
Almost O(1/k) convergence rate if F(x) is strongly convex with parameter $\alpha$ (a sufficient condition is that one $f_i(x)$ is strongly convex). The proof has two parts: bound the number of proximal centers generated for precision level $\varepsilon$, and bound the number of iterations for any fixed proximal center.

Assumption. The function F(·) has a unique minimum point $x^*$ and a constant $\alpha > 0$ exists such that
$$F(x) - F(x^*) \ge \alpha\|x - x^*\|^2 \quad \text{for all } x \in \mathbb{R}^n \text{ with } F(x) \le F(x^1).$$
Convergence rate (cont)
Lemma. At every iteration k we have
$$F(x^k) - F(x^*) \le \frac{F(x^k) - \eta_k}{\min(\alpha, 1)}.$$

By virtue of this lemma, the strong convexity assumption, and the update rule, we prove a linear rate of convergence between descent steps.

Lemma. At every descent step k we have
$$F(x^{k+1}) - F(x^*) \le (1 - \bar\alpha\beta)\big(F(x^k) - F(x^*)\big),$$
where $\bar\alpha = \min(\alpha, 1)$.

By virtue of the update rule and the stopping rule, the proximal center is updated only if $F(x^k) - F(x^*) \ge \beta\varepsilon$, and we obtain the following bound on the number of descent steps L:
$$L \le 1 + \frac{\ln(\beta\varepsilon) - \ln\big(F(x^1) - F(x^*)\big)}{\ln(1 - \bar\alpha\beta)}.$$
Convergence rate (cont)
We now pass to the second issue: estimating the number of null steps between two consecutive descent steps. We base it on the analysis of the gap $F(x^k) - \eta_k$.

Lemma. If a null step occurs at iteration k, then
$$F(x^k) - \eta_{k+1} \le \gamma\big(F(x^k) - \eta_k\big), \qquad \text{where } \gamma = 1 - \frac{(1-\beta)^2\varepsilon}{2M}.$$
Convergence rate (cont)
Let $x^{(\ell-1)}, x^{(\ell)}, x^{(\ell+1)}$ be three consecutive proximal centers, $\ell \ge 2$. We want to bound the number of iterations made with proximal center $x^{(\ell)}$. To this end, we bound two quantities: $F(x^{(\ell)}) - \eta_{k(\ell)}$, where $k(\ell)$ is the first step with proximal center $x^{(\ell)}$, and $F(x^{(\ell)}) - \eta_{k'(\ell)}$, where $k'(\ell)$ is the last step with proximal center $x^{(\ell)}$.

Lemma. If a descent step is made at iteration $k(\ell) - 1$, then
$$F(x^{(\ell)}) - \eta_{k(\ell)} \le \frac{3}{2\beta}\big(F(x^{(\ell-1)}) - F(x^{(\ell)})\big).$$

From the earlier lemma $F(x^k) - F(x^*) \le \frac{F(x^k) - \eta_k}{\min(\alpha,1)}$, we obtain the following inequality at every (including the last) null step with proximal center $x^{(\ell)}$:
$$F(x^{(\ell)}) - \eta_{k'(\ell)} \ge \bar\alpha\big(F(x^{(\ell)}) - F(x^*)\big) \ge \bar\alpha\big(F(x^{(\ell)}) - F(x^{(\ell+1)})\big).$$
Convergence rate (cont)
Therefore the number $n_\ell$ of null steps with proximal center $x^{(\ell)}$, if it is positive, satisfies the inequality
$$\frac{3}{2\beta}\big(F(x^{(\ell-1)}) - F(x^{(\ell)})\big)\gamma^{\,n_\ell - 1} \ge \bar\alpha\big(F(x^{(\ell)}) - F(x^{(\ell+1)})\big).$$
We then obtain the following upper bound on the number of null steps for $\ell \ge 2$:
$$n_\ell \le 1 + \frac{1}{\ln(\gamma)}\ln\!\left(\frac{2\beta\bar\alpha}{3}\cdot\frac{F(x^{(\ell)}) - F(x^{(\ell+1)})}{F(x^{(\ell-1)}) - F(x^{(\ell)})}\right).$$
For the first series we obtain the bound
$$n_1 \le 1 + \frac{1}{\ln(\gamma)}\ln\!\left(\bar\alpha\,\frac{F(x^{(1)}) - F(x^{(2)})}{F(x^{(1)}) - \eta_1}\right).$$
For the last series we obtain
$$n_L \le 1 + \frac{1}{\ln(\gamma)}\ln\!\left(\frac{\beta}{3}\cdot\frac{\varepsilon}{F(x^{(L-1)}) - F(x^{(L)})}\right).$$
Convergence rate (cont)
We aggregate the total number of null steps (the telescoping terms cancel):
$$\sum_{\ell=1}^{L} n_\ell \le \frac{L-1}{\ln(\gamma)}\left[\ln(\bar\alpha) + \ln\frac{2\beta\bar\alpha}{3} + \ln\frac{\beta}{3} + \frac{1}{L-1}\ln\frac{\varepsilon}{F(x^1) - \eta_1}\right] + L.$$
Recall the earlier bound on the number of descent steps:
$$L \le 1 + \frac{\ln(\beta\varepsilon) - \ln\big(F(x^1) - F(x^*)\big)}{\ln(1 - \bar\alpha\beta)}.$$
As a result, we obtain the final bound on the total number of descent and null steps, where $C = \frac{(1-\beta)^2}{2M(N-1)^2}$:
$$L + \sum_{\ell=1}^{L} n_\ell \le \frac{1}{\varepsilon C\,\ln(1 - \bar\alpha\beta)}\ln\frac{F(x^1) - F(x^*)}{\beta\varepsilon}\left[\ln(\bar\alpha) + \ln\frac{2\beta\bar\alpha}{3} + \ln\frac{\beta}{3}\right] + \frac{1}{\varepsilon C}\ln\frac{F(x^1) - \eta_1}{\varepsilon} + 2\,\frac{\ln(\beta\varepsilon) - \ln\big(F(x^1) - F(x^*)\big)}{\ln(1 - \bar\alpha\beta)} + 2.$$
Convergence rate (cont)
Therefore, in order to achieve precision $\varepsilon$, the number of steps needed is of order
$$L + \sum_{\ell=1}^{L} n_\ell \sim O\!\left(\frac{1}{\varepsilon}\ln\frac{1}{\varepsilon}\right).$$
This is almost equivalent to saying that, given the number of iterations k, the precision of the solution is approximately O(1/k) (Du, Lin, and Ruszczyński, 2017).
Experiment: three-block fused lasso problem

$$\min_x\ \underbrace{\tfrac{1}{2}\|Ax - b\|^2}_{f_1(x)} + \underbrace{\lambda_1\|x\|_1}_{f_2(x)} + \underbrace{\lambda_2\sum_j |x_{j+1} - x_j|}_{f_3(x)}$$

where the matrix A has dimension $n \times p$. $g_{f_i}$ denotes the (sub)gradient of the $f_i$-subproblem evaluated at $z_i$, and $D = \operatorname{diag}(A^T A)$.
Subproblems
f1-subproblem
$$\min_x\ \tfrac{1}{2}\|b - Ax\|_2^2 + g_{f_2}^T x + g_{f_3}^T x + \tfrac{1}{2}\|x - \hat x\|_D^2.$$
This is an unconstrained quadratic optimization problem and can be solved efficiently by the preconditioned conjugate gradient method.

f2-subproblem
$$\min_x\ g_{f_1}^T x + \lambda_1\|x\|_1 + g_{f_3}^T x + \tfrac{1}{2}\|x - \hat x\|_D^2.$$
Closed-form solution:
$$(x_{f_2})_i = \operatorname{sgn}(\tau_i)\max\Big(0,\ |\tau_i| - \frac{\lambda_1}{d_i}\Big), \quad i = 1, \dots, m, \qquad \text{where } \tau_i = (x^0)_i - \frac{(g_{f_1})_i + (g_{f_3})_i}{d_i}.$$
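A minimal sketch of the closed-form f2-subproblem solution above (coordinatewise soft-thresholding); the helper name and the tiny input values are illustrative placeholders.

```python
# Hedged sketch of the f2-subproblem closed form; g_f1, g_f3, d (the diagonal of D),
# x0 and lambda1 below are illustrative placeholders.
import numpy as np

def solve_f2(x0, g_f1, g_f3, d, lam1):
    tau = x0 - (g_f1 + g_f3) / d            # tau_i = (x0)_i - ((g_f1)_i + (g_f3)_i)/d_i
    return np.sign(tau) * np.maximum(np.abs(tau) - lam1 / d, 0.0)

# tiny usage example with made-up values
x0 = np.array([1.0, -0.2, 0.5]); d = np.array([2.0, 2.0, 2.0])
print(solve_f2(x0, np.zeros(3), np.zeros(3), d, lam1=0.3))
```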
Subproblems (cont)
f3-subproblem
$$\min_x\ g_{f_1}^T x + g_{f_2}^T x + \lambda_2\sum_j |x_{j+1} - x_j| + \tfrac{1}{2}\|x - \hat x\|_D^2.$$
According to (Lin, Pham, and Ruszczyński, 2014), this problem is equivalent to the following box-constrained quadratic programming problem:
$$\max_\mu\ -\tfrac{1}{2}\mu^T R D^{-1} R^T\mu + \mu^T R\big(x^0 - D^{-1}g_{f_1} - D^{-1}g_{f_2}\big), \quad \text{subject to } \|\mu\|_\infty \le \lambda_2,$$
where $\|Rx\|_1 = \sum_j |x_{j+1} - x_j|$. It can be solved by a box-constrained conjugate gradient method or a block coordinate descent method.
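A minimal sketch solving the f3-subproblem through its box-constrained dual by projected gradient ascent, used here as a simpler stand-in for the box-constrained conjugate gradient or coordinate descent methods mentioned above; R, the data, and the step-size rule are illustrative assumptions.

```python
# Hedged sketch: projected gradient ascent on the box-constrained dual of the f3-subproblem;
# R is the first-difference operator, and all inputs are illustrative placeholders.
import numpy as np

def solve_f3_dual(x0, g_f1, g_f2, d, lam2, iters=500):
    p = x0.size
    R = np.eye(p, k=1)[:p-1] - np.eye(p)[:p-1]          # (Rx)_j = x_{j+1} - x_j
    Dinv = 1.0 / d
    q = R @ (x0 - Dinv * (g_f1 + g_f2))                 # linear term of the dual
    H = R @ (Dinv[:, None] * R.T)                       # R D^{-1} R^T
    mu = np.zeros(p - 1)
    step = 1.0 / np.linalg.norm(H, 2)
    for _ in range(iters):
        mu = mu + step * (q - H @ mu)                   # gradient ascent step
        mu = np.clip(mu, -lam2, lam2)                   # project onto the box ||mu||_inf <= lam2
    return x0 - Dinv * (g_f1 + g_f2 + R.T @ mu)         # recover the primal solution

x0 = np.array([1.0, 1.2, 0.1, 0.0]); d = np.full(4, 2.0)
print(solve_f3_dual(x0, np.zeros(4), np.zeros(4), d, lam2=0.5))
```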
Comparison with other related methods
The methods are compared along four features: selection rule, update rule, proximal parameter D, and relaxation parameter δ. The methods compared are SLIN, alternating linearization, DRc, DRc with over-relaxation, DRc with under-relaxation, and DRρ.

Table: main features comparison. (Over-relaxation: δ = 1.5; under-relaxation: δ = 0.5; DRρ: ρ = constant values.)
Comparison with other related methods (cont.)

Figure: comparison of methods on problems with n = 10000 and p = 1000 (ln(F(x) - F(x*)) versus iterations; methods: SLIN, cyclical linearization, DRC, DRC over-relaxed, DRC under-relaxed, DR, Vu splitting).

Figure: comparison of methods on problems with n = 3000 and p = 4000 (same methods and axes).
Comparison with other related methods (cont.)

Figure: running time of SLIN and the other methods as the sample size changes, with p = 1000.

Figure: running time of SLIN and the other methods as the variable dimension changes, with n = 3000.
Experiment: Overlapping group lasso
Overlapping group lasso
$$\min_x\ \underbrace{\frac{1}{2K\lambda}\|b - Ax\|_2^2}_{f_0(x)} + \sum_{j=1}^{K} \underbrace{d_j\|x_{G_j}\|_2}_{f_j(x)}$$
where the matrix A has dimension $n \times p$. $G_j \subseteq \{1, \dots, p\}$ is the index set of a group (subset) of variables, and $x_{G_j}$ denotes a copy of x with all variables not contained in the group $G_j$ set to 0. There are K groups. We adopt the uniform weight $d_j = 1/K$, set $\lambda = K/5$, and use $D = \frac{1}{K\lambda}\operatorname{diag}(A^T A)$.
Subproblems
f0-subproblem
$$\min_x\ \frac{1}{2K\lambda}\|b - Ax\|_2^2 + \sum_{j=1}^{K} g_{f_j}^T x + \frac{1}{2}\|x - \hat x\|_D^2.$$
This is an unconstrained quadratic optimization problem. Its optimal solution can be obtained by solving the following linear system of equations with the preconditioned conjugate gradient method:
$$\Big(\frac{1}{K\lambda}A^T A + D\Big)x = \frac{1}{K\lambda}A^T b - \sum_{j=1}^{K} g_{f_j} + D\hat x.$$

fj-subproblem
$$\min_x\ d_j\|x_{G_j}\|_2 + \langle s, x\rangle + \frac{1}{2}\|x - \hat x\|_D^2, \qquad \text{where } s = g_{f_0} + \sum_{j'\ne j} g_{f_{j'}}.$$
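A minimal sketch of the f0-subproblem linear system solved with SciPy's conjugate gradient routine through a matrix-free LinearOperator; the data, K, and λ are illustrative placeholders.

```python
# Hedged sketch of the f0-subproblem: solve ((1/(K*lam)) A^T A + D) x = rhs with
# conjugate gradient; all data and parameters below are made up for illustration.
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

rng = np.random.default_rng(3)
n_samples, p, K, lam = 100, 40, 5, 1.0
A = rng.standard_normal((n_samples, p)); b = rng.standard_normal(n_samples)
d = np.diag(A.T @ A) / (K * lam)                       # diagonal of D = (1/(K*lam)) diag(A^T A)
x_hat = np.zeros(p)
g_sum = np.zeros(p)                                    # sum of the g_{f_j} linearization gradients

matvec = lambda v: (A.T @ (A @ v)) / (K * lam) + d * v
M = LinearOperator((p, p), matvec=matvec)
rhs = (A.T @ b) / (K * lam) - g_sum + d * x_hat
x, info = cg(M, rhs)                                   # a diagonal preconditioner ~ 1/d could be added
print("CG converged:", info == 0)
```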
Solving fj-subproblems
The decision variables outside of the current group $G_j$ have the closed-form solution
$$x_{-G_j} = \hat x_{-G_j} - D_{-G_j}^{-1}s_{-G_j}.$$
The variables in the current group $G_j$ are obtained from the first-order condition of the $f_j$-subproblem: if $x_{G_j} \ne 0$,
$$d_j\frac{x_{G_j}}{\|x_{G_j}\|_2} + s_{G_j} + D_{G_j}(x_{G_j} - \hat x_{G_j}) = 0.$$
Denoting $\gamma = \frac{d_j}{\|x_{G_j}\|_2}$, this leads to the following equation for $\gamma$:
$$\sum_{i\in G_j}\left(\frac{D_{ii}x_i^k - s_i}{1 + D_{ii}/\gamma}\right)^2 = d_j^2.$$
Letting $\gamma \to \infty$ on the left-hand side, we obtain the condition for the existence of a solution: $\sum_{i\in G_j}(D_{ii}x_i^k - s_i)^2 > d_j^2$. If this inequality is satisfied, $\gamma$ can be found by bisection; otherwise the optimal solution is $x_{G_j} = 0$.
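A minimal sketch of the group subproblem solution described above: closed form outside the group, bisection on γ inside the group; the function name and the toy inputs are illustrative placeholders.

```python
# Hedged sketch of the f_j-subproblem for one group G_j: closed form outside the group,
# bisection on gamma = d_j/||x_{G_j}|| inside the group; inputs are made up.
import numpy as np

def solve_group_subproblem(x_hat, s, d, Gj, dj, tol=1e-10):
    x = x_hat - s / d                                   # closed form for variables outside G_j
    r = d[Gj] * x_hat[Gj] - s[Gj]
    if np.dot(r, r) <= dj**2:                           # existence condition fails -> x_{G_j} = 0
        x[Gj] = 0.0
        return x
    phi = lambda gam: np.sum((r / (1.0 + d[Gj] / gam))**2) - dj**2
    lo, hi = tol, 1.0
    while phi(hi) < 0:                                  # grow the bracket until phi(hi) >= 0
        hi *= 2.0
    for _ in range(200):                                # bisection on gamma
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if phi(mid) < 0 else (lo, mid)
    gamma = 0.5 * (lo + hi)
    x[Gj] = r / (gamma + d[Gj])                         # recover the in-group variables
    return x

x_hat = np.array([0.5, -1.0, 2.0, 0.3]); s = np.zeros(4); d = np.full(4, 2.0)
print(solve_group_subproblem(x_hat, s, d, Gj=np.array([1, 2]), dj=0.5))
```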
Overlapping group lasso: tree-structured overlapping groups
$$\min_w\ \frac{1}{2K\lambda}\|b - Aw\|_2^2 + \sum_{j=1}^{K} d_j\|w_{T_j}\|_2$$

Table: comparison of SLIN and FISTA on the tree-structured overlapping group lasso problem.
Overlapping group lasso: fixed-order groups
We define a sequence of groups of 100 adjacent inputs with an overlap of 10 variables between two successive groups, so that G = {{1, ..., 100}, {91, ..., 190}, ..., {p - 99, ..., p}}, with p = 90|G| + 10. We compare SLIN with other algorithms when K = 100 and n = 1000; a sketch generating this group structure follows below.

Figure: objective value versus iterations for SLIN, PDMM, PA-APG, S-APG, and sADMM.
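A minimal sketch (under a 0-based indexing assumption) generating the fixed-order group structure described above.

```python
# Hedged sketch: fixed-order overlapping groups of 100 adjacent variables with an
# overlap of 10 between successive groups; 0-based indexing is an assumption here.
def fixed_order_groups(num_groups):
    p = 90 * num_groups + 10
    groups = [list(range(90 * j, 90 * j + 100)) for j in range(num_groups)]
    return groups, p

groups, p = fixed_order_groups(3)
print(p)                               # 280 = 90*3 + 10
print(groups[0][-1], groups[1][0])     # 99 and 90: the 10-variable overlap
```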
Overlapping group lasso: fixed-order groups (cont'd)

Figure: running time of SLIN and the other methods as the group number changes, with n = 1000.

Figure: running time of SLIN and the other methods as the sample size changes, with K = 100.
Overlapping group lasso: randomly generated groups
$$\min_w\ \frac{1}{2K\lambda}\|b - Aw\|_2^2 + \sum_{j=1}^{K} d_j\|w_{G_j}\|_2$$

Table: comparison of SLIN and PDMM in solving an overlapping group lasso problem with randomly generated groups.
Experiment: Regularized support vector machine problem
Consider the regularized SVM problem, where the loss function is a sum of nonsmooth functions:
$$\min_{x_0, x}\ \frac{1}{n}\sum_{j=1}^{n}\big(1 - b_j(x_0 + A_j^T x)\big)_+ + \lambda_1\|x\|_1 + \frac{\lambda_2}{2}\|x\|_2^2$$
where $\lambda_1, \lambda_2 > 0$ are regularization parameters. Due to the properties of the elastic net regularization, the optimal solution exhibits both sparsity and a grouping effect.

The f-subproblem (Option 1) has the form:
$$\min_{x, v}\ \frac{1}{n}\sum_{j=1}^{n} v_j + g_{h_1}^T x + g_{h_2}^T x + \frac{1}{2}\|x - x^k\|_D^2$$
$$\text{s.t.}\quad v_j \ge 1 - b_j(x_0 + A_j^T x), \quad v_j \ge 0, \quad j = 1, \dots, n.$$
This is a constrained quadratic programming problem and its optimal solution can be obtained by an active set method. Note that $v_j = (1 - b_j(x_0 + A_j^T x))_+$.
The fj-subproblem (Option 2). Solving for each block j, the $f_j$-subproblem has the form:
$$\min_{x, v_j}\ \frac{1}{n}v_j + \frac{1}{n}\sum_{i\ne j}^{n+1} l_i^T x + \frac{1}{2}\|x - x^k\|_D^2$$
$$\text{s.t.}\quad v_j \ge 1 - b_j(x_0 + A_j^T x), \quad v_j \ge 0,$$
where $\sum_{i\ne j}^{n+1} l_i = \sum_{i\ne j} g_{f_i} + g_{h_1} + g_{h_2}$, with $g_{f_i}$ denoting a subgradient of the function $f_i$. To solve this problem, we consider three cases.
Case 1: $v_j = 0$. We solve for x in closed form and check the condition $0 \ge 1 - b_j(x_0 + A_j^T x)$. If the condition is satisfied, we are done; otherwise go to Case 2.

Case 2: $v_j > 0$, so $v_j = 1 - b_j(x_0 + A_j^T x)$. We solve the unconstrained problem
$$\min_x\ \frac{1}{n}\big(1 - b_j(x_0 + A_j^T x)\big) + \frac{1}{n}\sum_{i\ne j}^{n+1} l_i^T x + \frac{1}{2}\|x - x^k\|_D^2.$$
We solve for x in closed form and calculate $v_j$. If $v_j > 0$, we have found the solution; otherwise go to Case 3.

Case 3: $v_j = 1 - b_j(x_0 + A_j^T x) = 0$. We solve the constrained problem
$$\min_x\ \frac{1}{n}\sum_{i\ne j}^{n+1} l_i^T x + \frac{1}{2}\|x - x^k\|_D^2 \qquad \text{s.t.}\quad 0 = 1 - b_j(x_0 + A_j^T x).$$
We compute x in closed form as a function of the Lagrange multiplier $\mu$, and then check whether $\mu$ is consistent with the linear constraint.
Experiment: Regularized support vector machine problem
n, p              Method        ρ = 0    ρ = 0.8
n=50, p=300       SLIN          0.22     0.13
                  ENSVM-ADMM    0.51     0.41
n=100, p=500      SLIN          0.65     0.43
                  ENSVM-ADMM    1.29     0.81
n=200, p=1000     SLIN          1.15     1.34
                  ENSVM-ADMM    3.71     3.85

Table: time comparison of SLIN and ADMM on support vector machine problems.
Conclusions
We introduce the selective linearization (SLIN) algorithm for multi-block non-smooth convex optimization. It is an operator-splitting type method that is globally convergent for an arbitrary number of operators without artificial duplication of variables.

Under a strong convexity condition, the SLIN algorithm is proved to converge at a rate of almost O(1/k). This is a major contribution even in the case of two blocks. The technique we developed can also be used to derive the rate of convergence of the classical bundle method and the ALIN method, for which no convergence rate estimate had been available so far.

We have carried out extensive comparison experiments on statistical learning problems such as fused lasso regularization problems and overlapping group lasso problems. The experimental results demonstrate the efficacy and accuracy of the method.
Ongoing work
Study the convergence rate of the SLIN algorithm for general convex, but not strongly convex, objective functions.

Extend SLIN to non-convex optimization problems.

Design the algorithm for multi-block convex problems with linear operators and analyze its convergence for general constrained problems:
$$\min_{x\in X}\ F(x) = f_1(x) + \sum_{i=2}^{N} f_i(Mx).$$

Design a stochastic version of the SLIN method for large-scale problems, including recommendation matrix completion, medical imaging, dictionary learning, and some deep learning problems, to better understand its practical performance.
Related papers
- Y. Du, X. Lin, A. Ruszczyński, A selective linearization method for multiblock convex optimization, SIAM Journal on Optimization, 27 (2017), 1102-1117. https://doi.org/10.1137/15M103217X
- Y. Du, A. Ruszczyński, Rate of convergence of the bundle method, Journal of Optimization Theory and Applications, 173 (2017), 908-922. https://doi.org/10.1007/s1095
References
- A. Ruszczyński, Nonlinear Optimization, Princeton University Press, Princeton, NJ, 2006.
- K. Kiwiel, C. Rosa, and A. Ruszczyński, Proximal decomposition via alternating linearization, SIAM Journal on Optimization, 9:153-172, 1999.
- X. Lin, M. Pham, and A. Ruszczyński, Alternating linearization for structured regularization problems, Journal of Machine Learning Research, 15 (2014), 3447-3481.
- R. Jenatton, J. Mairal, G. Obozinski, and F. Bach, Proximal methods for hierarchical sparse coding, Journal of Machine Learning Research, 12 (2011), 2297-2334.
- H. Bauschke and P. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC, Springer, New York, 2011.
- J. Eckstein and D. P. Bertsekas, On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators, Mathematical Programming, 55(3, Ser. A):293-318, 1992.
- S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning, 3(1):1-122, 2010.
- J. Huang, S. Zhang, and D. Metaxas, Efficient MR image reconstruction for compressed MR imaging, Medical Image Analysis (MedIA), 15(5):670-679, 2011.
- C. Demiralp, E. Hayden, J. Hammerbacher, and J. Heer, Exploring high-dimensional RNA sequences from in vitro selection, IEEE Biological Data Visualization (BioVis), 2013.
- J. Zhou, P. Gong, Z. Wang, and J. Ye, Mining structured sparsity beyond convexity, ICDM Tutorial, 2015.
- R. Tibshirani and P. Wang, Spatial smoothing and hot spot detection for CGH data using the fused lasso, Biostatistics, 2007.
- T. Kwartler, Intro to text mining using TM, OpenNLP and topic models, Open Data Science Conference, 2015.
- C. Chen, B. He, Y. Ye, and X. Yuan, The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent, Mathematical Programming, 2015.
- N. Chatzipanagiotis, D. Dentcheva, and M. M. Zavlanos, An augmented Lagrangian method for distributed optimization, Mathematical Programming, Ser. A, 152 (2015), No. 1, 405-434.
- P. L. Combettes and J. Eckstein, Asynchronous block-iterative primal-dual decomposition methods for monotone inclusions, Mathematical Programming, published online 2016-07-05.
- J. Eckstein, A simplified form of block-iterative operator splitting and an asynchronous algorithm resembling the multi-block ADMM, Optimization Online working paper 2016-7-5533, July 2016.
- K. Kiwiel, Methods of Descent for Nondifferentiable Optimization, Volume 1133 of Lecture Notes in Mathematics, Springer-Verlag, Berlin, 1985.
- H. Wang, A. Banerjee, and Z.-Q. Luo, Parallel direction method of multipliers, Neural Information Processing Systems (NIPS), 2014.
- T. Lin, S. Ma, and S. Zhang, On the convergence rate of multi-block ADMM, submitted.
- M. Hong and Z. Luo, On the linear convergence of the alternating direction method of multipliers, Mathematical Programming, 2016.
- W. Deng, M. Lai, Z. Peng, and W. Yin, Parallel multi-block ADMM with o(1/k) convergence, UCLA CAM 13-64, 2013.
- B. C. Vu, A splitting algorithm for dual monotone inclusions involving cocoercive operators, Advances in Computational Mathematics, 38(3):667-681, 2013.
- Y. Yu, Better approximation and faster algorithm using the proximal average, Neural Information Processing Systems (NIPS), 2012.
- X. Chen, Q. Lin, S. Kim, J. G. Carbonell, and E. P. Xing, Smoothing proximal gradient method for general structured sparse regression, The Annals of Applied Statistics, 6:719-752, 2012.
- G. Ye, Y. Chen, and X. Xie, Efficient variable selection in support vector machines via the alternating direction method of multipliers, Proceedings of Machine Learning Research, 15:832-840, 2011.
- D. Goldfarb, S. Ma, and K. Scheinberg, Fast alternating linearization methods for minimizing the sum of two convex functions, Mathematical Programming, 141 (2013), 349.