An adaptive backtracking strategy for non-smooth composite optimisation problems
Luca Calatroni, CMAP, Centre de Mathématiques Appliquées, École Polytechnique, Palaiseau
joint work with: A. Chambolle
CMIPI 2018 Workshop, University
Table of contents
- 1. Introduction
- 2. GFISTA with backtracking
- 3. Accelerated convergence rates
- 4. Imaging applications
- 5. Conclusions & outlook
Introduction
Gradient based methods: a review

(X, ‖·‖) Hilbert space. Given f : X → R convex, l.s.c., with x* ∈ arg min f, we want to solve:
min_{x∈X} f(x)
If f is differentiable with Lf -Lipschitz gradient, explicit gradient descent reads:
Algorithm 1 Gradient descent with fixed step.
Input: 0 < τ ≤ 2/Lf, x0 ∈ X.
for k ≥ 0 do
    x^{k+1} = x^k − τ∇f(x^k)
end for
Quite restrictive smoothness assumption!
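As a quick illustration (my own, not from the slides), a minimal NumPy sketch of Algorithm 1 on a toy least-squares objective, where Lf is the largest eigenvalue of AᵀA:

```python
import numpy as np

# Toy smooth objective f(x) = 0.5*||A x - b||^2 with gradient A^T(A x - b);
# its Lipschitz constant is L_f = ||A^T A|| (largest eigenvalue of A^T A).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
Lf = np.linalg.norm(A.T @ A, 2)
grad_f = lambda x: A.T @ (A @ x - b)

tau = 1.0 / Lf           # any 0 < tau <= 2/Lf gives convergence
x = np.zeros(5)
for k in range(500):     # Algorithm 1: explicit gradient descent with a fixed step
    x = x - tau * grad_f(x)

print("gradient norm:", np.linalg.norm(grad_f(x)))   # ~0 at the minimiser
```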
No further assumptions on ∇f : use implicit gradient descent.
Algorithm 2 Implicit (proximal) gradient descent with fixed step.
Input: τ > 0, x0 ∈ X.
for k ≥ 0 do
    x^{k+1} = prox_{τf}(x^k)  (= x^k − τ∇f(x^{k+1}))
end for
Note: the iteration can be rewritten as x^{k+1} = x^k − τ∇f_τ(x^k), with
f_τ(x^k) := min_{x∈X} f(x) + ‖x − x^k‖²/(2τ),
the Moreau-Yosida regularisation of f, whose gradient is 1/τ-Lipschitz ⇒ explicit gradient descent on f_τ. Same theory applies!
References: Brezis-Lions ('73, '78), Güler ('91), ...
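A minimal sketch (my own illustration, not from the slides) of the implicit iteration on the same toy least-squares problem, where prox_{τf} has the closed form (I + τAᵀA)⁻¹(v + τAᵀb):

```python
import numpy as np

# Implicit (proximal point) iteration x^{k+1} = prox_{tau f}(x^k) for
# f(x) = 0.5*||A x - b||^2, whose prox has the closed form
# prox_{tau f}(v) = (I + tau*A^T A)^{-1} (v + tau*A^T b).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
tau = 10.0                              # no step-size restriction is needed here
M = np.eye(5) + tau * (A.T @ A)

x = np.zeros(5)
for k in range(100):
    x = np.linalg.solve(M, x + tau * A.T @ b)

print("gradient norm:", np.linalg.norm(A.T @ (A @ x - b)))
```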
Convergence rates

Theorem: O(1/k) rate
Let x0 ∈ X and τ ≤ 2/Lf. Then the sequence (x^k) of gradient descent iterates converges to x* and satisfies:
f(x^k) − f(x^*) ≤ (1/(2τk)) ‖x^* − x^0‖².

Assume f is µf-strongly convex, µf > 0:
f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µf/2)‖y − x‖², for all x, y ∈ X.

Theorem: Linear rate for strongly convex objectives
Let f be µf-strongly convex, x0 ∈ X and τ ≤ 2/(Lf + µf). Then the sequence (x^k) of gradient descent iterates satisfies:
f(x^k) − f(x^*) + (1/(2τ))‖x^k − x^*‖² ≤ (ω^k/(2τ)) ‖x^* − x^0‖², with ω = (1 − µf/Lf)/(1 + µf/Lf) < 1.

References: Bertsekas '15, Nesterov '04
Lower bounds¹

Theorem (Lower bounds)
Let x0 ∈ R^n, Lf > 0 and k < n. Then, for any first-order method there exists a convex C¹ function f with Lf-Lipschitz gradient such that:
- 1. convex case:
f(x^k) − f(x^*) ≥ (Lf/(8(k+1)²)) ‖x^* − x^0‖².
- 2. strongly convex case:
f(x^k) − f(x^*) ≥ (µf/2) ((√q − 1)/(√q + 1))^{2k} ‖x^* − x^0‖², where q = Lf/µf ≥ 1.

Remark: If k ≥ n we could use conjugate gradient! However, for imaging n ≫ 1!
Usually k < n: can we improve convergence speed?

¹Nesterov '04
Nesterov acceleration for gradient descent²
To make it faster, build an extrapolated sequence (inertia).

Algorithm 3 Nesterov accelerated gradient descent with fixed step.
Input: 0 < τ ≤ 1/Lf, x0 = x−1 = y0 ∈ X, t0 = 0.
for k ≥ 0 do
    t_{k+1} = (1 + √(1 + 4t_k²)) / 2
    y^k = x^k + ((t_k − 1)/t_{k+1}) (x^k − x^{k−1})
    x^{k+1} = y^k − τ∇f(y^k)
end for
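A minimal NumPy sketch (my own illustration, under the same toy least-squares setup as before) of Algorithm 3:

```python
import numpy as np

# Nesterov accelerated gradient descent (Algorithm 3) on f(x) = 0.5*||A x - b||^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
Lf = np.linalg.norm(A.T @ A, 2)
grad_f = lambda x: A.T @ (A @ x - b)

tau = 1.0 / Lf
x_prev = x = np.zeros(5)
t = 0.0
for k in range(200):
    t_next = (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0
    y = x + (t - 1.0) / t_next * (x - x_prev)   # extrapolation (inertia)
    x_prev, x = x, y - tau * grad_f(y)          # gradient step at the extrapolated point
    t = t_next

print("gradient norm:", np.linalg.norm(grad_f(x)))
```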
²Nesterov '83, '04, Güler '92
Theorem (Acceleration)
Let τ ≤ 1/Lf and (x^k) the sequence generated by the accelerated gradient descent algorithm. Then:
f(x^k) − f(x^*) ≤ (2/(τ(k+1)²)) ‖x^0 − x^*‖².
Standard problem in imaging: composite structure

Variational regularisation of ill-posed inverse problems
Compute a reconstructed version of a given degraded image f by solving:
min_{u∈X} {F(u) := R(u) + λD(u, f)},  λ > 0,
with non-smooth regularisation R and smooth data fidelity D.

Examples in inverse problems/imaging:
- R(u) = TV, ICTV, TGV, ℓ1 (Rudin, Osher, Fatemi '92, Chambolle-Lions '97, Bredies '10)
- D(u, f) = ‖u − f‖²₂ (Gaussian noise, Rudin, Osher, Fatemi '92), D(u, f) = ‖u − f‖_{1,γ} (Laplace/impulse noise, Nikolova '04), D(u, f) = KL_γ(u, f) (Poisson noise, Burger, Sawatzky, Brune, Müller '09), ...
Composite optimisation

We want to solve:
min_{x∈X} {F(x) := f(x) + g(x)}
- f is smooth: differentiable, convex, with Lipschitz gradient:
  ‖∇f(y) − ∇f(x)‖ ≤ Lf ‖y − x‖, for any x, y ∈ X;
- g is convex, l.s.c., non-smooth, with easy proximal map.

Composite optimisation problem ⇒ Forward-Backward splitting³:
- forward gradient descent step in f;
- backward implicit gradient descent step in g.

Basic algorithm: take x0 ∈ X, fix τ > 0 and for k ≥ 0 do:
x^{k+1} = prox_{τg}(x^k − τ∇f(x^k)) =: T_τ x^k.

Rate of convergence: O(1/k).

³Combettes, Wajs '05, Nesterov '13, ...
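A minimal sketch (my own, not from the slides) of the forward-backward iteration for the classical LASSO choice f(x) = ½‖Ax − b‖², g(x) = λ‖x‖₁, whose prox is soft-thresholding:

```python
import numpy as np

# Forward-backward (proximal gradient / ISTA) for min_x 0.5*||A x - b||^2 + lam*||x||_1.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
lam = 0.5
Lf = np.linalg.norm(A.T @ A, 2)
tau = 1.0 / Lf

def soft_threshold(v, t):
    # prox of t*||.||_1: componentwise shrinkage towards zero
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.zeros(10)
for k in range(1000):
    x = soft_threshold(x - tau * A.T @ (A @ x - b), tau * lam)  # x^{k+1} = T_tau x^k

print("objective:", 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum())
```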
Accelerated forward-backward, FISTA: previous work

In Nesterov '04 and Beck, Teboulle '09, accelerated O(1/k²) convergence is achieved by extrapolation (as above). Further properties:
- convergence of iterates (Chambolle, Dossal '15);
- monotone variants (Beck, Teboulle '09, Tseng '08, Tao, Boley, Zhang '15);
- acceleration for inexact evaluation of operators (Villa, Salzo, Baldassarre, Verri '13, Bonettini, Prato, Rebegoldi '18).

Questions
- 1. Can we say more when f and/or g are strongly convex? Linear convergence?
- 2. Can we let the gradient step (proximal parameter) vary along the iterations AND preserve acceleration?
A strongly convex variant of FISTA (GFISTA)

Let µf, µg ≥ 0 and set µ = µf + µg. For τ > 0 define: q := τµ/(1 + τµg) ∈ [0, 1).

Algorithm 5 GFISTA⁴ (no backtracking)
Input: 0 < τ ≤ 1/Lf, x0 = x−1 ∈ X and let t0 ∈ R s.t. 0 ≤ t0 ≤ 1/√q.
for k ≥ 0 do
    t_{k+1} = (1 − q t_k² + √((1 − q t_k²)² + 4 t_k²)) / 2
    β_k = ((t_k − 1)/t_{k+1}) · (1 + τµg − t_{k+1}τµ)/(1 − τµf)
    y^k = x^k + β_k (x^k − x^{k−1})
    x^{k+1} = T_τ y^k = prox_{τg}(y^k − τ∇f(y^k))
end for

Remark: µ = q = 0 ⇒ standard FISTA.

⁴Chambolle, Pock '16
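A rough sketch (my own, not from the slides; function names and the toy usage are illustrative) of the GFISTA parameter updates and iteration for generic grad_f and prox_g callables:

```python
import numpy as np

def gfista(grad_f, prox_g, x0, tau, mu_f=0.0, mu_g=0.0, iters=500):
    # GFISTA without backtracking (cf. Algorithm 5); grad_f and prox_g are callables.
    mu = mu_f + mu_g
    q = tau * mu / (1.0 + tau * mu_g)           # q = tau*mu/(1 + tau*mu_g) in [0, 1)
    x_prev, x = x0.copy(), x0.copy()
    t = 0.0
    for _ in range(iters):
        t_next = (1.0 - q * t**2 + np.sqrt((1.0 - q * t**2) ** 2 + 4.0 * t**2)) / 2.0
        beta = (t - 1.0) / t_next * (1.0 + tau * mu_g - t_next * tau * mu) / (1.0 - tau * mu_f)
        y = x + beta * (x - x_prev)
        x_prev, x = x, prox_g(y - tau * grad_f(y), tau)   # x^{k+1} = T_tau(y^k)
        t = t_next
    return x

# Toy usage: mu_f = mu_g = 0 recovers plain FISTA on the LASSO problem above.
rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((30, 10)), rng.standard_normal(30), 0.5
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
x = gfista(lambda z: A.T @ (A @ z - b), lambda v, t: soft(v, t * lam),
           np.zeros(10), tau=1.0 / np.linalg.norm(A.T @ A, 2))
print("objective:", 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum())
```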
GFISTA: acceleration results

Theorem [Chambolle, Pock '16]
Let τ ≤ 1/Lf and 0 ≤ t0√q ≤ 1. Then the sequence (x^k) of GFISTA iterates satisfies
F(x^k) − F(x^*) ≤ r_k(q) [ t0² (F(x^0) − F(x^*)) + ((1 + τµg)/(2τ)) ‖x^0 − x^*‖² ],
where x^* is a minimiser of F and:
r_k(q) = min{ 4/(k+1)², (1 + √q)(1 − √q)^k, (1 − √q)^k / t0² }.

Note: for µ = q = 0, t0 = 0 this is the standard FISTA convergence result.

Question: what if an estimate of Lf is not available? Backtracking!
Backtracking idea

For plain 2D gradient descent:
[Figures: too small vs. too big step size τ; Armijo line-search]

Armijo rule: choose τk = 1/2^i where i ∈ N is the minimum integer s.t.
f(x^{k+1}) − f(x^k) ≤ β τk ∇f(x^k)ᵀ(x^{k+1} − x^k),  0 < β < 1
(a sketch follows below).

FISTA + backtracking:
- Armijo-type (Beck, Teboulle '09): τ_{k+1} ≤ τ_k for every k.
- Full backtracking (Scheinberg, Goldfarb, Bai '14): larger steps in “flat” areas!
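A minimal sketch (my own, not from the slides) of an Armijo-type halving line search, written in the standard sufficient-decrease form f(x + τd) ≤ f(x) + βτ⟨∇f(x), d⟩ with d = −∇f(x):

```python
import numpy as np

def armijo_step(f, grad_f, x, beta=0.5, tau0=1.0, max_halvings=50):
    # Halve tau until the sufficient-decrease condition
    # f(x + tau*d) <= f(x) + beta*tau*<grad_f(x), d>  holds, with d = -grad_f(x).
    g = grad_f(x)
    d = -g
    tau = tau0
    for _ in range(max_halvings):
        if f(x + tau * d) <= f(x) + beta * tau * (g @ d):
            break
        tau *= 0.5
    return x + tau * d, tau

# Toy usage on f(x) = 0.5*||A x - b||^2.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad_f = lambda x: A.T @ (A @ x - b)
x = np.zeros(5)
for k in range(100):
    x, tau = armijo_step(f, grad_f, x)
print("gradient norm:", np.linalg.norm(grad_f(x)))
```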
GFISTA with backtracking
Backtracking strategy and Bregman distance

General idea: check whether, for every x ∈ X,
F(x^{k+1}) + (1 + τµg)‖x − x^{k+1}‖²/(2τ) + ‖x^{k+1} − y^k‖²/(2τ) − D_f(x^{k+1}, y^k) ≤ F(x) + (1 − τµf)‖x − y^k‖²/(2τ),
where D_f(x^{k+1}, y^k) := f(x^{k+1}) − f(y^k) − ⟨∇f(y^k), x^{k+1} − y^k⟩ is the Bregman distance of f between x^{k+1} = T_τ y^k and y^k.

Constant steps
Such a condition is verified as long as
τ ≤ ‖x^{k+1} − y^k‖² / (2 D_f(x^{k+1}, y^k)) ∼ 1/L_k,   (*)
which is always true if τ ≤ 1/Lf with a known estimate Lf. Alternatively, one can check (*) along the iterations: this amounts to computing a local Lipschitz Constant Estimate (LCE).
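A minimal sketch (my own; helper names are hypothetical) of the check (*) via the Bregman distance, also returning the local Lipschitz estimate it implies:

```python
import numpy as np

def bregman_f(f, grad_f, x_new, y):
    # D_f(x_new, y) = f(x_new) - f(y) - <grad f(y), x_new - y>
    return f(x_new) - f(y) - grad_f(y) @ (x_new - y)

def backtracking_condition(f, grad_f, x_new, y, tau):
    # Condition (*): tau <= ||x_new - y||^2 / (2 D_f(x_new, y)), i.e. the local
    # Lipschitz estimate L_k = 2 D_f / ||x_new - y||^2 satisfies tau <= 1/L_k.
    D = bregman_f(f, grad_f, x_new, y)
    if D <= 0.0:                       # convexity gives D >= 0; any tau passes here
        return True, 0.0
    L_local = 2.0 * D / np.dot(x_new - y, x_new - y)
    return tau <= 1.0 / L_local, L_local
```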
GFISTA with backtracking: Algorithm

For any k ≥ 0 we let τ = τk and define:
τ'_k = τk/(1 − τk µf),   q_k = µτk/(1 + τk µg).

Update rule for extrapolation: for any k ≥ 0 set
t_{k+1} = ( 1 − (q_{k+1}/(1 − q_{k+1})) (τ'_k/τ'_{k+1}) t_k² + √( ((q_{k+1}/(1 − q_{k+1})) (τ'_k/τ'_{k+1}) t_k² − 1)² + 4 (τ'_k/τ'_{k+1}) t_k²/(1 − q_{k+1}) ) ) / 2
Algorithm 6 GFISTA with backtracking
Input: µf, µg, τ0 > 0, q0, ρ ∈ (0, 1), x0 = x−1 ∈ X and t0 ∈ R s.t. 0 ≤ t0 ≤ 1/√q0.
for k ≥ 0 do
    y^k = x^k + β_k (x^k − x^{k−1}). Set i_bt = 0;
    if too close to LCE then
        while backtracking condition (*) is not verified & i_bt ≤ i_max do
            keep/reduce step-size: τ_{k+1} = ρ^{i_bt} τ_k;
            compute x^{k+1} = T_{τ_{k+1}} y^k = prox_{τ_{k+1}g}(y^k − τ_{k+1}∇f(y^k))   (1)
            i_bt = i_bt + 1;
        end while
    else if far enough from LCE then
        increase step-size: τ_{k+1} = τ_k/ρ;
        compute x^{k+1} using (1);
    end if
    Update q_{k+1}, τ'_{k+1}, t_{k+1}.
    Set β_{k+1} = ((1 − q_{k+1} t_{k+1})/(1 − q_{k+1})) · ((t_k − 1)/t_{k+1}).
end for

Too close/too far: how tight is (*)? Reduce costs due to (1).
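A rough sketch (my own; the step increase/decrease logic is simplified relative to Algorithm 6, and the convex case µ = 0 is assumed) of an adaptive backtracking loop built on condition (*):

```python
import numpy as np

def fista_adaptive_backtracking(f, grad_f, prox_g, x0, tau0=1.0, rho=0.9,
                                iters=300, i_max=50):
    # Convex case (mu_f = mu_g = 0) of an adaptive backtracking FISTA loop:
    # try a larger step each iteration, then shrink by rho until (*) holds.
    x_prev, x = x0.copy(), x0.copy()
    t, tau, beta = 0.0, tau0, 0.0
    for _ in range(iters):
        y = x + beta * (x - x_prev)
        g_y, f_y = grad_f(y), f(y)
        tau_old, tau = tau, tau / rho                  # optimistic step increase
        for _ in range(i_max):
            x_new = prox_g(y - tau * g_y, tau)
            # condition (*): 2*tau*D_f(x_new, y) <= ||x_new - y||^2
            D = f(x_new) - f_y - g_y @ (x_new - y)
            if 2.0 * tau * D <= np.dot(x_new - y, x_new - y) + 1e-12:
                break
            tau *= rho                                 # shrink and retry
        # parameter updates (q = 0, so tau' = tau); matches the Scheinberg et al. rule
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * (tau_old / tau) * t**2)) / 2.0
        beta = (t - 1.0) / t_next
        x_prev, x = x, x_new
        t = t_next
    return x

# Toy usage on the LASSO problem from before.
rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((30, 10)), rng.standard_normal(30), 0.5
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
x = fista_adaptive_backtracking(lambda z: 0.5 * np.linalg.norm(A @ z - b) ** 2,
                                lambda z: A.T @ (A @ z - b),
                                lambda v, t: soft(v, t * lam), np.zeros(10))
print("objective:", 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum())
```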
Analogies/differences with FISTA-type algorithms: update rule

Consider again the update rule for t_{k+1} above.

No backtracking, convex case
If µ = q_k = 0 and τ_k = τ_{k+1} for any k ≥ 0, this is the FISTA update rule.

FISTA with adaptive backtracking
If µ = q_k = 0 for any k ≥ 0, the rule reduces to:
t_{k+1} = (1 + √(1 + 4 (τ_k/τ_{k+1}) t_k²)) / 2,
which is the same as the one proposed by Scheinberg et al. '14 for fast adaptive backtracking.
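A tiny numerical check (my own, not from the slides) that for constant steps and q = 0 the general update coincides with the classical FISTA rule:

```python
import numpy as np

def t_general(t, tau_ratio, q):
    # General update with a = (q/(1-q)) * (tau'_k/tau'_{k+1}) * t_k^2
    a = q / (1.0 - q) * tau_ratio * t**2
    return (1.0 - a + np.sqrt((a - 1.0) ** 2 + 4.0 * tau_ratio * t**2 / (1.0 - q))) / 2.0

def t_fista(t):
    return (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0

t = 1.0
for _ in range(5):
    print(t_general(t, tau_ratio=1.0, q=0.0), t_fista(t))  # identical columns
    t = t_fista(t)
```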
Accelerated convergence rates
Convergence rates: worst-case analysis

Define:
L_w := max{Lf/ρ, ρL0},   q_w := µ/(L_w + µg),
with q_w the worst-case inverse condition number.

Theorem
Let x0 ∈ X, τ0 > 0 and let (x^k) be the sequence produced by the GFISTA algorithm with backtracking. If t0 ≥ 0 and √q0 t0 ≤ 1, we have:
F(x^k) − F(x^*) ≤ r_k (L_w − µf) [ (τ0 t0²/(1 − µf τ0)) (F(x^0) − F(x^*)) + ½ ‖x^0 − x^*‖² ],
where the decay rate is defined as:
r_k := min{ 4/(k+1)², (1 − √q_w)^{k−1}, (1 − √q_w)^k / t0² }.

Disclaimer
Compare the recent work by Florea, Vorobyov (preprint, '17), where the same result is obtained via a generalised estimate sequence argument.
Monotone variants (M-GFISTA)

In order to make the objective values non-increasing⁵, we can simply set:
y^k = x^k + β_k [ (x^k − x^{k−1}) + (t_k/(t_k − 1)) (T_{τ_k}(y^{k−1}) − x^k) ].
This suggests an easy rule to select x^{k+1} at any iteration:
x^{k+1} = T_{τ_{k+1}}(y^k) if F(T_{τ_{k+1}}(y^k)) ≤ F(x^k), and x^{k+1} = x^k otherwise.
Same computations, same convergence rates.

⁵Beck, Teboulle '09, Tseng '08, Tao, Boley, Zhang '16
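A minimal sketch (my own; names are illustrative) of the monotone selection step, wrapping a generic forward-backward operator T_tau:

```python
def monotone_select(F, T_tau, y, x_old):
    # Accept the new forward-backward point only if it does not increase F;
    # otherwise keep the previous iterate (monotone / "M-" variant).
    x_cand = T_tau(y)
    return x_cand if F(x_cand) <= F(x_old) else x_old
```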
Imaging applications
Huber-TV Gaussian denoising

Given noisy u0 ∈ R^{m×n} corrupted by noise N(0, σ²), use the TV ROF⁶ model:
min_u λ‖Du‖_{2,1} + ½‖u − u0‖²_2,   ‖Du‖_{2,1} = Σ_{i,j=1}^{m,n} √((Du)²_{i,j,1} + (Du)²_{i,j,2}),
where Du is the finite-difference-discretised gradient and λ > 0.

Strongly convex variant: for ε ≪ 1, replace |·| by the C¹ Huber function
h_ε(t) := t²/(2ε) for |t| ≤ ε,   |t| − ε/2 for |t| > ε,
which gives the Huber-TV model
min_u λH_ε(u) + ½‖u − u0‖²_2,   H_ε(u) := Σ_{i,j=1}^{m,n} h_ε(√((Du)²_{i,j,1} + (Du)²_{i,j,2})).

⁶Rudin, Osher, Fatemi '92
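A short sketch (my own; the forward-difference discretisation is an illustrative choice) of the Huber function h_ε and the resulting smoothed TV energy H_ε:

```python
import numpy as np

def huber(t, eps):
    # h_eps(t) = t^2/(2 eps) for |t| <= eps, |t| - eps/2 otherwise (C^1, convex)
    t = np.abs(t)
    return np.where(t <= eps, t**2 / (2 * eps), t - eps / 2)

def H_eps(u, eps):
    # Huberised total variation of a 2D image u, forward differences, Neumann boundary
    dx = np.diff(u, axis=1, append=u[:, -1:])   # (Du)_{i,j,1}
    dy = np.diff(u, axis=0, append=u[-1:, :])   # (Du)_{i,j,2}
    return huber(np.sqrt(dx**2 + dy**2), eps).sum()

u = np.random.default_rng(0).random((8, 8))
print(H_eps(u, eps=0.01))
```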
Huber-TV Gaussian denoising: dual formulation

The Huber-TV dual problem reads:
min_p ½‖D*p − u0‖²_2 + (ε/(2λ))‖p‖²_2 + δ_{{‖·‖_{2,∞} ≤ λ}}(p),
with smooth part “f”(p) = ½‖D*p − u0‖²_2 and non-smooth part “g”(p) = (ε/(2λ))‖p‖²_2 + δ_{{‖·‖_{2,∞} ≤ λ}}(p), where D* is the discretised negative finite-difference divergence and:
δ_{{‖·‖_{2,∞} ≤ λ}}(p) = 0 if |p_{i,j}|₂ ≤ λ for any i, j, and +∞ otherwise.

Note:
- ∇f(p) = D(D*p − u0) ⇒ Lf ≤ 8;
- prox_{τg} is easy to compute and µg = µ = ε/λ.

Use monotone GFISTA with backtracking...
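A minimal sketch (my own; the discretisation details are illustrative) of ∇f and prox_{τg} for this dual problem: the prox combines the ℓ₂-shrinkage coming from the ε-term with a pointwise projection onto the ‖·‖_{2,∞} ≤ λ ball:

```python
import numpy as np

def D(u):
    # Forward-difference discrete gradient, shape (m, n, 2), Neumann boundary.
    g = np.zeros(u.shape + (2,))
    g[:, :-1, 0] = u[:, 1:] - u[:, :-1]
    g[:-1, :, 1] = u[1:, :] - u[:-1, :]
    return g

def D_adj(p):
    # Adjoint of D (negative discrete divergence), so that <D u, p> = <u, D_adj p>.
    u = np.zeros(p.shape[:2])
    u[:, :-1] -= p[:, :-1, 0]; u[:, 1:] += p[:, :-1, 0]
    u[:-1, :] -= p[:-1, :, 1]; u[1:, :] += p[:-1, :, 1]
    return u

def grad_f(p, u0):
    # Gradient of f(p) = 0.5*||D_adj(p) - u0||^2; Lipschitz constant <= ||D||^2 <= 8.
    return D(D_adj(p) - u0)

def prox_g(p, tau, eps, lam):
    # prox of g(p) = eps/(2*lam)*||p||^2 + indicator{ |p_ij|_2 <= lam }:
    # shrink by the quadratic term, then project pointwise onto the l2 ball of radius lam.
    q = p / (1.0 + tau * eps / lam)
    norms = np.maximum(np.sqrt((q**2).sum(axis=-1, keepdims=True)), 1e-12)
    return q * np.minimum(1.0, lam / norms)

# One forward-backward step on the dual, just to show the intended usage.
u0 = np.random.default_rng(0).random((8, 8))
p = np.zeros(u0.shape + (2,))
p = prox_g(p - 0.1 * grad_f(p, u0), 0.1, eps=0.01, lam=0.1)
```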
Huber-TV Gaussian denoising: results
Parameters: u0 ∈ R^{256×256}, σ² = 0.005, ε = 0.01, λ = 0.1. We have µ = 0.1.
[Figure: (a) Ground truth, (b) u0, (c) Reference u*]
[Figure 1: Convergence rates and Lipschitz constant estimate, underestimating L0 = 5. GFISTA parameters: ρ = 0.9, t0 = 1, p0 = Du0.]
[Figure: Convergence rates and Lipschitz constant estimate, overestimating L0 = 20. GFISTA parameters: ρ = 0.9, t0 = 1, p0 = Du0.]

Remark: O(1/k²) convergence of naive FISTA (µ = 0).
Strongly convex TV-Poisson denoising: primal formulation

Poisson noise is typical in astronomy and microscopy imaging... For ε ≪ 1, consider the ε-strongly convex TV-Poisson denoising model:
min_u λ‖Du‖_{2,1} + (ε/2)‖u‖²_2 + ˜KL(u0, u),
with non-smooth part “g”(u) = λ‖Du‖_{2,1} + (ε/2)‖u‖²_2 and smooth part “f”(u) = ˜KL(u0, u), a differentiable version of the Kullback-Leibler function⁷:
˜KL(u0, u) := Σ_{i,j=1}^{m,n} { u_{i,j} + b_{i,j} − u0_{i,j} + u0_{i,j} log( u0_{i,j}/(u_{i,j} + b_{i,j}) )   if u_{i,j} ≥ 0;
  (u0_{i,j}/(2b²_{i,j})) u²_{i,j} + (1 − u0_{i,j}/b_{i,j}) u_{i,j} + b_{i,j} − u0_{i,j} + u0_{i,j} log( u0_{i,j}/b_{i,j} )   else },
and b ∈ R^{m×n} is the background image. We can crudely estimate Lf = max_{i,j} u0_{i,j}/b²_{i,j}, for u0, b > 0. Moreover, prox_{τg} can be computed by solving a TV ROF model.

⁷Chambolle, Ehrhardt, Richtárik, Schönlieb '17
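A short sketch (my own; the formula is transcribed from the slide, so treat it as illustrative) of the smoothed KL data term and the crude Lipschitz estimate for its gradient:

```python
import numpy as np

def kl_smoothed(u, u0, b):
    # Differentiable KL: the usual expression for u >= 0, a quadratic extension for u < 0.
    out = np.empty_like(u, dtype=float)
    m = u >= 0
    out[m] = u[m] + b[m] - u0[m] + u0[m] * np.log(u0[m] / (u[m] + b[m]))
    out[~m] = (u0[~m] / (2 * b[~m]**2) * u[~m]**2 + (1 - u0[~m] / b[~m]) * u[~m]
               + b[~m] - u0[~m] + u0[~m] * np.log(u0[~m] / b[~m]))
    return out.sum()

def lipschitz_estimate(u0, b):
    # Crude bound L_f = max_{i,j} u0_{i,j} / b_{i,j}^2 (requires u0, b > 0)
    return (u0 / b**2).max()

rng = np.random.default_rng(0)
u0 = rng.random((16, 16)) + 0.1
b = np.full_like(u0, 0.1)
u = rng.standard_normal((16, 16))
print(kl_smoothed(u, u0, b), lipschitz_estimate(u0, b))
```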
Strongly convex TV-Poisson denoising: results
Parameters: u0 ∈ R^{256×256}, ε = µ = 0.15, λ = 0.2; b constant, Lf ≤ 45.
[Figure: (a) Ground truth, (b) u0, (c) Reference u*]
[Figure 2: Convergence rates and Lipschitz constant estimate, overestimating L0 = 60. GFISTA parameters: ρ = 0.8, t0 = 1, initialisation u^0 = u0. Relative objective: (F(u^k) − F(u^*))/(F(u^0) − F(u^*)).]
[Figure: Monotone decay of the objective with/without backtracking.]
Conclusions & outlook
Conclusions & outlook

Take-home messages:
- If µf, µg > 0, linear convergence can be shown for GFISTA;
- adaptive backtracking provides a local estimate Lk of the Lipschitz constant along the iterations;
- GFISTA with backtracking can be easily implemented and used for imaging applications!

Outlook:
- Estimates of µf and µg? Restarting! (O'Donoghue, Candès, 2012);
- milder (non-Lipschitz) differentiability assumptions (Salzo '17)?
Main references

- L. Calatroni, A. Chambolle, Backtracking strategies for accelerated descent methods with smooth composite objectives, arXiv:1709.09004, 2017.
- K. Scheinberg, D. Goldfarb, X. Bai, Fast first-order methods for composite convex optimization with backtracking, Foundations of Computational Mathematics 14(3), 2014.
- M. I. Florea, S. Vorobyov, A generalized accelerated composite gradient method: uniting Nesterov's fast gradient method and FISTA, arXiv:1705.10266, 2017.
- A. Chambolle, T. Pock, An introduction to continuous optimization for imaging, Acta Numerica 25, 2016.