An adaptive backtracking strategy for non-smooth composite optimisation problems
Luca Calatroni, CMAP, Centre de Mathématiques Appliquées, École Polytechnique, Palaiseau
joint work with: A. Chambolle
CMIPI 2018 Workshop, University
Table of contents
- 1. Introduction
- 2. GFISTA with backtracking
- 3. Accelerated convergence rates
- 4. Imaging applications
- 5. Conclusions & outlook
Introduction
Gradient based methods: a review

(X, ‖·‖) Hilbert space. Given f : X → R convex, l.s.c., with x* ∈ arg min f, we want to solve:
min_{x∈X} f(x)
If f is differentiable with Lf -Lipschitz gradient, explicit gradient descent reads:
Algorithm 1 Gradient descent with fixed step.
Input: 0 < τ ≤ 2/Lf, x0 ∈ X.
for k ≥ 0 do
    x^{k+1} = x^k − τ∇f(x^k)
end for
Quite restrictive smoothness assumption!
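As a quick illustration (my own, not from the slides), a minimal NumPy sketch of Algorithm 1 on a toy least-squares objective, where Lf is the largest eigenvalue of AᵀA:

```python
import numpy as np

# Toy smooth objective f(x) = 0.5*||A x - b||^2 with gradient A^T(A x - b);
# its Lipschitz constant is L_f = ||A^T A|| (largest eigenvalue of A^T A).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
Lf = np.linalg.norm(A.T @ A, 2)
grad_f = lambda x: A.T @ (A @ x - b)

tau = 1.0 / Lf           # any 0 < tau <= 2/Lf gives convergence
x = np.zeros(5)
for k in range(500):     # Algorithm 1: explicit gradient descent with a fixed step
    x = x - tau * grad_f(x)

print("gradient norm:", np.linalg.norm(grad_f(x)))   # ~0 at the minimiser
```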
No further assumptions on ∇f : use implicit gradient descent.
Algorithm 2 Implicit (proximal) gradient descent with fixed step.
Input: τ > 0, x0 ∈ X.
for k ≥ 0 do
    x^{k+1} = prox_{τf}(x^k)  (= x^k − τ∇f(x^{k+1}))
end for
Note: the iteration can be rewritten as x^{k+1} = x^k − τ∇f_τ(x^k), with
f_τ(x^k) := min_{x∈X} f(x) + ‖x − x^k‖²/(2τ),
the Moreau-Yosida regularisation of f, whose gradient is 1/τ-Lipschitz ⇒ explicit gradient descent on f_τ. Same theory applies!
References: Brezis-Lions ('73, '78), Güler ('91), ...
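A minimal sketch (my own illustration, not from the slides) of the implicit iteration on the same toy least-squares problem, where prox_{τf} has the closed form (I + τAᵀA)⁻¹(v + τAᵀb):

```python
import numpy as np

# Implicit (proximal point) iteration x^{k+1} = prox_{tau f}(x^k) for
# f(x) = 0.5*||A x - b||^2, whose prox has the closed form
# prox_{tau f}(v) = (I + tau*A^T A)^{-1} (v + tau*A^T b).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
tau = 10.0                              # no step-size restriction is needed here
M = np.eye(5) + tau * (A.T @ A)

x = np.zeros(5)
for k in range(100):
    x = np.linalg.solve(M, x + tau * A.T @ b)

print("gradient norm:", np.linalg.norm(A.T @ (A @ x - b)))
```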
Convergence rates

Theorem: O(1/k) rate
Let x0 ∈ X and τ ≤ 2/Lf. Then the sequence (x^k) of gradient descent iterates converges to x* and satisfies:
f(x^k) − f(x^*) ≤ (1/(2τk)) ‖x^* − x^0‖².

Assume f is µf-strongly convex, µf > 0:
f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µf/2)‖y − x‖², for all x, y ∈ X.

Theorem: Linear rate for strongly convex objectives
Let f be µf-strongly convex, x0 ∈ X and τ ≤ 2/(Lf + µf). Then the sequence (x^k) of gradient descent iterates satisfies:
f(x^k) − f(x^*) + (1/(2τ))‖x^k − x^*‖² ≤ (ω^k/(2τ)) ‖x^* − x^0‖², with ω = (1 − µf/Lf)/(1 + µf/Lf) < 1.

References: Bertsekas '15, Nesterov '04
Lower bounds¹

Theorem (Lower bounds)
Let x0 ∈ R^n, Lf > 0 and k < n. Then, for any first-order method there exists a convex C¹ function f with Lf-Lipschitz gradient such that:
- 1. convex case:
f(x^k) − f(x^*) ≥ (Lf/(8(k+1)²)) ‖x^* − x^0‖².
- 2. strongly convex case:
f(x^k) − f(x^*) ≥ (µf/2) ((√q − 1)/(√q + 1))^{2k} ‖x^* − x^0‖², where q = Lf/µf ≥ 1.

Remark: If k ≥ n we could use conjugate gradient! However, for imaging n ≫ 1!
Usually k < n: can we improve convergence speed?

¹Nesterov '04
Nesterov acceleration for gradient descent²
To make it faster, build an extrapolated sequence (inertia).

Algorithm 3 Nesterov accelerated gradient descent with fixed step.
Input: 0 < τ ≤ 1/Lf, x0 = x−1 = y0 ∈ X, t0 = 0.
for k ≥ 0 do
    t_{k+1} = (1 + √(1 + 4t_k²)) / 2
    y^k = x^k + ((t_k − 1)/t_{k+1}) (x^k − x^{k−1})
    x^{k+1} = y^k − τ∇f(y^k)
end for
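A minimal NumPy sketch (my own illustration, under the same toy least-squares setup as before) of Algorithm 3:

```python
import numpy as np

# Nesterov accelerated gradient descent (Algorithm 3) on f(x) = 0.5*||A x - b||^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
Lf = np.linalg.norm(A.T @ A, 2)
grad_f = lambda x: A.T @ (A @ x - b)

tau = 1.0 / Lf
x_prev = x = np.zeros(5)
t = 0.0
for k in range(200):
    t_next = (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0
    y = x + (t - 1.0) / t_next * (x - x_prev)   # extrapolation (inertia)
    x_prev, x = x, y - tau * grad_f(y)          # gradient step at the extrapolated point
    t = t_next

print("gradient norm:", np.linalg.norm(grad_f(x)))
```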
²Nesterov '83, '04, Güler '92
Theorem (Acceleration)
Let τ ≤ 1/Lf and (x^k) the sequence generated by the accelerated gradient descent algorithm. Then:
f(x^k) − f(x^*) ≤ (2/(τ(k+1)²)) ‖x^0 − x^*‖².
Standard problem in imaging: composite structure

Variational regularisation of ill-posed inverse problems
Compute a reconstructed version of a given degraded image f by solving:
min_{u∈X} {F(u) := R(u) + λD(u, f)},  λ > 0,
with non-smooth regularisation R and smooth data fidelity D.

Examples in inverse problems/imaging:
- R(u) = TV, ICTV, TGV, ℓ1 (Rudin, Osher, Fatemi '92, Chambolle-Lions '97, Bredies '10)
- D(u, f) = ‖u − f‖²₂ (Gaussian noise, Rudin, Osher, Fatemi '92), D(u, f) = ‖u − f‖_{1,γ} (Laplace/impulse noise, Nikolova '04), D(u, f) = KL_γ(u, f) (Poisson noise, Burger, Sawatzky, Brune, Müller '09), ...
Composite optimisation

We want to solve:
min_{x∈X} {F(x) := f(x) + g(x)}
- f is smooth: differentiable, convex, with Lipschitz gradient:
  ‖∇f(y) − ∇f(x)‖ ≤ Lf ‖y − x‖, for any x, y ∈ X;
- g is convex, l.s.c., non-smooth, with easy proximal map.

Composite optimisation problem ⇒ Forward-Backward splitting³:
- forward gradient descent step in f;
- backward implicit gradient descent step in g.

Basic algorithm: take x0 ∈ X, fix τ > 0 and for k ≥ 0 do:
x^{k+1} = prox_{τg}(x^k − τ∇f(x^k)) =: T_τ x^k.

Rate of convergence: O(1/k).

³Combettes, Wajs '05, Nesterov '13, ...
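A minimal sketch (my own, not from the slides) of the forward-backward iteration for the classical LASSO choice f(x) = ½‖Ax − b‖², g(x) = λ‖x‖₁, whose prox is soft-thresholding:

```python
import numpy as np

# Forward-backward (proximal gradient / ISTA) for min_x 0.5*||A x - b||^2 + lam*||x||_1.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
lam = 0.5
Lf = np.linalg.norm(A.T @ A, 2)
tau = 1.0 / Lf

def soft_threshold(v, t):
    # prox of t*||.||_1: componentwise shrinkage towards zero
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.zeros(10)
for k in range(1000):
    x = soft_threshold(x - tau * A.T @ (A @ x - b), tau * lam)  # x^{k+1} = T_tau x^k

print("objective:", 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum())
```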
Accelerated forward-backward, FISTA: previous work

In Nesterov '04 and Beck, Teboulle '09, accelerated O(1/k²) convergence is achieved by extrapolation (as above). Further properties:
- convergence of iterates (Chambolle, Dossal '15);
- monotone variants (Beck, Teboulle '09, Tseng '08, Tao, Boley, Zhang '15);
- acceleration for inexact evaluation of operators (Villa, Salzo, Baldassarre, Verri '13, Bonettini, Prato, Rebegoldi '18).

Questions
- 1. Can we say more when f and/or g are strongly convex? Linear convergence?
- 2. Can we let the gradient step (proximal parameter) vary along the iterations AND preserve acceleration?
A strongly convex variant of FISTA (GFISTA)

Let µf, µg ≥ 0 and set µ = µf + µg. For τ > 0 define: q := τµ/(1 + τµg) ∈ [0, 1).

Algorithm 5 GFISTA⁴ (no backtracking)
Input: 0 < τ ≤ 1/Lf, x0 = x−1 ∈ X and let t0 ∈ R s.t. 0 ≤ t0 ≤ 1/√q.
for k ≥ 0 do
    t_{k+1} = (1 − q t_k² + √((1 − q t_k²)² + 4 t_k²)) / 2
    β_k = ((t_k − 1)/t_{k+1}) · (1 + τµg − t_{k+1}τµ)/(1 − τµf)
    y^k = x^k + β_k (x^k − x^{k−1})
    x^{k+1} = T_τ y^k = prox_{τg}(y^k − τ∇f(y^k))
end for

Remark: µ = q = 0 ⇒ standard FISTA.

⁴Chambolle, Pock '16
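A rough sketch (my own, not from the slides; function names and the toy usage are illustrative) of the GFISTA parameter updates and iteration for generic grad_f and prox_g callables:

```python
import numpy as np

def gfista(grad_f, prox_g, x0, tau, mu_f=0.0, mu_g=0.0, iters=500):
    # GFISTA without backtracking (cf. Algorithm 5); grad_f and prox_g are callables.
    mu = mu_f + mu_g
    q = tau * mu / (1.0 + tau * mu_g)           # q = tau*mu/(1 + tau*mu_g) in [0, 1)
    x_prev, x = x0.copy(), x0.copy()
    t = 0.0
    for _ in range(iters):
        t_next = (1.0 - q * t**2 + np.sqrt((1.0 - q * t**2) ** 2 + 4.0 * t**2)) / 2.0
        beta = (t - 1.0) / t_next * (1.0 + tau * mu_g - t_next * tau * mu) / (1.0 - tau * mu_f)
        y = x + beta * (x - x_prev)
        x_prev, x = x, prox_g(y - tau * grad_f(y), tau)   # x^{k+1} = T_tau(y^k)
        t = t_next
    return x

# Toy usage: mu_f = mu_g = 0 recovers plain FISTA on the LASSO problem above.
rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((30, 10)), rng.standard_normal(30), 0.5
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
x = gfista(lambda z: A.T @ (A @ z - b), lambda v, t: soft(v, t * lam),
           np.zeros(10), tau=1.0 / np.linalg.norm(A.T @ A, 2))
print("objective:", 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum())
```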
GFISTA: acceleration results

Theorem [Chambolle, Pock '16]
Let τ ≤ 1/Lf and 0 ≤ t0√q ≤ 1. Then the sequence (x^k) of GFISTA iterates satisfies
F(x^k) − F(x^*) ≤ r_k(q) [ t0² (F(x^0) − F(x^*)) + ((1 + τµg)/(2τ)) ‖x^0 − x^*‖² ],
where x^* is a minimiser of F and:
r_k(q) = min{ 4/(k+1)², (1 + √q)(1 − √q)^k, (1 − √q)^k / t0² }.

Note: for µ = q = 0, t0 = 0 this is the standard FISTA convergence result.

Question: what if an estimate of Lf is not available? Backtracking!
Backtracking idea

For plain 2D gradient descent:
[Figures: too small vs. too big step size τ; Armijo line-search]

Armijo rule: choose τk = 1/2^i where i ∈ N is the minimum integer s.t.
f(x^{k+1}) − f(x^k) ≤ β τk ∇f(x^k)ᵀ(x^{k+1} − x^k),  0 < β < 1
(a sketch follows below).

FISTA + backtracking:
- Armijo-type (Beck, Teboulle '09): τ_{k+1} ≤ τ_k for every k.
- Full backtracking (Scheinberg, Goldfarb, Bai '14): larger steps in “flat” areas!
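A minimal sketch (my own, not from the slides) of an Armijo-type halving line search, written in the standard sufficient-decrease form f(x + τd) ≤ f(x) + βτ⟨∇f(x), d⟩ with d = −∇f(x):

```python
import numpy as np

def armijo_step(f, grad_f, x, beta=0.5, tau0=1.0, max_halvings=50):
    # Halve tau until the sufficient-decrease condition
    # f(x + tau*d) <= f(x) + beta*tau*<grad_f(x), d>  holds, with d = -grad_f(x).
    g = grad_f(x)
    d = -g
    tau = tau0
    for _ in range(max_halvings):
        if f(x + tau * d) <= f(x) + beta * tau * (g @ d):
            break
        tau *= 0.5
    return x + tau * d, tau

# Toy usage on f(x) = 0.5*||A x - b||^2.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad_f = lambda x: A.T @ (A @ x - b)
x = np.zeros(5)
for k in range(100):
    x, tau = armijo_step(f, grad_f, x)
print("gradient norm:", np.linalg.norm(grad_f(x)))
```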
GFISTA with backtracking
Backtracking strategy and Bregman distance

General idea: check whether, for every x ∈ X,
F(x^{k+1}) + (1 + τµg)‖x − x^{k+1}‖²/(2τ) + ‖x^{k+1} − y^k‖²/(2τ) − D_f(x^{k+1}, y^k) ≤ F(x) + (1 − τµf)‖x − y^k‖²/(2τ),
where D_f(x^{k+1}, y^k) := f(x^{k+1}) − f(y^k) − ⟨∇f(y^k), x^{k+1} − y^k⟩ is the Bregman distance of f between x^{k+1} = T_τ y^k and y^k.

Constant steps
Such a condition is verified as long as
τ ≤ ‖x^{k+1} − y^k‖² / (2 D_f(x^{k+1}, y^k)) ∼ 1/L_k,   (*)
which is always true if τ ≤ 1/Lf with a known estimate Lf. Alternatively, one can check (*) along the iterations: this amounts to computing a local Lipschitz Constant Estimate (LCE).
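A minimal sketch (my own; helper names are hypothetical) of the check (*) via the Bregman distance, also returning the local Lipschitz estimate it implies:

```python
import numpy as np

def bregman_f(f, grad_f, x_new, y):
    # D_f(x_new, y) = f(x_new) - f(y) - <grad f(y), x_new - y>
    return f(x_new) - f(y) - grad_f(y) @ (x_new - y)

def backtracking_condition(f, grad_f, x_new, y, tau):
    # Condition (*): tau <= ||x_new - y||^2 / (2 D_f(x_new, y)), i.e. the local
    # Lipschitz estimate L_k = 2 D_f / ||x_new - y||^2 satisfies tau <= 1/L_k.
    D = bregman_f(f, grad_f, x_new, y)
    if D <= 0.0:                       # convexity gives D >= 0; any tau passes here
        return True, 0.0
    L_local = 2.0 * D / np.dot(x_new - y, x_new - y)
    return tau <= 1.0 / L_local, L_local
```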
GFISTA with backtracking: Algorithm

For any k ≥ 0 we let τ = τk and define:
τ'_k = τk/(1 − τk µf),   q_k = µτk/(1 + τk µg).

Update rule for extrapolation: for any k ≥ 0 set
t_{k+1} = ( 1 − (q_{k+1}/(1 − q_{k+1})) (τ'_k/τ'_{k+1}) t_k² + √( ((q_{k+1}/(1 − q_{k+1})) (τ'_k/τ'_{k+1}) t_k² − 1)² + 4 (τ'_k/τ'_{k+1}) t_k²/(1 − q_{k+1}) ) ) / 2
Algorithm 6 GFISTA with backtracking
Input: µf, µg, τ0 > 0, q0, ρ ∈ (0, 1), x0 = x−1 ∈ X and t0 ∈ R s.t. 0 ≤ t0 ≤ 1/√q0.
for k ≥ 0 do
    y^k = x^k + β_k (x^k − x^{k−1}). Set i_bt = 0;
    if too close to LCE then
        while backtracking condition (*) is not verified & i_bt ≤ i_max do
            keep/reduce step-size: τ_{k+1} = ρ^{i_bt} τ_k;
            compute x^{k+1} = T_{τ_{k+1}} y^k = prox_{τ_{k+1}g}(y^k − τ_{k+1}∇f(y^k))   (1)
            i_bt = i_bt + 1;
        end while
    else if far enough from LCE then
        increase step-size: τ_{k+1} = τ_k/ρ;
        compute x^{k+1} using (1);
    end if
    Update q_{k+1}, τ'_{k+1}, t_{k+1}.
    Set β_{k+1} = ((1 − q_{k+1} t_{k+1})/(1 − q_{k+1})) · ((t_k − 1)/t_{k+1}).
end for

Too close/too far: how tight is (*)? Reduce costs due to (1).
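A rough sketch (my own; the step increase/decrease logic is simplified relative to Algorithm 6, and the convex case µ = 0 is assumed) of an adaptive backtracking loop built on condition (*):

```python
import numpy as np

def fista_adaptive_backtracking(f, grad_f, prox_g, x0, tau0=1.0, rho=0.9,
                                iters=300, i_max=50):
    # Convex case (mu_f = mu_g = 0) of an adaptive backtracking FISTA loop:
    # try a larger step each iteration, then shrink by rho until (*) holds.
    x_prev, x = x0.copy(), x0.copy()
    t, tau, beta = 0.0, tau0, 0.0
    for _ in range(iters):
        y = x + beta * (x - x_prev)
        g_y, f_y = grad_f(y), f(y)
        tau_old, tau = tau, tau / rho                  # optimistic step increase
        for _ in range(i_max):
            x_new = prox_g(y - tau * g_y, tau)
            # condition (*): 2*tau*D_f(x_new, y) <= ||x_new - y||^2
            D = f(x_new) - f_y - g_y @ (x_new - y)
            if 2.0 * tau * D <= np.dot(x_new - y, x_new - y) + 1e-12:
                break
            tau *= rho                                 # shrink and retry
        # parameter updates (q = 0, so tau' = tau); matches the Scheinberg et al. rule
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * (tau_old / tau) * t**2)) / 2.0
        beta = (t - 1.0) / t_next
        x_prev, x = x, x_new
        t = t_next
    return x

# Toy usage on the LASSO problem from before.
rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((30, 10)), rng.standard_normal(30), 0.5
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
x = fista_adaptive_backtracking(lambda z: 0.5 * np.linalg.norm(A @ z - b) ** 2,
                                lambda z: A.T @ (A @ z - b),
                                lambda v, t: soft(v, t * lam), np.zeros(10))
print("objective:", 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum())
```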
Analogies/differences with FISTA-type algorithms: update rule

Consider again the update rule for t_{k+1} above.

No backtracking, convex case
If µ = q_k = 0 and τ_k = τ_{k+1} for any k ≥ 0, this is the FISTA update rule.

FISTA with adaptive backtracking
If µ = q_k = 0 for any k ≥ 0, the rule reduces to:
t_{k+1} = (1 + √(1 + 4 (τ_k/τ_{k+1}) t_k²)) / 2,
which is the same as the one proposed by Scheinberg et al. '14 for fast adaptive backtracking.
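A tiny numerical check (my own, not from the slides) that for constant steps and q = 0 the general update coincides with the classical FISTA rule:

```python
import numpy as np

def t_general(t, tau_ratio, q):
    # General update with a = (q/(1-q)) * (tau'_k/tau'_{k+1}) * t_k^2
    a = q / (1.0 - q) * tau_ratio * t**2
    return (1.0 - a + np.sqrt((a - 1.0) ** 2 + 4.0 * tau_ratio * t**2 / (1.0 - q))) / 2.0

def t_fista(t):
    return (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0

t = 1.0
for _ in range(5):
    print(t_general(t, tau_ratio=1.0, q=0.0), t_fista(t))  # identical columns
    t = t_fista(t)
```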
Accelerated convergence rates
Convergence rates: worst-case analysis

Define:
L_w := max{Lf/ρ, ρL0},   q_w := µ/(L_w + µg),
with q_w the worst-case inverse condition number.

Theorem
Let x0 ∈ X, τ0 > 0 and let (x^k) be the sequence produced by the GFISTA algorithm with backtracking. If t0 ≥ 0 and √q0 t0 ≤ 1, we have:
F(x^k) − F(x^*) ≤ r_k (L_w − µf) [ (τ0 t0²/(1 − µf τ0)) (F(x^0) − F(x^*)) + ½ ‖x^0 − x^*‖² ],
where the decay rate is defined as:
r_k := min{ 4/(k+1)², (1 − √q_w)^{k−1}, (1 − √q_w)^k / t0² }.

Disclaimer
Compare the recent work by Florea, Vorobyov (preprint, '17), where the same result is obtained via a generalised estimate sequence argument.
Monotone variants (M-GFISTA)

In order to make the objective values non-increasing⁵, we can simply set:
y^k = x^k + β_k [ (x^k − x^{k−1}) + (t_k/(t_k − 1)) (T_{τ_k}(y^{k−1}) − x^k) ].
This suggests an easy rule to select x^{k+1} at any iteration:
x^{k+1} = T_{τ_{k+1}}(y^k) if F(T_{τ_{k+1}}(y^k)) ≤ F(x^k), and x^{k+1} = x^k otherwise.
Same computations, same convergence rates.

⁵Beck, Teboulle '09, Tseng '08, Tao, Boley, Zhang '16
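A minimal sketch (my own; names are illustrative) of the monotone selection step, wrapping a generic forward-backward operator T_tau:

```python
def monotone_select(F, T_tau, y, x_old):
    # Accept the new forward-backward point only if it does not increase F;
    # otherwise keep the previous iterate (monotone / "M-" variant).
    x_cand = T_tau(y)
    return x_cand if F(x_cand) <= F(x_old) else x_old
```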
Imaging applications
Huber-TV Gaussian denoising

Given noisy u0 ∈ R^{m×n} corrupted by noise N(0, σ²), use the TV ROF⁶ model:
min_u λ‖Du‖_{2,1} + ½‖u − u0‖²_2,   ‖Du‖_{2,1} = Σ_{i,j=1}^{m,n} √((Du)²_{i,j,1} + (Du)²_{i,j,2}),
where Du is the finite-difference-discretised gradient and λ > 0.

Strongly convex variant: for ε ≪ 1, replace |·| by the C¹ Huber function
h_ε(t) := t²/(2ε) for |t| ≤ ε,   |t| − ε/2 for |t| > ε,
which gives the Huber-TV model
min_u λH_ε(u) + ½‖u − u0‖²_2,   H_ε(u) := Σ_{i,j=1}^{m,n} h_ε(√((Du)²_{i,j,1} + (Du)²_{i,j,2})).

⁶Rudin, Osher, Fatemi '92
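A short sketch (my own; the forward-difference discretisation is an illustrative choice) of the Huber function h_ε and the resulting smoothed TV energy H_ε:

```python
import numpy as np

def huber(t, eps):
    # h_eps(t) = t^2/(2 eps) for |t| <= eps, |t| - eps/2 otherwise (C^1, convex)
    t = np.abs(t)
    return np.where(t <= eps, t**2 / (2 * eps), t - eps / 2)

def H_eps(u, eps):
    # Huberised total variation of a 2D image u, forward differences, Neumann boundary
    dx = np.diff(u, axis=1, append=u[:, -1:])   # (Du)_{i,j,1}
    dy = np.diff(u, axis=0, append=u[-1:, :])   # (Du)_{i,j,2}
    return huber(np.sqrt(dx**2 + dy**2), eps).sum()

u = np.random.default_rng(0).random((8, 8))
print(H_eps(u, eps=0.01))
```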
Huber-TV Gaussian denoising: dual formulation

The Huber-TV dual problem reads:
min_p ½‖D*p − u0‖²_2 + (ε/(2λ))‖p‖²_2 + δ_{{‖·‖_{2,∞} ≤ λ}}(p),
with smooth part “f”(p) = ½‖D*p − u0‖²_2 and non-smooth part “g”(p) = (ε/(2λ))‖p‖²_2 + δ_{{‖·‖_{2,∞} ≤ λ}}(p), where D* is the discretised negative finite-difference divergence and:
δ_{{‖·‖_{2,∞} ≤ λ}}(p) = 0 if |p_{i,j}|₂ ≤ λ for any i, j, and +∞ otherwise.

Note:
- ∇f(p) = D(D*p − u0) ⇒ Lf ≤ 8;
- prox_{τg} is easy to compute and µg = µ = ε/λ.

Use monotone GFISTA with backtracking...
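A minimal sketch (my own; the discretisation details are illustrative) of ∇f and prox_{τg} for this dual problem: the prox combines the ℓ₂-shrinkage coming from the ε-term with a pointwise projection onto the ‖·‖_{2,∞} ≤ λ ball:

```python
import numpy as np

def D(u):
    # Forward-difference discrete gradient, shape (m, n, 2), Neumann boundary.
    g = np.zeros(u.shape + (2,))
    g[:, :-1, 0] = u[:, 1:] - u[:, :-1]
    g[:-1, :, 1] = u[1:, :] - u[:-1, :]
    return g

def D_adj(p):
    # Adjoint of D (negative discrete divergence), so that <D u, p> = <u, D_adj p>.
    u = np.zeros(p.shape[:2])
    u[:, :-1] -= p[:, :-1, 0]; u[:, 1:] += p[:, :-1, 0]
    u[:-1, :] -= p[:-1, :, 1]; u[1:, :] += p[:-1, :, 1]
    return u

def grad_f(p, u0):
    # Gradient of f(p) = 0.5*||D_adj(p) - u0||^2; Lipschitz constant <= ||D||^2 <= 8.
    return D(D_adj(p) - u0)

def prox_g(p, tau, eps, lam):
    # prox of g(p) = eps/(2*lam)*||p||^2 + indicator{ |p_ij|_2 <= lam }:
    # shrink by the quadratic term, then project pointwise onto the l2 ball of radius lam.
    q = p / (1.0 + tau * eps / lam)
    norms = np.maximum(np.sqrt((q**2).sum(axis=-1, keepdims=True)), 1e-12)
    return q * np.minimum(1.0, lam / norms)

# One forward-backward step on the dual, just to show the intended usage.
u0 = np.random.default_rng(0).random((8, 8))
p = np.zeros(u0.shape + (2,))
p = prox_g(p - 0.1 * grad_f(p, u0), 0.1, eps=0.01, lam=0.1)
```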
Huber-TV Gaussian denoising: results
Parameters: u0 ∈ R^{256×256}, σ² = 0.005, ε = 0.01, λ = 0.1. We have µ = 0.1.
[Figure: (a) Ground truth, (b) u0, (c) Reference u*]
[Figure 1: Convergence rates and Lipschitz constant estimate, underestimating L0 = 5. GFISTA parameters: ρ = 0.9, t0 = 1, p0 = Du0.]
[Figure: Convergence rates and Lipschitz constant estimate, overestimating L0 = 20. GFISTA parameters: ρ = 0.9, t0 = 1, p0 = Du0.]

Remark: O(1/k²) convergence of naive FISTA (µ = 0).
Strongly convex TV-Poisson denoising: primal formulation

Poisson noise is typical in astronomy and microscopy imaging... For ε ≪ 1, consider the ε-strongly convex TV-Poisson denoising model:
min_u λ‖Du‖_{2,1} + (ε/2)‖u‖²_2 + ˜KL(u0, u),
with non-smooth part “g”(u) = λ‖Du‖_{2,1} + (ε/2)‖u‖²_2 and smooth part “f”(u) = ˜KL(u0, u), a differentiable version of the Kullback-Leibler function⁷:
˜KL(u0, u) := Σ_{i,j=1}^{m,n} { u_{i,j} + b_{i,j} − u0_{i,j} + u0_{i,j} log( u0_{i,j}/(u_{i,j} + b_{i,j}) )   if u_{i,j} ≥ 0;
  (u0_{i,j}/(2b²_{i,j})) u²_{i,j} + (1 − u0_{i,j}/b_{i,j}) u_{i,j} + b_{i,j} − u0_{i,j} + u0_{i,j} log( u0_{i,j}/b_{i,j} )   else },
and b ∈ R^{m×n} is the background image. We can crudely estimate Lf = max_{i,j} u0_{i,j}/b²_{i,j}, for u0, b > 0. Moreover, prox_{τg} can be computed by solving a TV ROF model.

⁷Chambolle, Ehrhardt, Richtárik, Schönlieb '17
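A short sketch (my own; the formula is transcribed from the slide, so treat it as illustrative) of the smoothed KL data term and the crude Lipschitz estimate for its gradient:

```python
import numpy as np

def kl_smoothed(u, u0, b):
    # Differentiable KL: the usual expression for u >= 0, a quadratic extension for u < 0.
    out = np.empty_like(u, dtype=float)
    m = u >= 0
    out[m] = u[m] + b[m] - u0[m] + u0[m] * np.log(u0[m] / (u[m] + b[m]))
    out[~m] = (u0[~m] / (2 * b[~m]**2) * u[~m]**2 + (1 - u0[~m] / b[~m]) * u[~m]
               + b[~m] - u0[~m] + u0[~m] * np.log(u0[~m] / b[~m]))
    return out.sum()

def lipschitz_estimate(u0, b):
    # Crude bound L_f = max_{i,j} u0_{i,j} / b_{i,j}^2 (requires u0, b > 0)
    return (u0 / b**2).max()

rng = np.random.default_rng(0)
u0 = rng.random((16, 16)) + 0.1
b = np.full_like(u0, 0.1)
u = rng.standard_normal((16, 16))
print(kl_smoothed(u, u0, b), lipschitz_estimate(u0, b))
```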
Strongly convex TV-Poisson denoising: results
Parameters: u0 ∈ R^{256×256}, ε = µ = 0.15, λ = 0.2; b constant, Lf ≤ 45.
[Figure: (a) Ground truth, (b) u0, (c) Reference u*]
[Figure 2: Convergence rates and Lipschitz constant estimate, overestimating L0 = 60. GFISTA parameters: ρ = 0.8, t0 = 1, initialisation u^0 = u0. Relative objective: (F(u^k) − F(u^*))/(F(u^0) − F(u^*)).]
[Figure: Monotone decay of the objective with/without backtracking.]
Conclusions & outlook
Conclusions & outlook

Take-home messages:
- If µf, µg > 0, linear convergence can be shown for GFISTA;
- adaptive backtracking provides a local estimate Lk of the Lipschitz constant along the iterations;
- GFISTA with backtracking can be easily implemented and used for imaging applications!

Outlook:
- Estimates of µf and µg? Restarting! (O'Donoghue, Candès, 2012);
- milder (non-Lipschitz) differentiability assumptions (Salzo '17)?
Main references

- L. Calatroni, A. Chambolle, Backtracking strategies for accelerated descent methods with smooth composite objectives, arXiv:1709.09004, 2017.
- K. Scheinberg, D. Goldfarb, X. Bai, Fast first-order methods for composite convex optimization with backtracking, Foundations of Computational Mathematics 14(3), 2014.
- M. I. Florea, S. Vorobyov, A generalized accelerated composite gradient method: uniting Nesterov's fast gradient method and FISTA, arXiv:1705.10266, 2017.
- A. Chambolle, T. Pock, An introduction to continuous optimization for imaging, Acta Numerica 25, 2016.