SLIDE 1

Spectral properties of steplength selections in gradient methods: from unconstrained to constrained optimization

  • L. Zanni

Department of Physics, Informatics and Mathematics, University of Modena and Reggio Emilia, Italy

Variational Methods and Optimization in Imaging

IHP - Paris, 4 - 8 February 2019

Joint work with:

  • S. Crisci, V. Ruggiero, University of Ferrara, Italy
  • F. Porta, University of Modena and Reggio Emilia, Italy
SLIDE 2

Outline

1. Gradient methods for unconstrained problems
   - spectral properties of steplength selections
   - design selection rules by exploiting spectral properties
   - from the quadratic case to general unconstrained problems
2. Gradient projection methods for box-constrained problems
   - spectral properties of steplengths in the quadratic case
   - new steplength rules taking into account the constraints
3. Scaled gradient projection methods
   - define the diagonal scaling
   - the steplengths in variable metric approaches
   - practical behaviour in imaging
4. Conclusions

SLIDE 3

Motivation for the steplength analysis

Constrained optimization problems

min_{x∈Ω} f(x)    (1)

f : ℝᴺ → ℝ continuously differentiable function
Ω ⊂ ℝᴺ nonempty closed convex set defined by simple constraints

Gradient Projection (GP) methods for min_{x∈Ω} f(x):

x(k+1) = x(k) + ϑk d(k),   d(k) = PΩ(x(k) − αk∇f(x(k))) − x(k)

with αk > 0, ϑk ∈ (0, 1], PΩ(x) = argmin_{z∈Ω} ‖z − x‖.

Usually the updating rules for the steplength αk are those exploited in the unconstrained case: is this a suitable choice?
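To fix ideas, here is a minimal sketch of one GP iteration in the case where Ω is a box, so that PΩ reduces to componentwise clipping; the function names and the Python/NumPy setting are illustrative, not part of the original method description.

```python
import numpy as np

def project_box(x, lo, hi):
    # Euclidean projection onto Omega = {lo <= x <= hi}: componentwise clipping
    return np.clip(x, lo, hi)

def gp_step(x, grad_fx, alpha, theta, lo, hi):
    # d(k) = P_Omega(x(k) - alpha_k * grad f(x(k))) - x(k)
    d = project_box(x - alpha * grad_fx, lo, hi) - x
    # x(k+1) = x(k) + theta_k * d(k), with theta_k in (0, 1]
    return x + theta * d
```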

SLIDE 4

Spectral analysis of steplength selections

➤ The unconstrained case
➤ The box-constrained case
➤ The Scaled Gradient Projection methods

SLIDE 5

Steplength selection: the unconstrained case

The recipe exploited by state-of-the-art selection rules:
- define steplengths by trying to capture, in an inexpensive way, some second-order information
- design selection rules in the strictly convex quadratic case
  f(x) = (1/2)xᵀAx − bᵀx, A symmetric positive definite,
  where second-order information ↔ spectral properties of A
- design selection rules that generalize, in an inexpensive way, to non-quadratic cases: ∇²f(x(k)) depends on the iterations, but ∇²f(x(k)) → ∇²f(x∗)

SLIDE 6

A popular example: the Barzilai-Borwein (BB) selection rules

Consider the gradient method for the problem min f(x):

x(k+1) = x(k) − αk∇f(x(k)),   k = 0, 1, . . .

Suggestion [Barzilai-Borwein, IMA J. Num. Anal. 1988]: force the matrix (αkI)⁻¹ to approximate the Hessian ∇²f(x(k)) by imposing quasi-Newton properties:

αk^BB1 = argmin_{α∈ℝ} ‖(αI)⁻¹ s(k−1) − z(k−1)‖ = (s(k−1)ᵀ s(k−1)) / (s(k−1)ᵀ z(k−1))

αk^BB2 = argmin_{α∈ℝ} ‖s(k−1) − (αI) z(k−1)‖ = (s(k−1)ᵀ z(k−1)) / (z(k−1)ᵀ z(k−1))

where s(k−1) = x(k) − x(k−1) and z(k−1) = ∇f(x(k)) − ∇f(x(k−1)).
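A minimal sketch of the two rules in NumPy, directly from the formulas above (the safeguards against a nonpositive denominator, which the later slides handle via [αmin, αmax], are omitted here):

```python
import numpy as np

def bb_steplengths(x_new, x_old, g_new, g_old):
    """BB1 and BB2 steplengths from the last two iterates and gradients."""
    s = x_new - x_old                # s(k-1)
    z = g_new - g_old                # z(k-1)
    bb1 = (s @ s) / (s @ z)          # s's / s'z
    bb2 = (s @ z) / (z @ z)          # s'z / z'z
    return bb1, bb2
```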

SLIDE 7

Spectral properties of the BB steplength rules

Consider a gradient method for the quadratic unconstrained case:

min f(x) ≡ (1/2)xᵀAx − bᵀx,   A = diag(λ1, . . . , λN), 0 < λ1 < · · · < λN
x(k+1) = x(k) − αk g(k),   g(k) = ∇f(x(k)),   k = 0, 1, . . .

The gradient components then evolve as

gi(k+1) = (1 − αk λi) gi(k),   i = 1, . . . , N

- αk = 1/λi ⇒ gi(k+1) = 0 ⇒ gi(k+j) = 0, j = 2, 3, . . .
- αk+i−1 = 1/λi, i = 1, . . . , N ⇒ g(k+N) = 0 (finite termination)

Hence αk must aim at approximating the inverses of the eigenvalues of A.
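The finite-termination property is easy to verify numerically; a small sketch with illustrative data (the eigenvalues and b below are arbitrary, not the toy problem of the next slide):

```python
import numpy as np

# On A = diag(lam), taking alpha_k = 1/lam_{k+1} annihilates one gradient
# component per iteration, so the gradient vanishes after N steps.
lam = np.array([1.0, 10.0, 100.0])   # eigenvalues of A (illustrative)
b = np.array([1.0, -2.0, 3.0])
x = np.zeros(3)
g = lam * x - b                      # gradient of (1/2)x'Ax - b'x
for lam_i in lam:
    x = x - (1.0 / lam_i) * g
    g = lam * x - b
print(np.linalg.norm(g))             # ~0 up to rounding
```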

SLIDE 8

BB rules in the quadratic case

1/λN ≤ αk^BB2 = (g(k−1)ᵀA g(k−1)) / (g(k−1)ᵀA² g(k−1)) ≤ αk^BB1 = (g(k−1)ᵀ g(k−1)) / (g(k−1)ᵀA g(k−1)) ≤ 1/λ1

Example

f(x) = (1/2)xᵀAx − bᵀx,  A = diag(λ1, . . . , λ10), λi = 111i − 110
b random vector, bi ∈ [−10, 10]
stopping rule: ‖g(k)‖ ≤ 10⁻⁸ ‖g(0)‖
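A sketch reproducing this toy problem with a plain BB1 gradient iteration (the random seed and the first steplength are arbitrary choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 111.0 * np.arange(1, 11) - 110.0   # lambda_i = 111*i - 110, i = 1..10
b = rng.uniform(-10.0, 10.0, size=10)    # b_i in [-10, 10]

grad = lambda x: lam * x - b             # gradient of (1/2)x'Ax - b'x, A = diag(lam)
x_old = np.zeros(10)
g_old = grad(x_old)
g0 = np.linalg.norm(g_old)
x = x_old - (1.0 / lam[-1]) * g_old      # first step with alpha_0 = 1/lambda_max
k = 1
while np.linalg.norm(grad(x)) > 1e-8 * g0:
    g = grad(x)
    s, z = x - x_old, g - g_old          # BB1 = s's / s'z
    x_old, g_old = x, g
    x = x - ((s @ s) / (s @ z)) * g
    k += 1
print("iterations:", k)
```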

SLIDE 9

Quadratic case: exploiting spectral properties

In the quadratic case (A = diag(λ1, . . . , λN), 0 < λ1 < · · · < λN) we have

gj(k+1) = (1 − αk λj) gj(k),   j = 1, . . . , N

so that, if αk ≈ 1/λi,
- |gi(k+1)| ≪ |gi(k)|  (very useful)
- |gj(k+1)| < |gj(k)| if j < i  (useful)
- |gj(k+1)| > |gj(k)| if j > i and λj > 2λi  (dangerous)

Moreover, αk^BB2 / αk^BB1 = cos²(g(k−1), A g(k−1)).

Idea for improving the BB rules:
- force a sequence of small αk^BB2 to reduce |gi| for large i, leading to gradients in which these components are not dominant
- after a sequence of small αk, if αk^BB2 / αk^BB1 ≈ 1, exploit αk^BB1 = (gᵀg) / (gᵀAg), aiming at obtaining αk^BB1 ≈ 1/λi for small i

SLIDE 10

Practical implementations of this idea: ABB and ABBmin rules

Alternate Barzilai-Borwein selection rule [Zhou-Gao-Dai, COAP (2006)]:

αk^ABB =
  αk^BB2   if αk^BB2 / αk^BB1 < τ, with τ ∈ (0, 1)
  αk^BB1   otherwise

ABBmin rule [Frassoldati-Zanghirati-Zanni, JIMO (2008)]:

αk^ABBmin =
  min{ αj^BB2 | j = max{1, k − Mα}, . . . , k }   if αk^BB2 / αk^BB1 < τ
  αk^BB1   otherwise

where Mα > 0 is a parameter.
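A minimal sketch of the two selection rules, operating on the histories of BB1/BB2 values (the default values of tau and m_alpha below are illustrative placeholders, not the tuned values from the cited papers):

```python
def abb(bb1, bb2, tau=0.5):
    """Alternate BB rule: BB2 when the BB2/BB1 ratio is small, else BB1."""
    return bb2 if bb2 / bb1 < tau else bb1

def abbmin(bb1_hist, bb2_hist, tau=0.5, m_alpha=5):
    """ABBmin rule: the smallest of the last m_alpha BB2 values, else BB1."""
    bb1, bb2 = bb1_hist[-1], bb2_hist[-1]
    if bb2 / bb1 < tau:
        return min(bb2_hist[-m_alpha:])
    return bb1
```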

SLIDE 11

ABB and ABBmin rules on the previous toy problem

[Figure: steplength sequences αk generated by ABB and ABBmin on the toy problem.]

Compared methods:
- Cauchy Steepest Descent (CSD): αk = argmin_{α>0} f(x(k) − α g(k))
- BB1: αk = αk^BB1
- BB2: αk = αk^BB2
- ABB: alternation
- ABBmin: modified alternation

[Figure: relative error ‖x(k) − x∗‖/‖x∗‖ versus iterations for CSD, BB1, BB2, ABB, ABBmin.]

SLIDE 12

Similar behaviour on randomly generated test problems

Quadratic test problems: N = 1000
- λ1 = 1, λN = 10⁴, with λi, i = 2, . . . , N − 1, log-spaced
- λ̲ = 1, λ̄ = 10³, λi = λ̲ + (λ̄ − λ̲) si, with si ∈ (0, 0.2) for i = 1, . . . , N/2 and si ∈ (0.8, 1) for i = N/2 + 1, . . . , N

[Di Serafino-Ruggiero-Toraldo-Z., AMC 2018]

SLIDE 13

Other efficient steplength rules based on spectral properties

[Pronzato-Zhigljavsky, Comput. Optim. Appl. 50 (2011)]
[Fletcher, Math. Program. Ser. A 135 (2012)]
[Pronzato-Zhigljavsky-Bukina, Acta Appl. Math. 127 (2013)]
[De Asmundis-Di Serafino-Riccio-Toraldo, IMA J. Numer. Anal. 33 (2013)]
[De Asmundis-Di Serafino-Hager-Toraldo-Zhang, Comput. Optim. Appl. 59 (2014)]
[Gonzaga-Schneider, Comput. Optim. Appl. 63 (2016)]
[Gonzaga, Math. Program. Ser. A 160 (2016)]

These rules:
- aim at breaking the well-known cycling behaviour of the Steepest Descent method
- share an R-linear convergence rate in the quadratic case
- do not all generalize easily to non-quadratic problems (BB-based rules have this crucial property)

SLIDE 14

General unconstrained problems: minx∈RN f(x)

Gradient methods with nonmonotone linesearch:

Init.: x(0) ∈ ℝᴺ, 0 < αmin ≤ αmax, α0 ∈ [αmin, αmax], δ, σ ∈ (0, 1), M ∈ ℕ
for k = 0, 1, . . .
    νk = αk;  fref = max{f(x(k−j)), 0 ≤ j ≤ min(k, M)}
    while f(x(k) − νk g(k)) > fref − σ νk g(k)ᵀg(k)   (line search)
        νk = δ νk
    end
    x(k+1) = x(k) − νk g(k)
    define a tentative steplength αk+1 ∈ [αmin, αmax]
end

➤ Tentative steplength: exploit effective steplength selections designed for the quadratic case and generalizable in an inexpensive way (a sketch follows below).
➤ R-linear convergence of {f(x(k))} when f is strongly convex with Lipschitz-continuous gradient ([Dai, JOTA 2002], [Dai-Liao, IMA J. Num. Anal. 2002]).
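A runnable sketch of this scheme, using BB1 as the tentative steplength; all default parameter values are illustrative:

```python
import numpy as np

def nonmonotone_gradient(f, grad, x0, alpha0=1.0, alpha_min=1e-10,
                         alpha_max=1e10, delta=0.5, sigma=1e-4, M=10,
                         max_iter=500, tol=1e-8):
    """Gradient method with nonmonotone linesearch; tentative steplength = BB1."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    f_hist = [f(x)]
    alpha = alpha0
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        f_ref = max(f_hist[-(M + 1):])       # max over the last min(k, M)+1 values
        nu = alpha
        while f(x - nu * g) > f_ref - sigma * nu * (g @ g):   # line search
            nu *= delta
        x_new = x - nu * g
        g_new = grad(x_new)
        s, z = x_new - x, g_new - g
        sz = s @ z
        # tentative BB1 steplength, safeguarded in [alpha_min, alpha_max]
        alpha = np.clip((s @ s) / sz, alpha_min, alpha_max) if sz > 0 else alpha_max
        x, g = x_new, g_new
        f_hist.append(f(x))
    return x
```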

SLIDE 15

The standard BB rules can be improved

Trigonometric test problems: n = 50
f(x) = ‖b − (Av(x) + Bu(x))‖², v(x) = (sin(x1), . . . , sin(xn))ᵀ, u(x) = (cos(x1), . . . , cos(xn))ᵀ,
A, B n × n random matrices with integer entries in (−100, 100)

[Figure: convergence history on the trigonometric test problems.]

Convex2 test problems: N = 100
f(x) = Σᵢ₌₁ᴺ (i/10)(e^(xi) − xi)

[Figure: convergence history on the Convex2 test problems.]

SLIDE 16

The steplengths mimic the behaviour in the quadratic case

Convex2 test problems: N = 100
- Green squares: 20 eigenvalues of ∇²f(x∗) with linearly spaced indices
- Blue dots: 20 eigenvalues of ∇²f(x(k)) with linearly spaced indices
- Red cross: 1/νk (black circles mean linesearch reductions)

[Figure: 1/νk versus iterations for the ABBmin rule, compared against the Hessian spectra.]

When the Hessian eigenvalues stabilize, the steplengths exhibit the spectral properties observed in the quadratic case, but with respect to the spectrum of the current Hessian.

SLIDE 17

Spectral analysis of steplength selections

➤ The unconstrained case
➤ The box-constrained case
➤ The Scaled Gradient Projection methods

SLIDE 18

What about the constrained case?

Constrained minimization problems: minx∈Ω f(x)

x(k+1) = x(k) + ϑk d(k),   d(k) = PΩ(x(k) − αk∇f(x(k))) − x(k)
PΩ(x) = argmin_{z∈Ω} ‖z − x‖,   Ω ⊂ ℝᴺ

More difficult analysis:
➤ the goal is no longer the annihilation of the gradient
➤ the gradient projection step makes the relation between successive gradients more complicated

Motivation for generalizing the analysis:
➤ BB rules are considered very effective also in the constrained case and have been successfully exploited in many interesting applications
➤ ABB strategies seem still to outperform the standard BB rules

SLIDE 19

The simplest case: box-constrained quadratic problems

min_{ℓ≤x≤u} f(x) ≡ (1/2)xᵀAx − bᵀx,   A sym. pos. def., ℓ, u ∈ ℝⁿ

Gradient Projection (GP) method

Init.: ℓ ≤ x(0) ≤ u, 0 < αmin ≤ αmax, α0 ∈ [αmin, αmax], δ, σ ∈ (0, 1), M ∈ ℕ
for k = 0, 1, . . .
    d(k) = P_{ℓ≤x≤u}(x(k) − αk g(x(k))) − x(k)   (gradient projection step)
    ϑk = 1;  fref = max{f(x(k−j)), 0 ≤ j ≤ min(k, M)}
    while f(x(k) + ϑk d(k)) > fref + σ ϑk g(k)ᵀd(k)   (line search)
        ϑk = δ ϑk
    end
    x(k+1) = x(k) + ϑk d(k)
    define the steplength αk+1 ∈ [αmin, αmax]   (steplength updating rule)
end
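A runnable sketch of this GP scheme in NumPy, with plain BB1 as a placeholder steplength update; the default parameters and the zero starting point are illustrative:

```python
import numpy as np

def gp_box_qp(A, b, lo, hi, alpha0=1.0, delta=0.5, sigma=1e-4, M=10,
              alpha_min=1e-10, alpha_max=1e10, max_iter=1000):
    """GP method for min (1/2)x'Ax - b'x subject to lo <= x <= hi."""
    f = lambda x: 0.5 * x @ A @ x - b @ x
    x = np.clip(np.zeros_like(b), lo, hi)
    g = A @ x - b
    alpha = alpha0
    f_hist = [f(x)]
    for _ in range(max_iter):
        d = np.clip(x - alpha * g, lo, hi) - x        # gradient projection step
        if np.linalg.norm(d) < 1e-10:
            break
        theta = 1.0
        f_ref = max(f_hist[-(M + 1):])
        while f(x + theta * d) > f_ref + sigma * theta * (g @ d):  # line search
            theta *= delta
        x_new = x + theta * d
        g_new = A @ x_new - b
        s, z = x_new - x, g_new - g
        alpha = np.clip((s @ s) / (s @ z), alpha_min, alpha_max)   # BB1 update
        x, g = x_new, g_new
        f_hist.append(f(x))
    return x
```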

SLIDE 20

Box-constrained QP: min_{ℓ≤x≤u} f(x) ≡ (1/2)xᵀAx − bᵀx

The solution x∗ satisfies

g(x∗)i = 0 for ℓi < x∗i < ui   (i ∈ I∗)
g(x∗)i ≤ 0 for x∗i = ui        (i ∈ J∗)
g(x∗)i ≥ 0 for x∗i = ℓi        (i ∈ J∗)

Define the sets of indices

Jk−1 = {i : (x(k−1)i = ℓi ∧ g(k−1)i ≥ 0) ∨ (x(k−1)i = ui ∧ g(k−1)i ≤ 0)}
Ik−1 = {1, . . . , n} \ Jk−1

Possible idea: since g(x∗)i = 0 for i ∈ I∗, exploit the steplength rules to accelerate the reduction of |g(k−1)i|, i ∈ Ik−1, as done in the unconstrained case.

SLIDE 21

Are the BB steplength rules useful to this purpose?

αBB1

k

= s(k−1)T s(k−1) s(k−1)T z(k−1) , αBB2

k

= s(k−1)T z(k−1) z(k−1)T z(k−1) ,

s(k−1) = (x(k) − x(k−1)) z(k−1) = (g(k)) − g(k−1))

What about s(k−1)?

(observe that x(k)

j

= x(k−1)

j

, for j ∈ Jk−1) s(k−1)

Jk−1 = 0 ⇒

                   αBB1

k

= s(k−1)

Ik−1 T s(k−1) Ik−1

s(k−1)

Ik−1 T z(k−1) Ik−1

= argmin

α∈R

(αkI)−1s(k−1)

Ik−1 − z(k−1) Ik−1

αBB2

k

= s(k−1)

Ik−1 T z(k−1) Ik−1

z(k−1)

Ik−1 T z(k−1) Ik−1 + z(k−1) Jk−1 T z(k−1) Jk−1

Only the αBB1

k

rule is able to capture the spectral properties of the Reduced Hessian AIk−1,Ik−1 at the current iteration: λmin(AIk−1,Ik−1) ≤ 1/αBB1

k

≤ λmax(AIk−1,Ik−1).

SLIDE 22

Box-constrained QP: different behaviour of αk^BB1 and αk^BB2

TP1: n = 1000, 500 active constraints, λi(A_{Ik−1,Ik−1}) ∈ [10, 10³] log-spaced

[Figure: steplength sequences of BB1 and BB2 on TP1.]

TP2: λi = (M + m)/2 + ((M − m)/2) cos(π(i − 1)/(n − 1)), m = 1, M = 10³

[Figure: steplength sequences of BB1 and BB2 on TP2.]

SLIDE 23

New proposals [Crisci-Ruggiero-Zanni, AMC 2019]

αk^BB2 = (s(k−1)_{Ik−1}ᵀ z(k−1)_{Ik−1}) / (z(k−1)_{Ik−1}ᵀ z(k−1)_{Ik−1} + z(k−1)_{Jk−1}ᵀ z(k−1)_{Jk−1})
→ αk^MBB2 = (s(k−1)_{Ik−1}ᵀ z(k−1)_{Ik−1}) / (z(k−1)_{Ik−1}ᵀ z(k−1)_{Ik−1})

Modified BB2 steplength rule:

λmin(A_{Ik−1,Ik−1}) ≤ 1/αk^BB1 ≤ 1/αk^MBB2 ≤ λmax(A_{Ik−1,Ik−1}).
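A small sketch of the MBB2 computation, restricting s(k−1) and z(k−1) to the free variables; the box bounds lo, hi and all names are illustrative:

```python
import numpy as np

def mbb2(x_prev, g_prev, x, g, lo, hi):
    """Modified BB2 steplength: restrict s(k-1), z(k-1) to I_{k-1}."""
    # J_{k-1}: components at a bound whose gradient sign keeps them there
    at_lo = (x_prev == lo) & (g_prev >= 0)
    at_hi = (x_prev == hi) & (g_prev <= 0)
    free = ~(at_lo | at_hi)          # I_{k-1}
    s = (x - x_prev)[free]
    z = (g - g_prev)[free]
    return (s @ z) / (z @ z)
```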

TP2: λi = (M + m)/2 + ((M − m)/2) cos(π(i − 1)/(n − 1)), m = 1, M = 10³

[Figure: steplength sequences of MBB2 on TP2.]

SLIDE 24

Modified BB2 can be exploited within ABB strategies

Modified ABBmin rule

αk^MABBmin =
  min{ αj^MBB2 | j = max{1, k − Mα}, . . . , k }   if αk^MBB2 / αk^BB1 < τ
  αk^BB1   otherwise

where Mα > 0 is a parameter.

TP3: n = 1000, 500 active constraints, λi(A) ∈ [1, 10³] log-spaced

[Figure: steplength sequences of ABBmin and MABBmin on TP3.]

SLIDE 25

Performance profile: box-constrained QP test problems

Test problems [Moré-Toraldo, Num. Math. 1989]: 108 box-constrained QPs, 15000 ≤ n ≤ 25000, K(A) = 10⁴, 10⁵, 10⁶, nact = 0.1n, 0.5n, 0.9n

Methods: GP method with nonmonotone linesearch and different steplength rules (BB2, MBB2, ABBmin, MABBmin, ...)

Stopping rule: ‖ϕ(x(k))‖₂ ≤ 10⁻⁵ ‖∇f(x(0))‖₂, where

ϕ(x)i =
  (∇f(x))i           for ℓi < xi < ui
  min{0, (∇f(x))i}   for xi = ℓi
  max{0, (∇f(x))i}   for xi = ui

SLIDE 26

Performance profile (x ← Tsolver/Tmin, y ← % of problems solved within x·Tmin)

➤ The new MBB2 selection outperforms the standard BB2 rule

[Figure: performance profiles of BB2 and MBB2.]

➤ Alternated strategies are preferable also in the constrained case

[Figure: performance profiles of the alternated strategies.]

SLIDE 27

General box-constrained problems: minℓ≤x≤u f(x)

Test Problems [Facchinei-Judice-Soares, ACM TOMS 1997]:

min_{ℓ≤x≤u} f(x) ≡ g(x) + Σ_{i∈L} hi(xi) − Σ_{i∈U} hi(xi)
L = {i | x∗i = ℓi},  U = {i | x∗i = ui}

g(x) ∈ { Trigonometric, Chained Rosenbrock, Laplace2 }

hi(xi) ∈ { βi(xi − x∗i),
           αi(xi − x∗i)³ + βi(xi − x∗i),
           αi(xi − x∗i)^(7/3) + βi(xi − x∗i) }

Problem size: n = 500; total number of problems: 108

[Figure: performance profiles on the general box-constrained test problems.]

SLIDE 28

Alternated BB are preferable

[Figure: performance profile on the general box-constrained test problems.]

Alternated BB in practical applications:
- Machine learning: decomposition techniques for training Support Vector Machines [Serafini-Zanghirati-Z., Par. Comput. 2003, OMS 2005, JMLR 2006]
- Imaging problems in Astronomy, Microscopy, Computed Tomography [Bonettini-Zanella-Z., Inv. Prob. 2009], [Loris et al., ACHA 2009], [Ruggiero et al., JGO 2010], [Prato et al., A&A 2012], [Zanella et al., Sci. Rep. 2013], [Piccolomini et al., COAP 2018]

SLIDE 29

Spectral analysis of steplength selections

➤ The unconstrained case
➤ The box-constrained case
➤ The Scaled Gradient Projection methods

SLIDE 30

Basic variable metric approaches

In many imaging applications the behaviour of gradient projection schemes is largely improved by exploiting variable metric approaches.

Scaled Gradient Projection (SGP) methods for min_{x∈Ω} f(x):

x(k+1) = x(k) + ϑk d(k),  ϑk ∈ (0, 1],
d(k) = P_{Ω,Dk⁻¹}(x(k) − αk Dk ∇f(x(k))) − x(k),  αk > 0,
P_{Ω,Dk⁻¹}(x) = argmin_{z∈Ω} ‖z − x‖_{Dk⁻¹},  Dk sym. pos. def. matrix,
‖z − x‖_{Dk⁻¹} = √((z − x)ᵀ Dk⁻¹ (z − x))

How can the matrix Dk be chosen? How can the steplength rules for αk be modified to take the scaling matrix into account?
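When Dk is diagonal and Ω is a box, the projection in the Dk⁻¹-norm decouples componentwise and therefore coincides with ordinary clipping, so the SGP step stays inexpensive; a minimal sketch with illustrative names:

```python
import numpy as np

def sgp_step(x, grad_fx, alpha, theta, d_diag, lo, hi):
    """One SGP step with diagonal scaling D_k = diag(d_diag), box constraints.
    For a box, the D^{-1}-norm projection separates over the components, so
    it reduces to plain clipping."""
    y = x - alpha * d_diag * grad_fx      # scaled gradient step
    d = np.clip(y, lo, hi) - x            # d(k)
    return x + theta * d                  # theta in (0, 1]
```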

SLIDE 31

SGP methods: convergence analysis

x(k+1) = x(k) + ϑk d(k),  ϑk ∈ (0, 1],  d(k) = P_{Ω,Dk⁻¹}(x(k) − αk Dk ∇f(x(k))) − x(k),  αk > 0

Analysis of {x(k)} and {f(x(k))} [Bonettini-Prato, Inv. Prob. 2015]:
- Dk with eigenvalues in [1/µ, µ], µ ≥ 1
- αk ∈ [αmin, αmax], 0 < αmin ≤ αmax
- ϑk such that f(x(k+1)) ≤ f(x(k)) + σ ϑk ∇f(x(k))ᵀ d(k)
⇒ if x(kl) → x∗ as l → ∞, then ∇f(x∗)ᵀ(x − x∗) ≥ 0 for all x ∈ Ω

If, in addition,
- f(x) is convex and the solution set X∗ is not empty
- µk² = 1 + γk, γk ≥ 0, Σ_{k=0}^∞ γk < ∞
- Dk is s.p.d. with eigenvalues in [1/µk, µk]
⇒ x(k) → x∗ ∈ X∗ as k → ∞

and, if ∇f is Lipschitz on Ω,
⇒ f(x(k)) − f∗ = O(1/k)

SLIDE 32

Variable metric updating: the choice of the matrix Dk

A standard choice:

Dk = diag(D1(k), D2(k), . . . , DN(k)),
Di(k) = min{ ρ, max{ 1/ρ, (∂²f(x(k))/∂xi²)⁻¹ } },  i = 1, . . . , N

Alternatively, define Dk by exploiting only first-order information. Consider the special non-negatively constrained case min_{x≥0} f(x) and the corresponding KKT conditions:

∇f(x) − ξ = 0,  x ≥ 0,  ξ ≥ 0,  xi ξi = 0, i = 1, . . . , N
⇔ x · ∇f(x) = 0,  x ≥ 0,  ∇f(x) ≥ 0,

where "·" denotes the component-wise product.

SLIDE 33

Variable metric updating: the choice of the matrix Dk

Split the gradient [Lantéri-Roche-Aime, Inv. Prob. (2002)]:

∇f(x) = V(x) − U(x),  V(x) > 0,  U(x) ≥ 0,

and use the splitting in the nonlinear equation x · ∇f(x) = 0:

x · V(x) = x · U(x) = x · (−∇f(x) + V(x)),
x = x − (x / V(x)) · ∇f(x) = x − Dx ∇f(x),  Dx = diag(x1/V1(x), . . . , xN/VN(x))

Iterative methods for x · ∇f(x) = 0 based on the scaled gradient direction:

x(k+1) = x(k) − Dk ∇f(x(k)),  Dk = diag(x1(k)/V1(x(k)), . . . , xN(k)/VN(x(k)))
SLIDE 34

Variable metric updating: the choice of the matrix Dk

The same suggestion arises from a Majorization-Minimization (MM) framework [Yang-Oja, IEEE Trans. Neural Net. (2011)].

➤ Consider a discrepancy function D(Hx, g), Hi,j ≥ 0, xi > 0, written as

D(Hx, g) = Σ_{d=1}^p Σ_{i=1}^n αd,i h((Hx)i, ζd),   h(σ, t) = (σ^t − 1)/t if t ≠ 0,  log(σ) if t = 0

➤ A surrogate function G(x, x̄) of D(Hx, g) at x̄ can be defined, up to an additive constant, in terms of the splitting

∇D(Hx̄, g) = V(x̄) − U(x̄),  V(x̄) > 0,  U(x̄) ≥ 0:

G(x, x̄) = Σ_{j=1}^n [ x̄j (V(x̄))j h(xj/x̄j, ζmax) − x̄j (U(x̄))j h(xj/x̄j, ζmin) ],

where ζmax = max_{d∈{1,...,p}} ζd, ζmin = min_{d∈{1,...,p}} ζd.

SLIDE 35

Variable metric updating: the choice of the matrix Dk

➤ Since ∂G(x, x̄)/∂xj = (V(x̄))j (xj/x̄j)^(ζmax−1) − (U(x̄))j (xj/x̄j)^(ζmin−1), the corresponding MM method (based on ∇G(x, x(k)) = 0) leads to

x(k+1) = argmin_{x≥0} G(x, x(k)) = x(k) · (U(x(k)) / V(x(k)))^(1/(ζmax−ζmin))

➤ In the special case of Least-Squares or Kullback-Leibler divergence, ζmax − ζmin = 1 and

x(k+1) = x(k) · (U(x(k)) / V(x(k))) = x(k) − (x(k) / V(x(k))) · ∇D(Hx(k), g)

Thus, this special scaled gradient step is a descent step for D(Hx(k), g).

SLIDE 36

Variable metric updating: the choice of the matrix Dk

Popular algorithms for imaging problems are based on this special scaling.

Iterative Space Reconstruction Algorithm (ISRA):

min_{x≥0} D(Hx, g) ≡ (1/2)‖Hx + bg − g‖²
x(k+1) = x(k) · (Hᵀg) / (Hᵀ(Hx(k) + bg)) = x(k) − (x(k) / (Hᵀ(Hx(k) + bg))) · ∇D(Hx(k), g),  x(0) > 0

Expectation Maximization (EM) or Richardson-Lucy (RL) algorithm:

min_{x≥0} D(Hx, g) ≡ Σ_{i=1}^n [ gi log(gi / (Hx + bg)i) + (Hx + bg)i − gi ]
x(k+1) = (x(k) / (Hᵀ1)) · Hᵀ(g / (Hx(k) + bg)) = x(k) − (x(k) / (Hᵀ1)) · ∇D(Hx(k), g),  x(0) > 0
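A compact sketch of the ISRA update above; all array names are illustrative, and H, g, bg are assumed nonnegative with x(0) > 0, so every iterate stays positive:

```python
import numpy as np

def isra(H, g, bg, x0, n_iter=100):
    """ISRA multiplicative updates for min_{x>=0} 0.5*||Hx + bg - g||^2."""
    x = np.array(x0, dtype=float)
    Hg = H.T @ g                          # U(x): constant numerator
    for _ in range(n_iter):
        x *= Hg / (H.T @ (H @ x + bg))    # x <- x * U(x) / V(x)
    return x
```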

SLIDE 37

Variable metric updating: the choice of the matrix Dk

The split gradient idea within scaled gradient projection schemes:

x(k+1) = x(k) + ϑk [ P_{Ω,Dk⁻¹}(x(k) − αk Dk ∇f(x(k))) − x(k) ]
Di(k) = min{ µk, max{ 1/µk, xi(k)/Vi(x(k)) } },  Vi(x(k)) > 0,  i = 1, . . . , N

A similar idea is used in [Hager-Mair-Zhang, Math. Program. (2009)]:

Di(k) = (αk xi(k)) / (xi(k) + αk (∇f(x(k)))i⁺),  i = 1, . . . , N,  (t)⁺ = max{0, t},

and works for more general constraints [Hager-Zhang, COAP (2014); Bonettini et al., SIAM J. Sci. Comp. 2015].

SLIDE 38

The steplengths in Scaled Gradient Methods

Consider the scaled gradient method x(k+1) = x(k) − αk Dk g(k) and recall the quadratic case min f(x) = (1/2)xᵀAx − bᵀx. Consider the preconditioned problem

f̃(y) = (1/2)yᵀ D^(1/2) A D^(1/2) y − bᵀ D^(1/2) y

and the gradient iteration y(k+1) = y(k) − αk g̃(k), g̃(k) = ∇f̃(y(k)). Letting y(k) = D^(−1/2) x(k), we have g̃(k) = D^(1/2) g(k) and

y(k+1) = D^(−1/2)(x(k) − αk D g(k)) = D^(−1/2) x(k+1),

so a gradient step on y(k) corresponds to a scaled gradient step on x(k).

➤ Exploit the BB rules defined for the preconditioned problem by using

u(k−1) = D^(−1/2)(x(k) − x(k−1)),  v(k−1) = D^(1/2)(g(k) − g(k−1))

SLIDE 39

The steplengths in Scaled Gradient Methods

Consider the scaled gradient method x(k+1) = x(k) − αk Dk g(k).

The BB rules with scaling: let s(k−1) = x(k) − x(k−1), z(k−1) = g(k) − g(k−1), u(k−1) = D^(−1/2) s(k−1), v(k−1) = D^(1/2) z(k−1), and define

αk^BB1 = (u(k−1)ᵀ u(k−1)) / (u(k−1)ᵀ v(k−1)) = (s(k−1)ᵀ Dk⁻¹ s(k−1)) / (s(k−1)ᵀ z(k−1))
αk^BB2 = (u(k−1)ᵀ v(k−1)) / (v(k−1)ᵀ v(k−1)) = (s(k−1)ᵀ z(k−1)) / (z(k−1)ᵀ Dk z(k−1))
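A sketch of the two scaled rules for a diagonal metric Dk = diag(d_diag); names are illustrative and safeguards are omitted:

```python
import numpy as np

def scaled_bb(s, z, d_diag):
    """Scaled BB rules for D_k = diag(d_diag):
    BB1 = s'D^{-1}s / s'z,  BB2 = s'z / z'Dz."""
    bb1 = (s @ (s / d_diag)) / (s @ z)
    bb2 = (s @ z) / (z @ (d_diag * z))
    return bb1, bb2
```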

SLIDE 40

The steplengths in Scaled Gradient Methods

Consider the scaled gradient method x(k+1) = x(k) − αk Dk g(k).

Another interpretation of the scaled BB rules: force the matrix (αk Dk)⁻¹ to approximate the Hessian ∇²f(x(k)) by imposing quasi-Newton properties in the variable metric:

αk^BB1 = (s(k−1)ᵀ Dk⁻¹ s(k−1)) / (s(k−1)ᵀ z(k−1)) = argmin_{α∈ℝ} ‖(αDk)⁻¹ s(k−1) − z(k−1)‖_{Dk}
αk^BB2 = (s(k−1)ᵀ z(k−1)) / (z(k−1)ᵀ Dk z(k−1)) = argmin_{α∈ℝ} ‖s(k−1) − (αDk) z(k−1)‖_{Dk⁻¹}

SLIDE 41

The steplengths in Scaled Gradient Projection Methods

On the basis of the previous remarks, in the case of box-constrained problems, instead of the standard BB2 rule

αk^BB2 = (s(k−1)ᵀ z(k−1)) / (z(k−1)ᵀ Dk z(k−1))

try to exploit

αk^MBB2 = (s(k−1)_{Ik−1}ᵀ z(k−1)_{Ik−1}) / (z(k−1)_{Ik−1}ᵀ (Dk)_{Ik−1,Ik−1} z(k−1)_{Ik−1})

where

Jk−1 = {i : (x(k−1)i = ℓi ∧ g(k−1)i ≥ 0) ∨ (x(k−1)i = ui ∧ g(k−1)i ≤ 0)}
Ik−1 = {1, . . . , n} \ Jk−1

[Crisci-Porta-Ruggiero-Zanni, (2019)]

SLIDE 42

Example: 3D image reconstruction from limited tomographic data

min_{x≥0} J(x) + β JR(x)

Least-squares divergence:

J(x) = (1/2)‖Ax − b‖²,  ∇J(x(k)) = AᵀAx(k) − Aᵀb

Edge-preserving regularizer:

JR(x) = Σ_{jx=1}^{Nx} Σ_{jy=1}^{Ny} Σ_{jz=1}^{Nz} √(δ²x_{jx,jy,jz} + δ²)
δ²x_{jx,jy,jz} = (x_{jx+1,jy,jz} − x_{jx,jy,jz})² + (x_{jx,jy+1,jz} − x_{jx,jy,jz})² + (x_{jx,jy,jz+1} − x_{jx,jy,jz})²
∇JR(x(k)) = V(k) − U(k),  V(k) > 0,  U(k) ≥ 0

Scaling matrix derived from the gradient splitting:

Dk = min{ µk, max{ 1/µk, diag(x(k) / (AᵀAx(k) + βV(k))) } },  µk = √(1 + M/(k+1)^2.1)

[Piccolomini-Coli-Morotti-Zanni, Comput. Optim. Appl. (2018)]

SLIDE 43

Simulations on the 3D Shepp Logan phantom

Test problem features:
- exact volume x∗: Shepp-Logan phantom with Nv = 61³ ≈ 226K voxels
- projection matrix A ∈ M_{Np×Nv} with Nθ = 19 → Np = 61² × Nθ ≈ 70K
  (http://www.imm.dtu.dk/~pcha/TVReg/)

Test platform: Matlab 2016a on an Intel Core i7-6700.

Compared methods:
➤ GP equipped with BB1, BB2, MBB2, MABB steplengths
➤ SGP equipped with BB1, MBB2, MABB steplengths
➤ FISTA and Scaled FISTA algorithms

SLIDE 44

3D CT image reconstruction: the steplength behaviour

➤ GP and SGP: the new rules within alternated strategies are preferable

SLIDE 45

3D CT image reconstruction: the reconstruction error

➤ SGP and SFISTA are preferable when a reconstruction is required in a few seconds.

[Figure: relative reconstruction error (RRE) and relative objective decrease (f(x(k)) − f∗)/f∗ versus iterations for GP-MABB, SGP-MABB, FISTA and SFISTA, after 5, 10 and 20 seconds.]

SLIDE 46

Scaled FISTA for minx∈Ω f(x)

f convex, ∇f Lipschitz-continuous on Ω, dom(f) ⊇ Ω, X∗ ≠ ∅

y(k) = P_{Ω,Dk⁻¹}(x(k) + βk(x(k) − x(k−1)))   (new extrapolation step)
x(k+1) = P_{Ω,Dk⁻¹}(y(k) − αk Dk ∇f(y(k)))

Convergence analysis [Bonettini-Porta-Ruggiero, SIAM J. Sci. Comput. 2016]:
- αk such that f(x(k+1)) ≤ f(y(k)) + ∇f(y(k))ᵀ(x(k+1) − y(k)) + (1/(2αk)) ‖x(k+1) − y(k)‖²_{Dk⁻¹}
- β0 = 0, βk = (k − 1)/(k + a), k = 1, . . . , with a ≥ 2
- µk² = 1 + γk, γk ≥ 0, Σ_{k=0}^∞ γk < ∞, Dk s.p.d. with eigenvalues in [1/µk, µk]

⇒ f(x(k)) − f∗ ≤ C / (k − 1 + a)²,  k = 1, 2, . . .
⇒ if a > 2, limk→∞ x(k) = x∗
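A minimal sketch of this scaled FISTA iteration for box constraints with a fixed diagonal metric (for a box, both Dk⁻¹-norm projections reduce to clipping); the fixed steplength alpha is assumed to satisfy the descent condition above, and all names are illustrative:

```python
import numpy as np

def scaled_fista(grad_f, x0, alpha, d_diag, lo, hi, a=2.0, n_iter=200):
    """Scaled FISTA sketch: diagonal metric D = diag(d_diag), box constraints."""
    x_old = x = np.asarray(x0, dtype=float)
    for k in range(1, n_iter + 1):
        beta = (k - 1) / (k + a)                        # beta_1 = 0
        y = np.clip(x + beta * (x - x_old), lo, hi)     # new extrapolation step
        x_old = x
        x = np.clip(y - alpha * d_diag * grad_f(y), lo, hi)
    return x
```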

SLIDE 47

Reconstructions (LS + TV)

➤ After 5 sec. and after 20 sec.:
[Figure: reconstructed slices for GP MABB, SGP MABB and SFISTA.]

SLIDE 48

Comparison with the true image

➤ After 5 sec. and after 20 sec.:
[Figure: true image compared with the SGP MABB and SFISTA reconstructions.]

SLIDE 49

Conclusions

➤ Spectral properties of steplength rules in gradient methods:
- useful for understanding the behaviour of standard rules
- useful for designing improved selection rules

➤ Analysis of steplength rules in box-constrained problems:
- suitable modifications of state-of-the-art BB rules are suggested

➤ Analysis of steplength rules in scaled gradient projection methods:
- ad hoc BB rules exploiting spectral properties and scaling matrices

Work in progress:
- more general constraints, e.g. Ω = {ℓ ≤ x ≤ u, aᵀx = b}; preliminary results confirm the importance of the spectral analysis
- possible extension to stochastic gradient approaches

References and software: www.oasis.unimore.it
