SLIDE 1

Spectral properties of steplength selections in gradient methods: from unconstrained to constrained optimization

  • L. Zanni

Department of Physics, Informatics and Mathematics, University of Modena and Reggio Emilia, Italy

Variational Methods and Optimization in Imaging

IHP - Paris, 4 - 8 February 2019

Joint work with:

  • S. Crisci, V. Ruggiero, University of Ferrara, Italy
  • F. Porta, University of Modena and Reggio Emilia, Italy
SLIDE 2

Outline

1. Gradient methods for unconstrained problems
   - spectral properties of steplength selections
   - design selection rules by exploiting spectral properties
   - from the quadratic case to general unconstrained problems
2. Gradient projection methods for box-constrained problems
   - spectral properties of steplengths in the quadratic case
   - new steplength rules taking into account the constraints
3. Scaled gradient projection methods
   - define the diagonal scaling
   - the steplengths in variable metric approaches
   - practical behaviour in imaging
4. Conclusions

SLIDE 3

Motivation for the steplength analysis

Constrained optimization problems

min_{x∈Ω} f(x)    (1)

f : ℝᴺ → ℝ continuously differentiable function
Ω ⊂ ℝᴺ nonempty closed convex set defined by simple constraints

Gradient Projection (GP) methods for min_{x∈Ω} f(x):

x(k+1) = x(k) + ϑk d(k),   d(k) = PΩ(x(k) − αk∇f(x(k))) − x(k)

with αk > 0, ϑk ∈ (0, 1], PΩ(x) = argmin_{z∈Ω} ‖z − x‖.

Usually the updating rules for the steplength αk are those exploited in the unconstrained case: is this a suitable choice?
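To fix ideas, here is a minimal sketch of one GP iteration in the case where Ω is a box, so that PΩ reduces to componentwise clipping; the function names and the Python/NumPy setting are illustrative, not part of the original method description.

```python
import numpy as np

def project_box(x, lo, hi):
    # Euclidean projection onto Omega = {lo <= x <= hi}: componentwise clipping
    return np.clip(x, lo, hi)

def gp_step(x, grad_fx, alpha, theta, lo, hi):
    # d(k) = P_Omega(x(k) - alpha_k * grad f(x(k))) - x(k)
    d = project_box(x - alpha * grad_fx, lo, hi) - x
    # x(k+1) = x(k) + theta_k * d(k), with theta_k in (0, 1]
    return x + theta * d
```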

SLIDE 4

Spectral analysis of steplength selections

➤ The unconstrained case
➤ The box-constrained case
➤ The Scaled Gradient Projection methods

SLIDE 5

Steplength selection: the unconstrained case

The recipe exploited by state-of-the-art selection rules:
- define steplengths by trying to capture, in an inexpensive way, some second-order information
- design selection rules in the strictly convex quadratic case
  f(x) = (1/2)xᵀAx − bᵀx, A symmetric positive definite,
  where second-order information ↔ spectral properties of A
- design selection rules that generalize, in an inexpensive way, to non-quadratic cases: ∇²f(x(k)) depends on the iterations, but ∇²f(x(k)) → ∇²f(x∗)

SLIDE 6

A popular example: the Barzilai-Borwein (BB) selection rules

Consider the gradient method for the problem min f(x):

x(k+1) = x(k) − αk∇f(x(k)),   k = 0, 1, . . .

Suggestion [Barzilai-Borwein, IMA J. Num. Anal. 1988]: force the matrix (αkI)⁻¹ to approximate the Hessian ∇²f(x(k)) by imposing quasi-Newton properties:

αk^BB1 = argmin_{α∈ℝ} ‖(αI)⁻¹ s(k−1) − z(k−1)‖ = (s(k−1)ᵀ s(k−1)) / (s(k−1)ᵀ z(k−1))

αk^BB2 = argmin_{α∈ℝ} ‖s(k−1) − (αI) z(k−1)‖ = (s(k−1)ᵀ z(k−1)) / (z(k−1)ᵀ z(k−1))

where s(k−1) = x(k) − x(k−1) and z(k−1) = ∇f(x(k)) − ∇f(x(k−1)).
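A minimal sketch of the two rules in NumPy, directly from the formulas above (the safeguards against a nonpositive denominator, which the later slides handle via [αmin, αmax], are omitted here):

```python
import numpy as np

def bb_steplengths(x_new, x_old, g_new, g_old):
    """BB1 and BB2 steplengths from the last two iterates and gradients."""
    s = x_new - x_old                # s(k-1)
    z = g_new - g_old                # z(k-1)
    bb1 = (s @ s) / (s @ z)          # s's / s'z
    bb2 = (s @ z) / (z @ z)          # s'z / z'z
    return bb1, bb2
```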

SLIDE 7

Spectral properties of the BB steplength rules

Consider a gradient method for the quadratic unconstrained case:

min f(x) ≡ (1/2)xᵀAx − bᵀx,   A = diag(λ1, . . . , λN), 0 < λ1 < · · · < λN
x(k+1) = x(k) − αk g(k),   g(k) = ∇f(x(k)),   k = 0, 1, . . .

The gradient components then evolve as

gi(k+1) = (1 − αk λi) gi(k),   i = 1, . . . , N

- αk = 1/λi ⇒ gi(k+1) = 0 ⇒ gi(k+j) = 0, j = 2, 3, . . .
- αk+i−1 = 1/λi, i = 1, . . . , N ⇒ g(k+N) = 0 (finite termination)

Hence αk must aim at approximating the inverses of the eigenvalues of A.
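The finite-termination property is easy to verify numerically; a small sketch with illustrative data (the eigenvalues and b below are arbitrary, not the toy problem of the next slide):

```python
import numpy as np

# On A = diag(lam), taking alpha_k = 1/lam_{k+1} annihilates one gradient
# component per iteration, so the gradient vanishes after N steps.
lam = np.array([1.0, 10.0, 100.0])   # eigenvalues of A (illustrative)
b = np.array([1.0, -2.0, 3.0])
x = np.zeros(3)
g = lam * x - b                      # gradient of (1/2)x'Ax - b'x
for lam_i in lam:
    x = x - (1.0 / lam_i) * g
    g = lam * x - b
print(np.linalg.norm(g))             # ~0 up to rounding
```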

SLIDE 8

BB rules in the quadratic case

1/λN ≤ αk^BB2 = (g(k−1)ᵀA g(k−1)) / (g(k−1)ᵀA² g(k−1)) ≤ αk^BB1 = (g(k−1)ᵀ g(k−1)) / (g(k−1)ᵀA g(k−1)) ≤ 1/λ1

Example

f(x) = (1/2)xᵀAx − bᵀx,  A = diag(λ1, . . . , λ10), λi = 111i − 110
b random vector, bi ∈ [−10, 10]
stopping rule: ‖g(k)‖ ≤ 10⁻⁸ ‖g(0)‖
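A sketch reproducing this toy problem with a plain BB1 gradient iteration (the random seed and the first steplength are arbitrary choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 111.0 * np.arange(1, 11) - 110.0   # lambda_i = 111*i - 110, i = 1..10
b = rng.uniform(-10.0, 10.0, size=10)    # b_i in [-10, 10]

grad = lambda x: lam * x - b             # gradient of (1/2)x'Ax - b'x, A = diag(lam)
x_old = np.zeros(10)
g_old = grad(x_old)
g0 = np.linalg.norm(g_old)
x = x_old - (1.0 / lam[-1]) * g_old      # first step with alpha_0 = 1/lambda_max
k = 1
while np.linalg.norm(grad(x)) > 1e-8 * g0:
    g = grad(x)
    s, z = x - x_old, g - g_old          # BB1 = s's / s'z
    x_old, g_old = x, g
    x = x - ((s @ s) / (s @ z)) * g
    k += 1
print("iterations:", k)
```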

SLIDE 9

Quadratic case: exploiting spectral properties

In the quadratic case (A = diag(λ1, . . . , λN), 0 < λ1 < · · · < λN) we have

gj(k+1) = (1 − αk λj) gj(k),   j = 1, . . . , N

so that, if αk ≈ 1/λi,
- |gi(k+1)| ≪ |gi(k)|  (very useful)
- |gj(k+1)| < |gj(k)| if j < i  (useful)
- |gj(k+1)| > |gj(k)| if j > i and λj > 2λi  (dangerous)

Moreover, αk^BB2 / αk^BB1 = cos²(g(k−1), A g(k−1)).

Idea for improving the BB rules:
- force a sequence of small αk^BB2 to reduce |gi| for large i, leading to gradients in which these components are not dominant
- after a sequence of small αk, if αk^BB2 / αk^BB1 ≈ 1, exploit αk^BB1 = (gᵀg) / (gᵀAg), aiming at obtaining αk^BB1 ≈ 1/λi for small i

SLIDE 10

Practical implementations of this idea: ABB and ABBmin rules

Alternate Barzilai-Borwein selection rule [Zhou-Gao-Dai, COAP (2006)]:

αk^ABB =
  αk^BB2   if αk^BB2 / αk^BB1 < τ, with τ ∈ (0, 1)
  αk^BB1   otherwise

ABBmin rule [Frassoldati-Zanghirati-Zanni, JIMO (2008)]:

αk^ABBmin =
  min{ αj^BB2 | j = max{1, k − Mα}, . . . , k }   if αk^BB2 / αk^BB1 < τ
  αk^BB1   otherwise

where Mα > 0 is a parameter.
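A minimal sketch of the two selection rules, operating on the histories of BB1/BB2 values (the default values of tau and m_alpha below are illustrative placeholders, not the tuned values from the cited papers):

```python
def abb(bb1, bb2, tau=0.5):
    """Alternate BB rule: BB2 when the BB2/BB1 ratio is small, else BB1."""
    return bb2 if bb2 / bb1 < tau else bb1

def abbmin(bb1_hist, bb2_hist, tau=0.5, m_alpha=5):
    """ABBmin rule: the smallest of the last m_alpha BB2 values, else BB1."""
    bb1, bb2 = bb1_hist[-1], bb2_hist[-1]
    if bb2 / bb1 < tau:
        return min(bb2_hist[-m_alpha:])
    return bb1
```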

SLIDE 11

ABB and ABBmin rules on the previous toy problem

[Figure: steplength sequences αk generated by ABB and ABBmin on the toy problem.]

Compared methods:
- Cauchy Steepest Descent (CSD): αk = argmin_{α>0} f(x(k) − α g(k))
- BB1: αk = αk^BB1
- BB2: αk = αk^BB2
- ABB: alternation
- ABBmin: modified alternation

[Figure: relative error ‖x(k) − x∗‖/‖x∗‖ versus iterations for CSD, BB1, BB2, ABB, ABBmin.]

SLIDE 12

Similar behaviour on randomly generated test problems

Quadratic test problems: N = 1000
- λ1 = 1, λN = 10⁴, with λi, i = 2, . . . , N − 1, log-spaced
- λ̲ = 1, λ̄ = 10³, λi = λ̲ + (λ̄ − λ̲) si, with si ∈ (0, 0.2) for i = 1, . . . , N/2 and si ∈ (0.8, 1) for i = N/2 + 1, . . . , N

[Di Serafino-Ruggiero-Toraldo-Z., AMC 2018]

SLIDE 13

Other efficient steplength rules based on spectral properties

[Pronzato-Zhigljavsky, Comput. Optim. Appl. 50 (2011)]
[Fletcher, Math. Program. Ser. A 135 (2012)]
[Pronzato-Zhigljavsky-Bukina, Acta Appl. Math. 127 (2013)]
[De Asmundis-Di Serafino-Riccio-Toraldo, IMA J. Numer. Anal. 33 (2013)]
[De Asmundis-Di Serafino-Hager-Toraldo-Zhang, Comput. Optim. Appl. 59 (2014)]
[Gonzaga-Schneider, Comput. Optim. Appl. 63 (2016)]
[Gonzaga, Math. Program. Ser. A 160 (2016)]

These rules:
- aim at breaking the well-known cycling behaviour of the Steepest Descent method
- share an R-linear convergence rate in the quadratic case
- do not all generalize easily to non-quadratic problems (BB-based rules have this crucial property)

SLIDE 14

General unconstrained problems: minx∈RN f(x)

Gradient methods with nonmonotone linesearch:

Init.: x(0) ∈ ℝᴺ, 0 < αmin ≤ αmax, α0 ∈ [αmin, αmax], δ, σ ∈ (0, 1), M ∈ ℕ
for k = 0, 1, . . .
    νk = αk;  fref = max{f(x(k−j)), 0 ≤ j ≤ min(k, M)}
    while f(x(k) − νk g(k)) > fref − σ νk g(k)ᵀg(k)   (line search)
        νk = δ νk
    end
    x(k+1) = x(k) − νk g(k)
    define a tentative steplength αk+1 ∈ [αmin, αmax]
end

➤ Tentative steplength: exploit effective steplength selections designed for the quadratic case and generalizable in an inexpensive way (a sketch follows below).
➤ R-linear convergence of {f(x(k))} when f is strongly convex with Lipschitz-continuous gradient ([Dai, JOTA 2002], [Dai-Liao, IMA J. Num. Anal. 2002]).
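A runnable sketch of this scheme, using BB1 as the tentative steplength; all default parameter values are illustrative:

```python
import numpy as np

def nonmonotone_gradient(f, grad, x0, alpha0=1.0, alpha_min=1e-10,
                         alpha_max=1e10, delta=0.5, sigma=1e-4, M=10,
                         max_iter=500, tol=1e-8):
    """Gradient method with nonmonotone linesearch; tentative steplength = BB1."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    f_hist = [f(x)]
    alpha = alpha0
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        f_ref = max(f_hist[-(M + 1):])       # max over the last min(k, M)+1 values
        nu = alpha
        while f(x - nu * g) > f_ref - sigma * nu * (g @ g):   # line search
            nu *= delta
        x_new = x - nu * g
        g_new = grad(x_new)
        s, z = x_new - x, g_new - g
        sz = s @ z
        # tentative BB1 steplength, safeguarded in [alpha_min, alpha_max]
        alpha = np.clip((s @ s) / sz, alpha_min, alpha_max) if sz > 0 else alpha_max
        x, g = x_new, g_new
        f_hist.append(f(x))
    return x
```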

SLIDE 15

The standard BB rules can be improved

Trigonometric test problems: n = 50
f(x) = ‖b − (Av(x) + Bu(x))‖², v(x) = (sin(x1), . . . , sin(xn))ᵀ, u(x) = (cos(x1), . . . , cos(xn))ᵀ,
A, B n × n random matrices with integer entries in (−100, 100)

[Figure: convergence history on the trigonometric test problems.]

Convex2 test problems: N = 100
f(x) = Σᵢ₌₁ᴺ (i/10)(e^(xi) − xi)

[Figure: convergence history on the Convex2 test problems.]

SLIDE 16

The steplengths mimic the behaviour in the quadratic case

Convex2 test problems: N = 100
- Green squares: 20 eigenvalues of ∇²f(x∗) with linearly spaced indices
- Blue dots: 20 eigenvalues of ∇²f(x(k)) with linearly spaced indices
- Red cross: 1/νk (black circles mean linesearch reductions)

[Figure: 1/νk versus iterations for the ABBmin rule, compared against the Hessian spectra.]

When the Hessian eigenvalues stabilize, the steplengths exhibit the spectral properties observed in the quadratic case, but with respect to the spectrum of the current Hessian.

SLIDE 17

Spectral analysis of steplength selections

➤ The unconstrained case
➤ The box-constrained case
➤ The Scaled Gradient Projection methods

SLIDE 18

What about the constrained case?

Constrained minimization problems: minx∈Ω f(x)

x(k+1) = x(k) + ϑk d(k),   d(k) = PΩ(x(k) − αk∇f(x(k))) − x(k)
PΩ(x) = argmin_{z∈Ω} ‖z − x‖,   Ω ⊂ ℝᴺ

More difficult analysis:
➤ the goal is no longer the annihilation of the gradient
➤ the gradient projection step makes the relation between successive gradients more complicated

Motivation for generalizing the analysis:
➤ BB rules are considered very effective also in the constrained case and have been successfully exploited in many interesting applications
➤ ABB strategies seem still to outperform the standard BB rules

SLIDE 19

The simplest case: box-constrained quadratic problems

min_{ℓ≤x≤u} f(x) ≡ (1/2)xᵀAx − bᵀx,   A sym. pos. def., ℓ, u ∈ ℝⁿ

Gradient Projection (GP) method

Init.: ℓ ≤ x(0) ≤ u, 0 < αmin ≤ αmax, α0 ∈ [αmin, αmax], δ, σ ∈ (0, 1), M ∈ ℕ
for k = 0, 1, . . .
    d(k) = P_{ℓ≤x≤u}(x(k) − αk g(x(k))) − x(k)   (gradient projection step)
    ϑk = 1;  fref = max{f(x(k−j)), 0 ≤ j ≤ min(k, M)}
    while f(x(k) + ϑk d(k)) > fref + σ ϑk g(k)ᵀd(k)   (line search)
        ϑk = δ ϑk
    end
    x(k+1) = x(k) + ϑk d(k)
    define the steplength αk+1 ∈ [αmin, αmax]   (steplength updating rule)
end
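A runnable sketch of this GP scheme in NumPy, with plain BB1 as a placeholder steplength update; the default parameters and the zero starting point are illustrative:

```python
import numpy as np

def gp_box_qp(A, b, lo, hi, alpha0=1.0, delta=0.5, sigma=1e-4, M=10,
              alpha_min=1e-10, alpha_max=1e10, max_iter=1000):
    """GP method for min (1/2)x'Ax - b'x subject to lo <= x <= hi."""
    f = lambda x: 0.5 * x @ A @ x - b @ x
    x = np.clip(np.zeros_like(b), lo, hi)
    g = A @ x - b
    alpha = alpha0
    f_hist = [f(x)]
    for _ in range(max_iter):
        d = np.clip(x - alpha * g, lo, hi) - x        # gradient projection step
        if np.linalg.norm(d) < 1e-10:
            break
        theta = 1.0
        f_ref = max(f_hist[-(M + 1):])
        while f(x + theta * d) > f_ref + sigma * theta * (g @ d):  # line search
            theta *= delta
        x_new = x + theta * d
        g_new = A @ x_new - b
        s, z = x_new - x, g_new - g
        alpha = np.clip((s @ s) / (s @ z), alpha_min, alpha_max)   # BB1 update
        x, g = x_new, g_new
        f_hist.append(f(x))
    return x
```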

SLIDE 20

Box-constrained QP: min_{ℓ≤x≤u} f(x) ≡ (1/2)xᵀAx − bᵀx

The solution x∗ satisfies

g(x∗)i = 0 for ℓi < x∗i < ui   (i ∈ I∗)
g(x∗)i ≤ 0 for x∗i = ui        (i ∈ J∗)
g(x∗)i ≥ 0 for x∗i = ℓi        (i ∈ J∗)

Define the sets of indices

Jk−1 = {i : (x(k−1)i = ℓi ∧ g(k−1)i ≥ 0) ∨ (x(k−1)i = ui ∧ g(k−1)i ≤ 0)}
Ik−1 = {1, . . . , n} \ Jk−1

Possible idea: since g(x∗)i = 0 for i ∈ I∗, exploit the steplength rules to accelerate the reduction of |g(k−1)i|, i ∈ Ik−1, as done in the unconstrained case.

SLIDE 21

Are the BB steplength rules useful to this purpose?

αBB1

k

= s(k−1)T s(k−1) s(k−1)T z(k−1) , αBB2

k

= s(k−1)T z(k−1) z(k−1)T z(k−1) ,

s(k−1) = (x(k) − x(k−1)) z(k−1) = (g(k)) − g(k−1))

What about s(k−1)?

(observe that x(k)

j

= x(k−1)

j

, for j ∈ Jk−1) s(k−1)

Jk−1 = 0 ⇒

                   αBB1

k

= s(k−1)

Ik−1 T s(k−1) Ik−1

s(k−1)

Ik−1 T z(k−1) Ik−1

= argmin

α∈R

(αkI)−1s(k−1)

Ik−1 − z(k−1) Ik−1

αBB2

k

= s(k−1)

Ik−1 T z(k−1) Ik−1

z(k−1)

Ik−1 T z(k−1) Ik−1 + z(k−1) Jk−1 T z(k−1) Jk−1

Only the αBB1

k

rule is able to capture the spectral properties of the Reduced Hessian AIk−1,Ik−1 at the current iteration: λmin(AIk−1,Ik−1) ≤ 1/αBB1

k

≤ λmax(AIk−1,Ik−1).

SLIDE 22

Box-constrained QP: different behaviour of αk^BB1 and αk^BB2

TP1: n = 1000, 500 active constraints, λi(A_{Ik−1,Ik−1}) ∈ [10, 10³] log-spaced

[Figure: steplength sequences of BB1 and BB2 on TP1.]

TP2: λi = (M + m)/2 + ((M − m)/2) cos(π(i − 1)/(n − 1)), m = 1, M = 10³

[Figure: steplength sequences of BB1 and BB2 on TP2.]

SLIDE 23

New proposals [Crisci-Ruggiero-Zanni, AMC 2019]

αk^BB2 = (s(k−1)_{Ik−1}ᵀ z(k−1)_{Ik−1}) / (z(k−1)_{Ik−1}ᵀ z(k−1)_{Ik−1} + z(k−1)_{Jk−1}ᵀ z(k−1)_{Jk−1})
→ αk^MBB2 = (s(k−1)_{Ik−1}ᵀ z(k−1)_{Ik−1}) / (z(k−1)_{Ik−1}ᵀ z(k−1)_{Ik−1})

Modified BB2 steplength rule:

λmin(A_{Ik−1,Ik−1}) ≤ 1/αk^BB1 ≤ 1/αk^MBB2 ≤ λmax(A_{Ik−1,Ik−1}).
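A small sketch of the MBB2 computation, restricting s(k−1) and z(k−1) to the free variables; the box bounds lo, hi and all names are illustrative:

```python
import numpy as np

def mbb2(x_prev, g_prev, x, g, lo, hi):
    """Modified BB2 steplength: restrict s(k-1), z(k-1) to I_{k-1}."""
    # J_{k-1}: components at a bound whose gradient sign keeps them there
    at_lo = (x_prev == lo) & (g_prev >= 0)
    at_hi = (x_prev == hi) & (g_prev <= 0)
    free = ~(at_lo | at_hi)          # I_{k-1}
    s = (x - x_prev)[free]
    z = (g - g_prev)[free]
    return (s @ z) / (z @ z)
```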

TP2: λi = (M + m)/2 + ((M − m)/2) cos(π(i − 1)/(n − 1)), m = 1, M = 10³

[Figure: steplength sequences of MBB2 on TP2.]

SLIDE 24

Modified BB2 can be exploited within ABB strategies

Modified ABBmin rule

αk^MABBmin =
  min{ αj^MBB2 | j = max{1, k − Mα}, . . . , k }   if αk^MBB2 / αk^BB1 < τ
  αk^BB1   otherwise

where Mα > 0 is a parameter.

TP3: n = 1000, 500 active constraints, λi(A) ∈ [1, 10³] log-spaced

[Figure: steplength sequences of ABBmin and MABBmin on TP3.]

SLIDE 25

Performance profile: box-constrained QP test problems

Test problems [Moré-Toraldo, Num. Math. 1989]: 108 box-constrained QPs, 15000 ≤ n ≤ 25000, K(A) = 10⁴, 10⁵, 10⁶, nact = 0.1n, 0.5n, 0.9n

Methods: GP method with nonmonotone linesearch and different steplength rules (BB2, MBB2, ABBmin, MABBmin, ...)

Stopping rule: ‖ϕ(x(k))‖₂ ≤ 10⁻⁵ ‖∇f(x(0))‖₂, where

ϕ(x)i =
  (∇f(x))i           for ℓi < xi < ui
  min{0, (∇f(x))i}   for xi = ℓi
  max{0, (∇f(x))i}   for xi = ui

SLIDE 26

Performance profile (x ← Tsolver/Tmin, y ← % of problems solved within x·Tmin)

➤ The new MBB2 selection outperforms the standard BB2 rule

[Figure: performance profiles of BB2 and MBB2.]

➤ Alternated strategies are preferable also in the constrained case

[Figure: performance profiles of the alternated strategies.]

SLIDE 27

General box-constrained problems: minℓ≤x≤u f(x)

Test Problems [Facchinei-Judice-Soares, ACM TOMS 1997]:

min_{ℓ≤x≤u} f(x) ≡ g(x) + Σ_{i∈L} hi(xi) − Σ_{i∈U} hi(xi)
L = {i | x∗i = ℓi},  U = {i | x∗i = ui}

g(x) ∈ { Trigonometric, Chained Rosenbrock, Laplace2 }

hi(xi) ∈ { βi(xi − x∗i),
           αi(xi − x∗i)³ + βi(xi − x∗i),
           αi(xi − x∗i)^(7/3) + βi(xi − x∗i) }

Problem size: n = 500; total number of problems: 108

[Figure: performance profiles on the general box-constrained test problems.]

SLIDE 28

Alternated BB are preferable

[Figure: performance profile on the general box-constrained test problems.]

Alternated BB in practical applications:
- Machine learning: decomposition techniques for training Support Vector Machines [Serafini-Zanghirati-Z., Par. Comput. 2003, OMS 2005, JMLR 2006]
- Imaging problems in Astronomy, Microscopy, Computed Tomography [Bonettini-Zanella-Z., Inv. Prob. 2009], [Loris et al., ACHA 2009], [Ruggiero et al., JGO 2010], [Prato et al., A&A 2012], [Zanella et al., Sci. Rep. 2013], [Piccolomini et al., COAP 2018]

SLIDE 29

Spectral analysis of steplength selections

➤ The unconstrained case
➤ The box-constrained case
➤ The Scaled Gradient Projection methods

SLIDE 30

Basic variable metric approaches

In many imaging applications the behaviour of gradient projection schemes is largely improved by exploiting variable metric approaches.

Scaled Gradient Projection (SGP) methods for min_{x∈Ω} f(x):

x(k+1) = x(k) + ϑk d(k),  ϑk ∈ (0, 1],
d(k) = P_{Ω,Dk⁻¹}(x(k) − αk Dk ∇f(x(k))) − x(k),  αk > 0,
P_{Ω,Dk⁻¹}(x) = argmin_{z∈Ω} ‖z − x‖_{Dk⁻¹},  Dk sym. pos. def. matrix,
‖z − x‖_{Dk⁻¹} = √((z − x)ᵀ Dk⁻¹ (z − x))

How can the matrix Dk be chosen? How can the steplength rules for αk be modified to take the scaling matrix into account?
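When Dk is diagonal and Ω is a box, the projection in the Dk⁻¹-norm decouples componentwise and therefore coincides with ordinary clipping, so the SGP step stays inexpensive; a minimal sketch with illustrative names:

```python
import numpy as np

def sgp_step(x, grad_fx, alpha, theta, d_diag, lo, hi):
    """One SGP step with diagonal scaling D_k = diag(d_diag), box constraints.
    For a box, the D^{-1}-norm projection separates over the components, so
    it reduces to plain clipping."""
    y = x - alpha * d_diag * grad_fx      # scaled gradient step
    d = np.clip(y, lo, hi) - x            # d(k)
    return x + theta * d                  # theta in (0, 1]
```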

SLIDE 31

SGP methods: convergence analysis

x(k+1) = x(k) + ϑk d(k),  ϑk ∈ (0, 1],  d(k) = P_{Ω,Dk⁻¹}(x(k) − αk Dk ∇f(x(k))) − x(k),  αk > 0

Analysis of {x(k)} and {f(x(k))} [Bonettini-Prato, Inv. Prob. 2015]:
- Dk with eigenvalues in [1/µ, µ], µ ≥ 1
- αk ∈ [αmin, αmax], 0 < αmin ≤ αmax
- ϑk such that f(x(k+1)) ≤ f(x(k)) + σ ϑk ∇f(x(k))ᵀ d(k)
⇒ if x(kl) → x∗ as l → ∞, then ∇f(x∗)ᵀ(x − x∗) ≥ 0 for all x ∈ Ω

If, in addition,
- f(x) is convex and the solution set X∗ is not empty
- µk² = 1 + γk, γk ≥ 0, Σ_{k=0}^∞ γk < ∞
- Dk is s.p.d. with eigenvalues in [1/µk, µk]
⇒ x(k) → x∗ ∈ X∗ as k → ∞

and, if ∇f is Lipschitz on Ω,
⇒ f(x(k)) − f∗ = O(1/k)

SLIDE 32

Variable metric updating: the choice of the matrix Dk

A standard choice:

Dk = diag(D1(k), D2(k), . . . , DN(k)),
Di(k) = min{ ρ, max{ 1/ρ, (∂²f(x(k))/∂xi²)⁻¹ } },  i = 1, . . . , N

Alternatively, define Dk by exploiting only first-order information. Consider the special non-negatively constrained case min_{x≥0} f(x) and the corresponding KKT conditions:

∇f(x) − ξ = 0,  x ≥ 0,  ξ ≥ 0,  xi ξi = 0, i = 1, . . . , N
⇔ x · ∇f(x) = 0,  x ≥ 0,  ∇f(x) ≥ 0,

where "·" denotes the component-wise product.

SLIDE 33

Variable metric updating: the choice of the matrix Dk

Split the gradient [Lantéri-Roche-Aime, Inv. Prob. (2002)]:

∇f(x) = V(x) − U(x),  V(x) > 0,  U(x) ≥ 0,

and use the splitting in the nonlinear equation x · ∇f(x) = 0:

x · V(x) = x · U(x) = x · (−∇f(x) + V(x)),
x = x − (x / V(x)) · ∇f(x) = x − Dx ∇f(x),  Dx = diag(x1/V1(x), . . . , xN/VN(x))

Iterative methods for x · ∇f(x) = 0 based on the scaled gradient direction:

x(k+1) = x(k) − Dk ∇f(x(k)),  Dk = diag(x1(k)/V1(x(k)), . . . , xN(k)/VN(x(k)))
SLIDE 34

Variable metric updating: the choice of the matrix Dk

The same suggestion arises from a Majorization-Minimization (MM) framework [Yang-Oja, IEEE Trans. Neural Net. (2011)].

➤ Consider a discrepancy function D(Hx, g), Hi,j ≥ 0, xi > 0, written as

D(Hx, g) = Σ_{d=1}^p Σ_{i=1}^n αd,i h((Hx)i, ζd),   h(σ, t) = (σ^t − 1)/t if t ≠ 0,  log(σ) if t = 0

➤ A surrogate function G(x, x̄) of D(Hx, g) at x̄ can be defined, up to an additive constant, in terms of the splitting

∇D(Hx̄, g) = V(x̄) − U(x̄),  V(x̄) > 0,  U(x̄) ≥ 0:

G(x, x̄) = Σ_{j=1}^n [ x̄j (V(x̄))j h(xj/x̄j, ζmax) − x̄j (U(x̄))j h(xj/x̄j, ζmin) ],

where ζmax = max_{d∈{1,...,p}} ζd, ζmin = min_{d∈{1,...,p}} ζd.

SLIDE 35

Variable metric updating: the choice of the matrix Dk

➤ Since ∂G(x, x̄)/∂xj = (V(x̄))j (xj/x̄j)^(ζmax−1) − (U(x̄))j (xj/x̄j)^(ζmin−1), the corresponding MM method (based on ∇G(x, x(k)) = 0) leads to

x(k+1) = argmin_{x≥0} G(x, x(k)) = x(k) · (U(x(k)) / V(x(k)))^(1/(ζmax−ζmin))

➤ In the special case of Least-Squares or Kullback-Leibler divergence, ζmax − ζmin = 1 and

x(k+1) = x(k) · (U(x(k)) / V(x(k))) = x(k) − (x(k) / V(x(k))) · ∇D(Hx(k), g)

Thus, this special scaled gradient step is a descent step for D(Hx(k), g).

SLIDE 36

Variable metric updating: the choice of the matrix Dk

Popular algorithms for imaging problems are based on this special scaling.

Iterative Space Reconstruction Algorithm (ISRA):

min_{x≥0} D(Hx, g) ≡ (1/2)‖Hx + bg − g‖²
x(k+1) = x(k) · (Hᵀg) / (Hᵀ(Hx(k) + bg)) = x(k) − (x(k) / (Hᵀ(Hx(k) + bg))) · ∇D(Hx(k), g),  x(0) > 0

Expectation Maximization (EM) or Richardson-Lucy (RL) algorithm:

min_{x≥0} D(Hx, g) ≡ Σ_{i=1}^n [ gi log(gi / (Hx + bg)i) + (Hx + bg)i − gi ]
x(k+1) = (x(k) / (Hᵀ1)) · Hᵀ(g / (Hx(k) + bg)) = x(k) − (x(k) / (Hᵀ1)) · ∇D(Hx(k), g),  x(0) > 0
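A compact sketch of the ISRA update above; all array names are illustrative, and H, g, bg are assumed nonnegative with x(0) > 0, so every iterate stays positive:

```python
import numpy as np

def isra(H, g, bg, x0, n_iter=100):
    """ISRA multiplicative updates for min_{x>=0} 0.5*||Hx + bg - g||^2."""
    x = np.array(x0, dtype=float)
    Hg = H.T @ g                          # U(x): constant numerator
    for _ in range(n_iter):
        x *= Hg / (H.T @ (H @ x + bg))    # x <- x * U(x) / V(x)
    return x
```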

SLIDE 37

Variable metric updating: the choice of the matrix Dk

The split gradient idea within scaled gradient projection schemes:

x(k+1) = x(k) + ϑk [ P_{Ω,Dk⁻¹}(x(k) − αk Dk ∇f(x(k))) − x(k) ]
Di(k) = min{ µk, max{ 1/µk, xi(k)/Vi(x(k)) } },  Vi(x(k)) > 0,  i = 1, . . . , N

A similar idea is used in [Hager-Mair-Zhang, Math. Program. (2009)]:

Di(k) = (αk xi(k)) / (xi(k) + αk (∇f(x(k)))i⁺),  i = 1, . . . , N,  (t)⁺ = max{0, t},

and works for more general constraints [Hager-Zhang, COAP (2014); Bonettini et al., SIAM J. Sci. Comp. 2015].

SLIDE 38

The steplengths in Scaled Gradient Methods

Consider the scaled gradient method x(k+1) = x(k) − αk Dk g(k) and recall the quadratic case min f(x) = (1/2)xᵀAx − bᵀx. Consider the preconditioned problem

f̃(y) = (1/2)yᵀ D^(1/2) A D^(1/2) y − bᵀ D^(1/2) y

and the gradient iteration y(k+1) = y(k) − αk g̃(k), g̃(k) = ∇f̃(y(k)). Letting y(k) = D^(−1/2) x(k), we have g̃(k) = D^(1/2) g(k) and

y(k+1) = D^(−1/2)(x(k) − αk D g(k)) = D^(−1/2) x(k+1),

so a gradient step on y(k) corresponds to a scaled gradient step on x(k).

➤ Exploit the BB rules defined for the preconditioned problem by using

u(k−1) = D^(−1/2)(x(k) − x(k−1)),  v(k−1) = D^(1/2)(g(k) − g(k−1))

SLIDE 39

The steplengths in Scaled Gradient Methods

Consider the scaled gradient method x(k+1) = x(k) − αk Dk g(k).

The BB rules with scaling: let s(k−1) = x(k) − x(k−1), z(k−1) = g(k) − g(k−1), u(k−1) = D^(−1/2) s(k−1), v(k−1) = D^(1/2) z(k−1), and define

αk^BB1 = (u(k−1)ᵀ u(k−1)) / (u(k−1)ᵀ v(k−1)) = (s(k−1)ᵀ Dk⁻¹ s(k−1)) / (s(k−1)ᵀ z(k−1))
αk^BB2 = (u(k−1)ᵀ v(k−1)) / (v(k−1)ᵀ v(k−1)) = (s(k−1)ᵀ z(k−1)) / (z(k−1)ᵀ Dk z(k−1))
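A sketch of the two scaled rules for a diagonal metric Dk = diag(d_diag); names are illustrative and safeguards are omitted:

```python
import numpy as np

def scaled_bb(s, z, d_diag):
    """Scaled BB rules for D_k = diag(d_diag):
    BB1 = s'D^{-1}s / s'z,  BB2 = s'z / z'Dz."""
    bb1 = (s @ (s / d_diag)) / (s @ z)
    bb2 = (s @ z) / (z @ (d_diag * z))
    return bb1, bb2
```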

SLIDE 40

The steplengths in Scaled Gradient Methods

Consider the scaled gradient method x(k+1) = x(k) − αk Dk g(k).

Another interpretation of the scaled BB rules: force the matrix (αk Dk)⁻¹ to approximate the Hessian ∇²f(x(k)) by imposing quasi-Newton properties in the variable metric:

αk^BB1 = (s(k−1)ᵀ Dk⁻¹ s(k−1)) / (s(k−1)ᵀ z(k−1)) = argmin_{α∈ℝ} ‖(αDk)⁻¹ s(k−1) − z(k−1)‖_{Dk}
αk^BB2 = (s(k−1)ᵀ z(k−1)) / (z(k−1)ᵀ Dk z(k−1)) = argmin_{α∈ℝ} ‖s(k−1) − (αDk) z(k−1)‖_{Dk⁻¹}

SLIDE 41

The steplengths in Scaled Gradient Projection Methods

On the basis of the previous remarks, in the case of box-constrained problems, instead of the standard BB2 rule

αk^BB2 = (s(k−1)ᵀ z(k−1)) / (z(k−1)ᵀ Dk z(k−1))

try to exploit

αk^MBB2 = (s(k−1)_{Ik−1}ᵀ z(k−1)_{Ik−1}) / (z(k−1)_{Ik−1}ᵀ (Dk)_{Ik−1,Ik−1} z(k−1)_{Ik−1})

where

Jk−1 = {i : (x(k−1)i = ℓi ∧ g(k−1)i ≥ 0) ∨ (x(k−1)i = ui ∧ g(k−1)i ≤ 0)}
Ik−1 = {1, . . . , n} \ Jk−1

[Crisci-Porta-Ruggiero-Zanni, (2019)]

SLIDE 42

Example: 3D image reconstruction from limited tomographic data

min_{x≥0} J(x) + β JR(x)

Least-squares divergence:

J(x) = (1/2)‖Ax − b‖²,  ∇J(x(k)) = AᵀAx(k) − Aᵀb

Edge-preserving regularizer:

JR(x) = Σ_{jx=1}^{Nx} Σ_{jy=1}^{Ny} Σ_{jz=1}^{Nz} √(δ²x_{jx,jy,jz} + δ²)
δ²x_{jx,jy,jz} = (x_{jx+1,jy,jz} − x_{jx,jy,jz})² + (x_{jx,jy+1,jz} − x_{jx,jy,jz})² + (x_{jx,jy,jz+1} − x_{jx,jy,jz})²
∇JR(x(k)) = V(k) − U(k),  V(k) > 0,  U(k) ≥ 0

Scaling matrix derived from the gradient splitting:

Dk = min{ µk, max{ 1/µk, diag(x(k) / (AᵀAx(k) + βV(k))) } },  µk = √(1 + M/(k+1)^2.1)

[Piccolomini-Coli-Morotti-Zanni, Comput. Optim. Appl. (2018)]

SLIDE 43

Simulations on the 3D Shepp Logan phantom

Test problem features:
- exact volume x∗: Shepp-Logan phantom with Nv = 61³ ≈ 226K voxels
- projection matrix A ∈ M_{Np×Nv} with Nθ = 19 → Np = 61² × Nθ ≈ 70K
  (http://www.imm.dtu.dk/~pcha/TVReg/)

Test platform: Matlab 2016a on an Intel Core i7-6700.

Compared methods:
➤ GP equipped with BB1, BB2, MBB2, MABB steplengths
➤ SGP equipped with BB1, MBB2, MABB steplengths
➤ FISTA and Scaled FISTA algorithms

SLIDE 44

3D CT image reconstruction: the steplength behaviour

➤ GP and SGP: the new rules within alternated strategies are preferable

SLIDE 45

3D CT image reconstruction: the reconstruction error

➤ SGP and SFISTA are preferable when a reconstruction is required in a few seconds.

[Figure: relative reconstruction error (RRE) and relative objective decrease (f(x(k)) − f∗)/f∗ versus iterations for GP-MABB, SGP-MABB, FISTA and SFISTA, after 5, 10 and 20 seconds.]

SLIDE 46

Scaled FISTA for minx∈Ω f(x)

f convex, ∇f Lipschitz-continuous on Ω, dom(f) ⊇ Ω, X∗ ≠ ∅

y(k) = P_{Ω,Dk⁻¹}(x(k) + βk(x(k) − x(k−1)))   (new extrapolation step)
x(k+1) = P_{Ω,Dk⁻¹}(y(k) − αk Dk ∇f(y(k)))

Convergence analysis [Bonettini-Porta-Ruggiero, SIAM J. Sci. Comput. 2016]:
- αk such that f(x(k+1)) ≤ f(y(k)) + ∇f(y(k))ᵀ(x(k+1) − y(k)) + (1/(2αk)) ‖x(k+1) − y(k)‖²_{Dk⁻¹}
- β0 = 0, βk = (k − 1)/(k + a), k = 1, . . . , with a ≥ 2
- µk² = 1 + γk, γk ≥ 0, Σ_{k=0}^∞ γk < ∞, Dk s.p.d. with eigenvalues in [1/µk, µk]

⇒ f(x(k)) − f∗ ≤ C / (k − 1 + a)²,  k = 1, 2, . . .
⇒ if a > 2, limk→∞ x(k) = x∗
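A minimal sketch of this scaled FISTA iteration for box constraints with a fixed diagonal metric (for a box, both Dk⁻¹-norm projections reduce to clipping); the fixed steplength alpha is assumed to satisfy the descent condition above, and all names are illustrative:

```python
import numpy as np

def scaled_fista(grad_f, x0, alpha, d_diag, lo, hi, a=2.0, n_iter=200):
    """Scaled FISTA sketch: diagonal metric D = diag(d_diag), box constraints."""
    x_old = x = np.asarray(x0, dtype=float)
    for k in range(1, n_iter + 1):
        beta = (k - 1) / (k + a)                        # beta_1 = 0
        y = np.clip(x + beta * (x - x_old), lo, hi)     # new extrapolation step
        x_old = x
        x = np.clip(y - alpha * d_diag * grad_f(y), lo, hi)
    return x
```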

SLIDE 47

Reconstructions (LS + TV)

➤ After 5 sec. and after 20 sec.:
[Figure: reconstructed slices for GP MABB, SGP MABB and SFISTA.]

SLIDE 48

Comparison with the true image

➤ After 5 sec. and after 20 sec.:
[Figure: true image compared with the SGP MABB and SFISTA reconstructions.]

SLIDE 49

Conclusions

➤ Spectral properties of steplength rules in gradient methods:
- useful for understanding the behaviour of standard rules
- useful for designing improved selection rules

➤ Analysis of steplength rules in box-constrained problems:
- suitable modifications of state-of-the-art BB rules are suggested

➤ Analysis of steplength rules in scaled gradient projection methods:
- ad hoc BB rules exploiting spectral properties and scaling matrices

Work in progress:
- more general constraints, e.g. Ω = {ℓ ≤ x ≤ u, aᵀx = b}; preliminary results confirm the importance of the spectral analysis
- possible extension to stochastic gradient approaches

References and software: www.oasis.unimore.it
