Low Rank Approximation, Lecture 9. Daniel Kressner, Chair for Numerical Algorithms and HPC (PowerPoint presentation).

SLIDE 1

Low Rank Approximation Lecture 9

Daniel Kressner Chair for Numerical Algorithms and HPC Institute of Mathematics, EPFL daniel.kressner@epfl.ch

SLIDE 2

Manifold optimization

General setting: Aim at solving the optimization problem

min_{X ∈ Mr} f(X),

where Mr is a manifold of rank-r matrices or tensors. Goal: Modify classical optimization algorithms (line search, Newton, quasi-Newton, ...) to produce iterates that stay on Mr. Advantages over ALS:

◮ No need to solve subproblems, at least for first-order methods;
◮ Can draw on concepts from classical smooth optimization (line search strategies, convergence analysis, ...).

Two valuable resources:

◮ Absil/Mahony/Sepulchre: Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008. Available from https://press.princeton.edu/absil
◮ Manopt, a Matlab toolbox for optimization on manifolds. Available from https://manopt.org/

SLIDE 3

Manifolds

For open sets U ⊂ M and V ⊂ R^d, a chart is a bijective function ϕ : U → V. An atlas of M into R^d is a collection of charts (Uα, ϕα) such that:

◮ ∪α Uα = M;
◮ for any α, β with Uα ∩ Uβ ≠ ∅, the change of coordinates

ϕβ ◦ ϕα⁻¹ : R^d → R^d

is smooth (C∞) on its domain ϕα(Uα ∩ Uβ).

Illustration taken from Wikipedia.

SLIDE 4

Manifolds

In the following, we assume that the atlas is maximal. A proper definition of a smooth manifold M needs further properties (the topology induced by the maximal atlas is Hausdorff and second-countable). See [Lee’2003] and [Absil et al.’2008]. Properties of M:

◮ finite-dimensional vector spaces are always manifolds;
◮ d = dimension of M;
◮ M does not need to be connected (in the context of smooth optimization it makes sense to consider connected manifolds only);
◮ a function f : M → R is differentiable at a point x ∈ M if and only if f ◦ ϕ⁻¹ : ϕ(U) ⊂ R^d → R is differentiable at ϕ(x) for some chart (U, ϕ) with x ∈ U.
SLIDE 5

Manifolds: First examples

Lemma

Let M be a smooth manifold and N ⊂ M an open subset. Then N is a smooth manifold (of equal dimension).

Proof: Given an atlas for M, obtain an atlas for N by selecting the charts (U, ϕ) with U ⊂ N.

Example: GL(n, R), the set of real invertible n × n matrices, is a smooth manifold.

EFY. Show that R^{m×n}_*, the set of real m × n matrices of full rank min{m, n}, is a smooth manifold.
EFY. Show that the set of n × n symmetric positive definite matrices is a smooth manifold.

Two main classes of matrix manifolds:

◮ embedded submanifolds of R^{m×n}; Example: Stiefel manifold of orthonormal bases.
◮ quotient manifolds; Example: Grassmann manifold R^{m×n}_* / GL(n, R).

Will focus on embedded submanifolds (much easier to work with).

SLIDE 6

Immersions and submersions

Let M1, M2 be smooth manifolds and F : M1 → M2. Let x ∈ M1 and y = F(x) ∈ M2. Choose charts ϕ1, ϕ2 around x, y. Then the coordinate representation of F is given by

F̂ := ϕ2 ◦ F ◦ ϕ1⁻¹ : R^{d1} → R^{d2}.

◮ F is called smooth if F̂ is smooth (that is, C∞).
◮ The rank of F at x ∈ M1 is defined as the rank of DF̂(ϕ1(x)) (the Jacobian of F̂ at ϕ1(x)).
◮ F is called an immersion if its rank equals d1 at every x ∈ M1.
◮ F is called a submersion if its rank equals d2 at every x ∈ M1.

SLIDE 7

Embedded submanifolds

Subset N ⊂ M is called an embedded submanifold of dimension k in M if for each point p ∈ N there is a chart (U, ϕ) in M such that all elements of U ∩ N are obtained by varying first k coordinates only. (See Chapter 5 of [Lee’2003] for more details.)

Theorem

Let M, N be smooth manifolds and let F : M → N be a smooth map with constant rank ℓ. Then each level set F⁻¹(y) := {x ∈ M : F(x) = y} is a closed embedded submanifold of codimension ℓ in M.

Corollaries:

◮ If F : M → N is a submersion, then each level set is a closed embedded submanifold of codimension equal to the dimension of N.
◮ In fact, by the open submanifold lemma, one only needs to check the full rank condition of a submersion for points in the level set (replace M by the open set on which F has full rank).

SLIDE 8

The Stiefel manifold

For m ≥ n, consider the set of all m × n matrices with orthonormal columns: St(m, n) := {X ∈ Rm×n : X TX = In}.

Corollary

St(m, n) is an embedded submanifold of R^{m×n}.

Proof: Define F : R^{m×n} → symm(n) as F : X ↦ XᵀX, where symm(n) denotes the set of n × n symmetric matrices. At X ∈ St(m, n), consider the Jacobian DF(X) : H ↦ XᵀH + HᵀX. Given symmetric Y ∈ R^{n×n}, set H = XY/2. Then DF(X)[H] = Y; thus DF(X) is surjective.

EFY. What is the dimension of the Stiefel manifold?
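As a quick numerical sanity check (a Python/NumPy sketch, not part of the original slides), one can verify that H = XY/2 indeed satisfies DF(X)[H] = Y for a random point X ∈ St(m, n) and a random symmetric Y:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 7, 3

# Random point on St(m, n) via QR decomposition
X, _ = np.linalg.qr(rng.standard_normal((m, n)))

# Random symmetric matrix Y
Y = rng.standard_normal((n, n))
Y = (Y + Y.T) / 2

# Candidate preimage H = X Y / 2 under the Jacobian DF(X): H -> X^T H + H^T X
H = X @ Y / 2
DF_H = X.T @ H + H.T @ X

assert np.allclose(DF_H, Y)  # DF(X) hits every symmetric Y, hence is surjective
```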

SLIDE 9

The manifold of rank-k matrices

Locality of definition of embedded submanifolds implies the following lemma (Lemma 5.5 in [Lee’2003]).

Lemma

Let N be subset of smooth manifold M. Suppose every point p ∈ N has a neighborhood U ⊂ M such that U ∩ N is an embedded submanifold of U. Then N is an embedded submanifold of M.

Theorem

Given m ≥ n, the set Mk = {A ∈ Rm×n : rank(A) = k} is an embedded submanifold of Rm×n for every 0 ≤ k ≤ n.

SLIDE 10

The manifold of rank-k matrices

Choose arbitrary A0 ∈ Mk. After a suitable permutation, may assume w.l.o.g. that

A0 = [ A11  A12 ; A21  A22 ],   A11 ∈ R^{k×k} invertible.

This property remains true in an open neighborhood U ⊂ R^{m×n} of A0. Factorize A ∈ U as

A = [ I  0 ; A21 A11⁻¹  I ] · [ A11  0 ; 0  A22 − A21 A11⁻¹ A12 ] · [ I  A11⁻¹ A12 ; 0  I ].

Define F : U → R^{(m−k)×(n−k)} as F : A ↦ A22 − A21 A11⁻¹ A12. Then

F⁻¹(0) = U ∩ Mk.

SLIDE 11

The manifold of rank-k matrices

For arbitrary Y ∈ R^{(m−k)×(n−k)}, we obtain that

DF(A)[ [ 0  0 ; 0  Y ] ] = Y.

Thus, F is a submersion. In turn, U ∩ Mk is an embedded submanifold of U. By the lemma, Mk is an embedded submanifold of R^{m×n}.

EFY. What is the dimension of Mk?
EFY. Is Mk connected?
EFY. Prove that the set of symmetric rank-k matrices is an embedded submanifold of R^{n×n}. Is this manifold connected?
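A small numerical illustration (a NumPy sketch, assuming a random rank-k matrix whose leading k × k block is invertible, which holds generically): the Schur complement F(A) = A22 − A21 A11⁻¹ A12 vanishes exactly on rank-k matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 6, 5, 2

# Random rank-k matrix; generically its leading k x k block A11 is invertible
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
A11, A12 = A[:k, :k], A[:k, k:]
A21, A22 = A[k:, :k], A[k:, k:]

# Schur complement F(A) = A22 - A21 A11^{-1} A12
S = A22 - A21 @ np.linalg.solve(A11, A12)

assert np.allclose(S, 0)                 # rank(A) = k  =>  F(A) = 0
assert np.linalg.matrix_rank(A) == k
```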

SLIDE 12

Tangent space

In the following, much of the discussion is restricted to submanifolds M embedded in a vector space V with inner product ⟨·, ·⟩ and induced norm ∥·∥. Given a smooth curve γ : R → M with x = γ(0), we call γ′(0) ∈ V a tangent vector at x. The tangent space TxM ⊂ V is the set of all tangent vectors at x.

Lemma
TxM is a subspace of V.

Proof. If v1, v2 are tangent vectors then there are smooth curves γ1, γ2 such that x = γ1(0) = γ2(0) and γ1′(0) = v1, γ2′(0) = v2. To show that αv1 + βv2 for α, β ∈ R is again a tangent vector, consider a chart (U, ϕ) around x such that ϕ(x) = 0. Define γ(t) = ϕ⁻¹(αϕ(γ1(t)) + βϕ(γ2(t))) for t sufficiently close to 0. Then γ(0) = x and γ′(0) = αv1 + βv2.

EFY. Prove that the dimension of TxM equals the dimension of M using a coordinate chart.

SLIDE 13

Tangent space

Application of the definition to the Stiefel manifold. Let γ(t) = X + tY + O(t²) be a smooth curve with X ∈ St(m, n). To ensure that γ(t) ∈ St(m, n), we require

In = γ(t)ᵀγ(t) = (X + tY)ᵀ(X + tY) + O(t²) = In + t(XᵀY + YᵀX) + O(t²).

Thus, XᵀY + YᵀX = 0 characterizes the tangent space:

TX St(m, n) = {Y ∈ R^{m×n} : XᵀY = −YᵀX}
            = {XW + X⊥W⊥ : W ∈ R^{n×n}, W = −Wᵀ, W⊥ ∈ R^{(m−n)×n}},

where the columns of X⊥ form a basis of span(X)⊥.
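To make the characterization concrete, here is a small NumPy sketch (illustrative, not from the slides) that builds a tangent vector Y = XW + X⊥W⊥ from a skew-symmetric W and checks XᵀY + YᵀX = 0:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 6, 2

# Point X on St(m, n) and orthonormal basis X_perp of span(X)^perp
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
X, X_perp = Q[:, :n], Q[:, n:]

# Tangent vector Y = X W + X_perp W_perp with W skew-symmetric
W = rng.standard_normal((n, n))
W = (W - W.T) / 2
W_perp = rng.standard_normal((m - n, n))
Y = X @ W + X_perp @ W_perp

assert np.allclose(X.T @ Y + Y.T @ X, 0)  # Y lies in T_X St(m, n)
```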

SLIDE 14

Tangent space

When M is defined (at least locally) as the level set of a constant-rank function F : V → R^N, we have TxM = ker(DF(x)).

Proof. Let v ∈ TxM, that is, there is a curve γ : R → M such that γ(0) = x and γ′(0) = v. Then, by the chain rule,

DF(x)[v] = DF(x)[γ′(0)] = (∂/∂t) F(γ(t)) |_{t=0} = 0,

because F is constant on M. Thus, TxM ⊂ ker(DF(x)), which completes the proof by counting dimensions.

SLIDE 15

Tangent space of Mk

Recall that Mk was obtained as the level set of the local submersion F : A ↦ A22 − A21 A11⁻¹ A12. Given A ∈ Mk, consider the SVD

A = [ U  U⊥ ] [ Σ  0 ; 0  0 ] [ V  V⊥ ]ᵀ.

We have DF([ Σ 0 ; 0 0 ])[H] = H22. Thus, H is in the kernel if and only if H22 = 0. In terms of A this implies

TA Mk = ker(DF(A)) = [ U  U⊥ ] [ R^{k×k}  R^{k×(n−k)} ; R^{(m−k)×k}  0 ] [ V  V⊥ ]ᵀ
      = {UMVᵀ + Up Vᵀ + U Vpᵀ : M ∈ R^{k×k}, UpᵀU = VpᵀV = 0}.

EFY. Compute the tangent space for the embedded submanifold of rank-k symmetric matrices.

SLIDE 16

Riemannian manifold and gradient

For a submanifold M embedded in a vector space V: the inner product ⟨·, ·⟩ on V induces an inner product on TxM. This turns M into a Riemannian manifold.¹ The (Riemannian) gradient of a smooth f : M → R at x ∈ M is defined as the unique element grad f(x) ∈ TxM that satisfies

⟨grad f(x), ξ⟩ = Df(x)[ξ]   for all ξ ∈ TxM.

EFY. Prove that the Riemannian gradient satisfies the steepest ascent property

grad f(x) / ∥grad f(x)∥₂ = arg max_{ξ ∈ TxM, ∥ξ∥=1} Df(x)[ξ].

¹ In general, for a Riemannian manifold one needs an inner product on TxM that varies smoothly with x.

SLIDE 17

Riemannian gradient

For a submanifold M embedded in a vector space V: the (Euclidean) gradient of f in V admits the decomposition

∇f(x) = Px ∇f(x) + Px⊥ ∇f(x),

where Px, Px⊥ are the orthogonal projections onto TxM and Tx⊥M. For every ξ ∈ TxM we have

Df(x)[ξ] = ⟨∇f(x), ξ⟩ = ⟨Px ∇f(x), ξ⟩ + ⟨Px⊥ ∇f(x), ξ⟩ = ⟨Px ∇f(x), ξ⟩,

where we used that Px⊥ ∇f(x) ⊥ ξ. Hence,

grad f(x) = Px ∇f(x).

The Riemannian gradient is the orthogonal projection of the Euclidean gradient onto the tangent space.

SLIDE 18

Riemannian gradient

Example: Given a symmetric n × n matrix A, consider the trace optimization problem

min_{X ∈ St(n,k)} trace(XᵀAX).

Study the first-order perturbation:

trace((X + H)ᵀA(X + H)) − trace(XᵀAX) = trace(HᵀAX) + trace(XᵀAH) + O(∥H∥²) = 2⟨H, AX⟩ + O(∥H∥²).

The Euclidean gradient at X is therefore 2AX. Note that skew(W) = (W − Wᵀ)/2 is the orthogonal projection onto skew-symmetric matrices. Thus,

PX(Z) = (I − XXᵀ)Z + X · skew(XᵀZ),
grad f(X) = PX(∇f(X)) = 2(I − XXᵀ)AX + 2X · skew(XᵀAX) = 2(AX − X XᵀAX),

using that skew(XᵀAX) = 0 since A is symmetric.

SLIDE 19

Riemannian gradient

Example: For A ∈ Mk consider the SVD A = UΣVᵀ with Σ ∈ R^{k×k}. Define the orthogonal projections onto span(U), span(V), and their complements:

PU = UUᵀ,  PU⊥ = I − UUᵀ,  PV = VVᵀ,  PV⊥ = I − VVᵀ.

Recall that

TA Mk = {UMVᵀ + Up Vᵀ + U Vpᵀ : M ∈ R^{k×k}, UpᵀU = VpᵀV = 0}.

The three terms of the sum are orthogonal to each other and can thus be considered separately. The orthogonal projection onto TA Mk is given by

PA(Z) = PU Z PV + PU⊥ Z PV + PU Z PV⊥.

EFY. Compute the Riemannian gradient of f(A) = ∥A − B∥²_F on Mk for given B ∈ R^{m×n}.
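A minimal NumPy sketch of this projection (illustrative, assuming the SVD factors U, V of A are at hand); expanding PA(Z) gives the equivalent form PU Z + Z PV − PU Z PV used below:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, k = 7, 6, 2

A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U, V = U[:, :k], Vt[:k, :].T

def proj_tangent(Z, U, V):
    """Orthogonal projection of Z onto T_A M_k: P_U Z + Z P_V - P_U Z P_V."""
    return U @ (U.T @ Z) + (Z @ V) @ V.T - U @ (U.T @ Z @ V) @ V.T

Z = rng.standard_normal((m, n))
P = proj_tangent(Z, U, V)

assert np.allclose(proj_tangent(P, U, V), P)   # projection is idempotent
# the doubly-orthogonal part P_U^perp P P_V^perp vanishes for tangent vectors
assert np.allclose((np.eye(m) - U @ U.T) @ P @ (np.eye(n) - V @ V.T), 0)
```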

SLIDE 20

Line search: Concepts from Euclidean case

min_{x ∈ R^N} f(x)

Line search is an optimization algorithm of the form xj+1 = xj + αjηj with search direction ηj and step size αj > 0.

◮ First-order optimal choice of ηj: ηj = −∇f(xj), i.e., gradient descent. Motivation for other choices: faster local convergence (Newton-type methods), exact gradient computation too expensive, ...
◮ Gradient-related search directions: ⟨ηj, ∇f(xj)⟩ < δ < 0 for all j.

SLIDE 21

Line search: Concepts from Euclidean case

◮ Exact line search chooses

αj = arg min_α f(xj + αηj).

Only in exceptional cases a simple optimization problem, e.g., admitting a closed-form solution.

EFY. Derive the closed-form solution for exact line search applied to min_{x ∈ R^n} ½xᵀAx + bᵀx for symmetric positive definite A ∈ R^{n×n} and b ∈ R^n.

◮ Alternative: Armijo rule. Let β ∈ (0, 1) (typically β = 1/2) and c ∈ (0, 1) (e.g., c = 10⁻⁴) be fixed parameters. Determine the largest αj ∈ {1, β, β², β³, ...} such that

f(xj + αjηj) − f(xj) ≤ c αj ∇f(xj)ᵀηj

holds. (Such an αj always exists provided that ηj is a descent direction, i.e., when ⟨ηj, ∇f(xj)⟩ < 0.)

More details in [J. Nocedal and S. J. Wright. Numerical Optimization. Second edition. Springer Series in Operations Research and Financial Engineering. Springer, 2006].
SLIDE 22

Line search: Extension to manifolds

min_{x ∈ M} f(x)

Cannot use the line search xj+1 = xj + αjηj, simply because addition is not well defined on M. Idea: Search along a smooth curve γ(α) ∈ M with γ(0) = xj and γ′(0) = ηj ∈ TxjM. Step in the direction xj + αηj ∈ xj + TxjM and go back to the manifold via a retraction.
SLIDE 23

Retraction

Tangent bundle: TM := {(x, ξ) : x ∈ M, ξ ∈ TxM}.

Definition
A mapping R : TM → M is called a retraction on M if for every x0 ∈ M there exists a neighborhood U around (x0, 0) ∈ TM such that:

1. U ⊂ dom(R) and R|U : U → M is smooth.
2. R(x, 0) = x for all (x, 0) ∈ U.
3. DR(x, 0)[0, ξ] = ξ for all (x, ξ) ∈ U.

Will write Rx = R(x, ·) : TxM → M in the following. Intuition behind the definition: Property 2 means a retraction does nothing to elements on the manifold. Property 3 means a retraction preserves the direction of curves. Equivalent characterization: for every tangent vector ξ ∈ TxM, the curve γ : α ↦ Rx(αξ) satisfies γ′(0) = ξ.

EFY. What is a retraction for the manifold of invertible n × n matrices (trick question)?

SLIDE 24

Retraction

Exponential maps are the most natural choice of retraction from a theoretical point of view, but often too expensive or too cumbersome to compute. In practice, for matrix manifolds, retractions are often built from matrix decompositions and metric projections.

Example St(n, k): Given Y ∈ R^{n×k}_* (i.e., rank(Y) = k), the economy-sized QR decomposition Y = XR with XᵀX = Ik and R upper triangular is unique provided that the diagonal elements of R are positive. This defines a diffeomorphism

φ : St(n, k) × triu+(k) → R^{n×k}_*,   φ : (X, R) ↦ XR,

where triu+(k) denotes the set of upper triangular k × k matrices with positive diagonal elements. Applying φ⁻¹ just means computing the QR decomposition. Note that

dim St(n, k) + dim triu+(k) = dim R^{n×k}.
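A NumPy sketch of the resulting QR-based retraction (illustrative; the sign fix enforces a positive diagonal of R, selecting the factor X from the unique decomposition):

```python
import numpy as np

def qr_retraction(X, xi):
    """Retract X + xi back onto St(n, k) via the economy-sized QR decomposition."""
    Q, R = np.linalg.qr(X + xi)
    # Flip column signs so that diag(R) > 0, making the factorization unique
    s = np.sign(np.diag(R))
    s[s == 0] = 1.0
    return Q * s

rng = np.random.default_rng(5)
n, k = 6, 2
X, _ = np.linalg.qr(rng.standard_normal((n, k)))

# Tangent vector at X: xi = (I - X X^T) Z + X skew(X^T Z)
Z = rng.standard_normal((n, k))
xi = Z - X @ (X.T @ Z) + X @ ((X.T @ Z - Z.T @ X) / 2)

Y = qr_retraction(X, 0.1 * xi)
assert np.allclose(Y.T @ Y, np.eye(k))           # back on the manifold
assert np.allclose(qr_retraction(X, 0 * xi), X)  # R_X(0) = X
```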

SLIDE 25

Retraction

Abstract setting: Let M be an embedded submanifold of a vector space V and N a smooth manifold such that dim(M) + dim(N) = dim(V). Assume there is a diffeomorphism φ : M × N → V∗, (x, y) ↦ φ(x, y), for some open subset V∗ of V. Moreover, assume there exists a neutral element id ∈ N such that φ(x, id) = x for all x ∈ M.

Lemma

Under above assumptions, Rx(η) := π1(φ−1(x + η)) is a retraction on M, where π1 is projection onto first component: π1(x, y) = x.

SLIDE 26

Retraction

Proof of lemma. Need to verify the three properties of a retraction.
Property 1: Follows immediately from the assumptions, since Rx(ξ) is defined and smooth for all ξ in a neighborhood of 0 ∈ TxM.
Property 2: Rx(0) = π1(φ⁻¹(x)) = π1(x, id) = x.
Property 3: Differentiating x = π1(φ⁻¹(φ(x, id))), we obtain for any ξ ∈ TxM that

ξ = D(π1 ◦ φ⁻¹)(x)[Dφ(x, id)[ξ, 0]] = D(π1 ◦ φ⁻¹)(x)[ξ] = DRx(0)[ξ].

SLIDE 27

Retraction

For z ∈ V sufficiently close to M, the metric projection is well defined:

PM(z) := arg min_{x ∈ M} ∥z − x∥.

Corollary (Lewis/Malick’2008)
The map Rx(η) := PM(x + η) defines a retraction.

Examples of retractions based on metric projection:

◮ For St(n, k), the polar factor Y(YᵀY)^{−1/2} of Y ∈ R^{n×k}_* defines a retraction.
◮ For the rank-k matrix manifold Mk, the best rank-k approximation Tk defines a retraction.

There are other choices; see [Absil/Oseledets’2015: Low-rank retractions: a survey and new results].

EFY. For all examples discussed so far, develop algorithms that efficiently realize the retraction by exploiting the structure of x + η.
EFY. Find a retraction for the manifold of symmetric rank-k matrices.
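For Mk, the metric-projection retraction is truncation to the k dominant singular values. A plain NumPy sketch (illustrative; it ignores the structure of x + η that a fast implementation would exploit):

```python
import numpy as np

def svd_retraction(Y, k):
    """Metric projection onto M_k: best rank-k approximation via truncated SVD."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(6)
m, n, k = 8, 7, 2

X = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))  # X in M_k
eta = 0.01 * rng.standard_normal((m, n))                       # small step

Y = svd_retraction(X + eta, k)
assert np.linalg.matrix_rank(Y) == k
assert np.allclose(svd_retraction(X, k), X)   # retraction fixes points on M_k
```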

SLIDE 28

Riemannian line search

xj+1 = Rxj(αjηj).

Assumption. The sequence {ηj} is bounded and gradient related:

lim sup_{j→∞} ⟨grad f(xj), ηj⟩ < 0.

Canonical choice: ηj = −grad f(xj).

Extension of the Armijo rule. Let β ∈ (0, 1) and c ∈ (0, 1) (e.g., c = 10⁻⁴) be fixed parameters. Determine the largest αj ∈ {1, β, β², β³, ...} such that

f(Rxj(αjηj)) − f(xj) ≤ c αj ⟨grad f(xj), ηj⟩   (1)

holds.

EFY. Show that the Armijo condition (1) can always be satisfied for sufficiently small αj.

SLIDE 29

Riemannian line search

1: for j = 0, 1, 2, ... do
2:   Pick ηj ∈ TxjM such that the sequence {ηj} is gradient-related.
3:   Choose αj ∈ {1, β, β², β³, ...} such that the Armijo condition is satisfied.
4:   Set xj+1 = Rxj(αjηj).
5: end for

Convergence theory in Section 4.3 of [Absil et al.’2008]. We call x∗ ∈ M a critical point of f if grad f(x∗) = 0.

Theorem
Every accumulation point of {xj} is a critical point of the cost function f.

More can be said if the manifold (or at least a level set) is compact.

Corollary
Assume that L = {x ∈ M : f(x) ≤ f(x0)} is compact. Then limj→∞ ∥grad f(xj)∥ = 0.

Note that Mk is not compact, and it is not clear a priori whether L is compact.
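Putting the pieces together, here is a compact Python/NumPy sketch of Riemannian line search with the Armijo rule for min trace(XᵀAX) on St(n, k), using the Riemannian gradient from the earlier trace example and a QR-based retraction. This is an illustration under these choices, not the lecture's reference implementation; for A with eigenvalues 1, ..., 6 and k = 2, the minimum value is 1 + 2 = 3, the sum of the two smallest eigenvalues.

```python
import numpy as np

def gradient_descent_stiefel(A, k, iters=2000, beta=0.5, c=1e-4):
    """Minimize trace(X^T A X) over St(n, k) by Riemannian gradient descent."""
    n = A.shape[0]
    rng = np.random.default_rng(7)
    X, _ = np.linalg.qr(rng.standard_normal((n, k)))
    f = lambda X: np.trace(X.T @ A @ X)
    retract = lambda Y: np.linalg.qr(Y)[0]            # QR retraction
    for _ in range(iters):
        G = 2 * (A @ X - X @ (X.T @ A @ X))           # Riemannian gradient
        eta, alpha = -G, 1.0
        # Armijo backtracking along the retracted curve
        while f(retract(X + alpha * eta)) - f(X) > c * alpha * np.sum(G * eta):
            alpha *= beta
            if alpha < 1e-12:
                break
        X = retract(X + alpha * eta)
    return X

A = np.diag(np.arange(1.0, 7.0))    # eigenvalues 1, ..., 6
X = gradient_descent_stiefel(A, 2)
assert abs(np.trace(X.T @ A @ X) - 3.0) < 1e-3
```

Note that f is invariant under column sign flips, so the sign ambiguity of the QR factor is harmless here.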

SLIDE 30

Application to low-rank matrix and tensor completion

SLIDE 31

Matrix Completion

[Figure: partially observed matrix PΩ A; can we recover A?]

PΩ : R^{m×n} → R^{m×n},   (PΩ X)ij = Xij if (i, j) ∈ Ω, and 0 otherwise.

Applications: image reconstruction, image inpainting, Netflix problem.

Low-rank matrix completion:

min_X rank(X),   X ∈ R^{m×n},   subject to PΩ X = PΩ A
SLIDE 32

Low-rank matrix completion (NP-hard):

min_X rank(X),   X ∈ R^{m×n},   subject to PΩ X = PΩ A

Nuclear norm relaxation (convex, but expensive):

min_X ∥X∥∗ = Σi σi(X),   X ∈ R^{m×n},   subject to PΩ X = PΩ A

Robust low-rank completion (assume rank is known):

min_X ½∥PΩ X − PΩ A∥²_F,   X ∈ R^{m×n},   subject to rank(X) = k

Huge body of work! Overview: http://perception.csl.illinois.edu/matrix-rank/

SLIDE 33

Setting

minimize f(X) := ½∥PΩ(X − A)∥²_F subject to X ∈ Mk := {X ∈ R^{m×n} : rank(X) = k}

PΩ : R^{m×n} → R^{m×n},   Xij ↦ Xij if (i, j) ∈ Ω, and 0 if (i, j) ∉ Ω.

Riemannian gradient given by

grad f(X) = P_{TX Mk}(PΩ(X − A))

with orthogonal projection P_{TX Mk} : R^{m×n} → TX Mk.
SLIDE 34

Geometric nonlinear CG for matrix completion

Input: Initial guess X0 ∈ Mk.
η0 ← −grad f(X0)
α0 ← arg min_α f(X0 + αη0)
X1 ← RX0(α0η0)
for i = 1, 2, ... do
  Compute gradient: ξi ← grad f(Xi)
  Conjugate direction by PR+ updating rule: ηi ← −ξi + βi TXi−1→Xi(ηi−1)
  Initial step size from linearized line search: αi ← arg min_α f(Xi + αηi)
  Armijo backtracking for sufficient decrease: find smallest integer m ≥ 0 such that f(Xi) − f(RXi(2⁻ᵐαiηi)) ≥ −10⁻⁴ ⟨ξi, 2⁻ᵐαiηi⟩
  Obtain next iterate: Xi+1 ← RXi(2⁻ᵐαiηi)
end for

Cost per iteration: O((m + n)k² + |Ω|k) ops.
SLIDE 35

Vector transport

The conjugate gradient method requires combining gradients from subsequent iterates:

grad f(X) ∈ TX Mk,  grad f(Y) ∈ TY Mk  ⇒  grad f(X) + grad f(Y)???

Can be addressed by a vector transport TX→Y : TX Mk → TY Mk, here

TX→Y(ξ) = P_{TY Mk}(ξ).

[Figure: transport of ξ ∈ TX Mk into TY Mk along Mk.]

Can be implemented in O((m + n)k²) ops.
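In code, this vector transport is just another tangent-space projection. A NumPy sketch (reusing the projection formula from the matrix case; purely illustrative, not the O((m + n)k²) factored implementation):

```python
import numpy as np

def proj_tangent(Z, X, k):
    """Project Z onto the tangent space T_X M_k using the SVD factors of X."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    U, V = U[:, :k], Vt[:k, :].T
    return U @ (U.T @ Z) + (Z @ V) @ V.T - U @ (U.T @ Z @ V) @ V.T

rng = np.random.default_rng(8)
m, n, k = 6, 5, 2
X = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
Y = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))

xi = proj_tangent(rng.standard_normal((m, n)), X, k)   # xi in T_X M_k
transported = proj_tangent(xi, Y, k)                   # T_{X->Y}(xi) in T_Y M_k

# the transported vector is a fixed point of the projection at Y
assert np.allclose(proj_tangent(transported, Y, k), transported)
```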

SLIDE 36

Numerical experiments

◮ Comparison to LMAFit [Wen/Yin/Zhang’2010], http://lmafit.blogs.rice.edu/.
◮ Oversampling factor OS = |Ω|/(k(2n − k)).
◮ Purely academic example: A = AL ARᵀ with AL, AR = randn.
SLIDE 37

Influence of n

[Figure: relative residual vs. iteration; convergence curves for k = 40, OS = 3, and n = 1000, 2000, 4000, 8000, 16000, 32000.]

◮ Dashed lines: LMAFit. Solid lines: nonlinear CG.
◮ time(1 iteration of nonlinear CG) ≈ 2 × time(1 iteration of LMAFit)
SLIDE 38

Influence of rank

[Figure: relative residual vs. iteration; convergence curves for n = 8000, OS = 3, and k = 10, 20, 30, 40, 50, 60.]

◮ Dashed lines: LMAFit. Solid lines: nonlinear CG.
◮ time(1 iteration of nonlinear CG) ≈ 2 × time(1 iteration of LMAFit)
SLIDE 39

Numerical experiments

◮ Comparison to LMAFit [Wen/Yin/Zhang’2010], http://lmafit.blogs.rice.edu/.
◮ Oversampling factor OS = |Ω|/(k(2n − k)) = 8.
◮ 8000 × 8000 matrix A obtained from evaluating f(x, y) = 1/(1 + |x − y|²) on [0, 1] × [0, 1].
SLIDE 40

Influence of rank

◮ Hom: Start with k = 1 and subsequently increase k, using the previous result as initial guess.
SLIDE 41

Tensor Completion

Low-rank tensor completion:

min_X rank(X),   X ∈ R^{n1×n2×...×nd},   subject to PΩ X = PΩ A

Applications:

◮ completion of multidimensional data, e.g., hyperspectral images, CT scans;
◮ compression of multivariate functions with singularities;
◮ ...

SLIDE 42

Manifold of Tensors of fixed multilinear rank

Mk := {X ∈ R^{n1×...×nd} : rank(X) = k},

dim(Mk) = ∏_{j=1}^d kj + Σ_{i=1}^d (ki ni − ki²).

◮ Mk is a smooth manifold. Discussed for more general formats in [Holtz/Rohwedder/Schneider’2012], [Uschmajew/Vandereycken’2012].
◮ Riemannian with the metric induced by the standard inner product ⟨X, Y⟩ = ⟨X(1), Y(1)⟩ (sum of element-wise products).

Manifold structure used in

◮ dynamical low-rank approximation [Koch/Lubich’2010], [Arnold/Jahnke’2012], [Lubich/Rohwedder/Schneider/Vandereycken’2012], [Khoromskij/Oseledets/Schneider’2012], ...
◮ best multilinear approximation [Eldén/Savas’2009], [Ishteva/Absil/Van Huffel/De Lathauwer’2011], [Curtef/Dirr/Helmke’2012]
SLIDE 43

Gradients and Tangent Space TXMk

Every ξ in the tangent space TX Mk at X = C ×1 U ×2 V ×3 W can be written as

ξ = S ×1 U ×2 V ×3 W + C ×1 U⊥ ×2 V ×3 W + C ×1 U ×2 V⊥ ×3 W + C ×1 U ×2 V ×3 W⊥

for some S ∈ R^{k1×k2×k3} and U⊥ ∈ R^{n1×k1} with U⊥ᵀU = 0, ...

Again, we obtain the Riemannian gradient of the objective function f(X) := ½∥PΩ X − PΩ A∥²_F by projecting the Euclidean gradient into the tangent space:

grad f(X) = P_{TX Mk}(PΩ X − PΩ A)
SLIDE 44

Retraction

Candidate for retraction: metric projection

RX(ξ) = PMk(X + ξ) = arg min_{Z ∈ Mk} ∥X + ξ − Z∥.

No closed-form solution available.

◮ Replaced by HOSVD truncation.
◮ Seems to work fine.
◮ HOSVD truncation is a retraction [Kressner/Steinlechner/Vandereycken’2013].

SLIDE 45

Geometric Nonlinear CG for Tensor Completion

Input: Initial guess X0 ∈ Mk.
η0 ← −grad f(X0)
α0 ← arg min_α f(X0 + αη0)
X1 ← RX0(α0η0)
for i = 1, 2, ... do
  Compute gradient: ξi ← grad f(Xi)
  Conjugate direction by PR+ updating rule: ηi ← −ξi + βi TXi−1→Xi(ηi−1)
  Initial step size from linearized line search: αi ← arg min_α f(Xi + αηi)
  Armijo backtracking for sufficient decrease: find smallest integer m ≥ 0 such that f(Xi) − f(RXi(2⁻ᵐαiηi)) ≥ −10⁻⁴ ⟨ξi, 2⁻ᵐαiηi⟩
  Obtain next iterate: Xi+1 ← RXi(2⁻ᵐαiηi)
end for

Cost per iteration: O(nk^d + |Ω|k^{d−1}) ops.

SLIDE 46

Reconstruction of CT Scan

199 × 199 × 150 tensor from the scaled CT data set “INCISIX” (taken from the OSIRIX MRI/CT data base [www.osirix-viewer.com/datasets/]).

[Figure: slice of original tensor; HOSVD approximation of rank 21; sampled tensor (6.7%); low-rank completion of rank 21.]

Compares very well with existing results w.r.t. low-rank recovery and speed, e.g., [Gandy/Recht/Yamada’2011].

SLIDE 47

Hyperspectral Image

Set of photographs (204 × 268 px) taken across a large range of wavelengths; 33 samples from ultraviolet to infrared [image data: Foster et al.’2004]. Stacked into a tensor of size 204 × 268 × 33.

[Figure: 16th slice of the original hyperspectral image tensor (size [204, 268, 33], 10% sampled) and of the completed tensor (final rank k = [50 50 6]).]

Here: only 10% of the entries known; [Signoretti et al.’2011] use 50%.

SLIDE 48

How many samples do we need?

Matrix case: O(n · logβ n) samples suffice!

[Candès/Tao’2009]

⇒ Completion of tensor by applying matrix completion to matricization: O(n2 log(n)). Gives upper bound! Tensor case: Certainly: |Ω| ≪ O(n2) In all cases of convergence exact reconstruction. Conjecture: |Ω| = O(n · logβ n)

4 4.5 5 5.5 6 6.5 7 8.5 9 9.5 10 10.5 11 11.5 12 12.5 13 log(n) log( Smallest size of needed to converge ) y = 1.2*x + 4

O(n2) O(n) O(n log n) 48

SLIDE 49

Robustness of Convergence

[Figure: relative error on Ω vs. iteration for several noise levels; noisy completion, n = 100, k = (4, 5, 6).]

◮ Random 100 × 100 × 100 tensor of multilinear rank (4, 5, 6) perturbed by white noise.
◮ Upon convergence: reconstruction up to the noise level.

SLIDE 50

Final remarks on Riemannian low-rank optimization

◮ Only discussed first-order methods. Fine for well-conditioned problems but slow convergence for ill-conditioned problems.
◮ Second-order methods (Newton-like) require the Riemannian Hessian: painful, and
  ◮ not of much help for well-conditioned problems (low-rank matrix completion);
  ◮ the linearized equations are hard to solve efficiently for low-rank matrix and tensor manifolds.
◮ Low-rank matrices/tensors can also be viewed as quotients of product manifolds. Requires a careful choice of metric to stay robust w.r.t. the small singular value σk [Ngo/Saad’2012], [Kasai/Mishra, ICML’2016].
◮ Lots of open problems concerning the convergence analysis of low-rank Riemannian optimization!