Optimized Polar Decomposition for Modern Computer Architectures
Pierre Blanchard (Pierre.Blanchard@manchester.ac.uk)
School of Mathematics, The University of Manchester
NLA Group Meeting, Manchester, UK. October 2, 2018
Joint work with
Introduction
Context
- NLAFET (H2020 Project), ending April 2019: parallel numerical linear algebra for future extreme-scale systems.
- Standard SVD solvers reduce to bidiagonal form, then apply an iterative scheme (D&C or QR).
Polar Decomposition
The Polar Decomposition: for any A ∈ C^{m×n} with m ≥ n, there exists a decomposition A = UH, where U ∈ C^{m×n} has orthonormal columns and H ∈ C^{n×n} is Hermitian positive semidefinite; the decomposition is unique when A has full rank.
Reverse or left PD: A = H_ℓ U, with H_ℓ = U H U^* ∈ C^{m×m}.
Relation with SVD: if A = W Σ V^*, then U = W V^* and H = V Σ V^*.
Polar Decomposition
Applications
Matrix square root: if A = L L^T (Cholesky) and L^T = UH (PD), then A = L L^T = H U^T U H = H^2, so
H = A^{1/2}
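This route to the square root is easy to check numerically. The sketch below uses SciPy's `polar()` as a stand-in for an optimized QDWH kernel (all names are mine; `polar` and `sqrtm` come from `scipy.linalg`):

```python
import numpy as np
from scipy.linalg import cholesky, polar, sqrtm

rng = np.random.default_rng(0)
B = rng.standard_normal((50, 50))
A = B @ B.T + 50 * np.eye(50)        # symmetric positive definite

L = cholesky(A, lower=True)          # A = L L^T
_, H = polar(L.T)                    # polar decomposition L^T = U H

# H is the principal square root of A: A = H U^T U H = H^2
err = np.linalg.norm(H @ H - A) / np.linalg.norm(A)
```

Because A is well conditioned here, the polar factor H agrees with `sqrtm(A)` to working precision.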
Polar Decomposition
Application to Matrix Decompositions
If A = UH (PD) and H = V Σ V^T (symmetric EVD), then
A = W Σ V^T (SVD) with W = UV.
Polar Decomposition
Algorithms: root-finding iterations acting on the singular values, driving them to 1 (e.g. the Newton and Padé families).
See [Nakatsukasa and Freund, 2014] or [Higham, 2008, Ch. 5].
State of the art implementation
Other implementations
Polar Factor: U = lim_{k→∞} X_k
QDWH iterations: [U] = qdwh(A, α, β, ε)

    X_0 = A/α, ℓ_0 = β/α
    k = 0
    while |1 − ℓ_k| > ε
        a_k = h(ℓ_k), b_k = (a_k − 1)^2/4, c_k = a_k + b_k − 1
        [Q_1; Q_2] R = qr([√(c_k) X_k; I_n])
        X_{k+1} = (b_k/c_k) X_k + (1/√(c_k)) (a_k − b_k/c_k) Q_1 Q_2^H
        ℓ_{k+1} = ℓ_k (a_k + b_k ℓ_k^2)/(1 + c_k ℓ_k^2)
        k = k + 1
    end
    U = X_k
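A compact NumPy sketch of these iterations (function and variable names are mine; the dynamic weight formula h(ℓ) follows Nakatsukasa and Higham, and α, β are assumed to bound the extreme singular values):

```python
import numpy as np

def qdwh(A, alpha, beta, eps=1e-12, maxit=20):
    """QDWH iteration for the polar factor U of A (sketch).

    Assumes alpha >= ||A||_2 and 0 < beta <= sigma_min(A).
    """
    m, n = A.shape
    X = A / alpha
    l = beta / alpha
    k = 0
    while abs(1 - l) > eps and k < maxit:
        l = min(l, 1.0)                       # guard against rounding overshoot
        d = (4 * (1 - l**2) / l**4) ** (1 / 3)
        a = np.sqrt(1 + d) + 0.5 * np.sqrt(
            8 - 4 * d + 8 * (2 - l**2) / (l**2 * np.sqrt(1 + d)))
        b = (a - 1) ** 2 / 4
        c = a + b - 1
        # QR-based step: [Q1; Q2] R = qr([sqrt(c) X; I_n])
        Q, _ = np.linalg.qr(np.vstack([np.sqrt(c) * X, np.eye(n)]))
        Q1, Q2 = Q[:m], Q[m:]
        X = (b / c) * X + (a - b / c) / np.sqrt(c) * Q1 @ Q2.conj().T
        l = l * (a + b * l**2) / (1 + c * l**2)
        k += 1
    return X

# Usage: bounds taken from the exact singular values, for simplicity
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
s = np.linalg.svd(A, compute_uv=False)
U = qdwh(A, alpha=s[0], beta=s[-1])
```

In practice α and β are only estimated (see the parameter-estimation slide); exact values are used here just to exercise the iteration.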
Convergence of QDWH iterations
Number of iterations depends on the conditioning: κ_2(A) = 1/ℓ_0.
The singular values follow the scalar map
    σ_i(X_{k+1}) = σ_i(X_k) (a_k + b_k σ_i(X_k)^2) / (1 + c_k σ_i(X_k)^2),
which drives them to 1 with (asymptotically cubic) convergence.
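The scalar map can be tried directly. The snippet below uses the fixed Halley weights a = 3, b = 1, c = 3 (the unweighted special case; QDWH recomputes a_k, b_k, c_k from ℓ_k for faster convergence):

```python
# Scalar view of the QDWH update applied to one singular value:
#   x -> x (a + b x^2) / (1 + c x^2)
a, b, c = 3.0, 1.0, 3.0   # fixed Halley weights
x = 0.05                  # a small singular value of X_0
for _ in range(8):
    x = x * (a + b * x**2) / (1 + c * x**2)
# x is now 1 to machine precision
```

Even with fixed weights, the cubic convergence near 1 is visible after the first few steps.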
Convergence of QDWH iterations
Estimating the parameters α = ‖A‖_2 and ℓ_0 = β/α, with β = 1/‖A^{-1}‖_2:
- α ≈ ‖A‖_F (cheap upper bound on ‖A‖_2),
- ℓ_0 ≈ ‖X_0‖_1/(√n κ_1(X_0)), with κ_1(X_0) obtained from a condition estimator using QR + triangular solves ((5/3)n^3 flops).
Optimized QR iteration [Nakatsukasa and Higham, 2013]: exploiting the trailing identity block in [√(c_k) X_k; I_n] reduces the cost of a QR-based iteration from (8 + 2/3)n^3 to 5n^3.
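These estimates are easy to mimic (a sketch for square A, names mine; `np.linalg.cond(X0, 1)` is exact and O(n^3), where production code would use LAPACK's cheap 1-norm condition estimator):

```python
import numpy as np

def qdwh_parameters(A):
    """Estimate alpha ~ ||A||_2 and l0 = beta/alpha for QDWH (sketch, square A).

    alpha: Frobenius norm, an upper bound on ||A||_2.
    l0: the slide's formula ||X0||_1 / (sqrt(n) * kappa_1(X0)).
    """
    n = A.shape[1]
    alpha = np.linalg.norm(A, 'fro')
    X0 = A / alpha
    l0 = np.linalg.norm(X0, 1) / (np.sqrt(n) * np.linalg.cond(X0, 1))
    return alpha, l0

rng = np.random.default_rng(4)
A = rng.standard_normal((30, 30))
alpha, l0 = qdwh_parameters(A)
```

By the norm equivalence ‖B‖_2 ≤ √n ‖B‖_1, this ℓ_0 is a valid lower bound on σ_min(X_0), which is what the convergence analysis needs.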
Fast Cholesky iterations
Optimized QDWH iterations: [U] = qdwh(A, α, β, ε)

    [. . .]
    if c_k < 100   // PO-based (Cholesky) iteration
        Z = I_n + c_k X_k^* X_k
        W = chol(Z)
        X_{k+1} = (b_k/c_k) X_k + (a_k − b_k/c_k) (X_k W^{-1}) W^{-*}
    else           // QR-based iteration
        [. . .]

Cholesky-based iterations lose stability when c_k is large, hence the switch [Nakatsukasa and Higham, 2012].
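One PO-based step can be sketched with two triangular solves (names mine; `cholesky` and `solve_triangular` from `scipy.linalg`):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def po_iteration(X, a, b, c):
    """One Cholesky-based (PO) QDWH step (sketch).

    X_new = (b/c) X + (a - b/c) (X W^{-1}) W^{-*},
    where Z = I + c X^* X and Z = W^* W (W upper triangular).
    """
    n = X.shape[1]
    Z = np.eye(n) + c * X.conj().T @ X
    W = cholesky(Z, lower=False)                    # Z = W^H W
    Y = solve_triangular(W, X.conj().T, trans='C')  # W^H Y = X^H
    V = solve_triangular(W, Y).conj().T             # V = X W^{-1} W^{-H}
    return (b / c) * X + (a - b / c) * V

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 3))
X1 = po_iteration(X, a=3.0, b=1.0, c=3.0)   # fixed Halley weights
```

Mathematically this is the same rational map X (aI + b X^*X)(I + c X^*X)^{-1} as the QR-based step, at lower cost.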
Matrix Decomposition
QDWH-PD: [U, H] = qdwh-pd(A)
1. U = qdwh(A)
2. H = U^H A          (+2mn^2)
3. H = (H + H^H)/2
QDWH-SVD: [W, Σ, V] = qdwh-svd(A)
1. [U, H] = qdwh-pd(A)
2. [V, Σ] = syev(H)   (+4n^3)
3. W = UV             (+2mn^2)
Additional cost over qdwh-pd shown on the right.
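The QDWH-SVD recipe above can be sketched directly, with `scipy.linalg.polar` standing in for an optimized QDWH kernel (function name is mine):

```python
import numpy as np
from scipy.linalg import eigh, polar

def qdwh_svd(A):
    """SVD from polar decomposition + symmetric EVD (sketch)."""
    U, H = polar(A)              # A = U H, H Hermitian PSD
    lam, V = eigh(H)             # H = V diag(lam) V^T, lam ascending
    idx = np.argsort(lam)[::-1]  # reorder to descending singular values
    sigma, V = lam[idx], V[:, idx]
    W = U @ V                    # left singular vectors: A = W Sigma V^T
    return W, sigma, V

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 6))
W, sigma, V = qdwh_svd(A)
```

Since H = V Σ V^T, A = U H = (UV) Σ V^T; the reordering only permutes Σ and the columns of V (and hence of W) consistently.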
Flops count: QDWH-PD
Overall count¹ of QDWH-PD with m = n. Nature of iterations with respect to condition number:

κ_2(A)               1        10^1-10^2   10^3-10^5   10^6-10^13   10^14-10^16
# QR iterations      1        1           2           2            3
# PO iterations      3        4           3           4            3
flops (/n^3)         23+2/3   28          32+1/3      36+2/3       41
+ optimized QR it.   20+1/3   24+2/3      29          33+1/3       37+2/3

Table 1: number of QR and PO iterations and flops count (/n^3) for QDWH-PD.
¹ Without (5/3)n^3 for estimating ℓ_0, or exploiting the trailing identity matrix structure.
Flops count: QDWH-SVD

                          Σ only                   U, Σ, V
dges(vd/dd)               2+2/3                    22
2-stage-svd               10/3 = 3+1/3             10/3 + 4 = 7+1/3
QDWH-svd (+2-stage-eig)   12 ≤ · ≤ 38              14+2/3 ≤ · ≤ 40+2/3
QDWH-svd (+QDWH-eig)      26+5/9 ≤ · ≤ 52+5/9      27+8/9 ≤ · ≤ 53+8/9

Table 2: floating point operation counts (/n^3) for the SVD.
Flops Count
[Figure: flops count (GFlops) vs. matrix size n (×1,000) for 2-stage-svd, qdwh-svd = qdwh + syev, and qdwh-svd = qdwh + qdwh-eig. Left panel: singular values only; right panel: singular values and vectors.]
Memory footprint
Stored matrices, including the stacked factor [√(c_k) X_k; I_n] ⇒ 4mn + 3n^2 entries.
[Figure: memory footprint (GiB) of QDWH-PD in real double precision vs. number of rows m (×1,000), for m = n, m = 3n, and m = 10n, compared with the memory available on Intel nodes or NVIDIA accelerators.]
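To read the plot in concrete terms, the entry count converts to GiB as follows (a tiny helper, name mine; assumes real double precision, 8 bytes per entry):

```python
def qdwh_memory_gib(m: int, n: int) -> float:
    """QDWH-PD working set from the slide: 4mn + 3n^2 doubles (8 bytes each)."""
    return (4 * m * n + 3 * n * n) * 8 / 2**30

# A square 20,000 x 20,000 problem needs about 21 GiB
footprint = qdwh_memory_gib(20_000, 20_000)
```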
Numerical Experiments
- Real square matrices in double precision
- Computer architectures (Intel)
- Runtime systems
QDWH on runtime systems
[Figure: time (s) vs. matrix size n (×1,000) on Intel Haswell (20 cores), for QUARK qdwh, QUARK qdwh opt., OpenMP qdwh, and OpenMP qdwh opt.]
QDWH-SVD: Sandy Bridge
[Figure: time (s) vs. matrix size n (×1,000) on Sandy Bridge for 2-stage-sdd and qdwh-svd. Left panel: singular values only; right panel: singular values and vectors.]

κ_2      #flops(QDWH-SVD)/#flops(SDD)   Speedup: Sandy Bridge   Haswell   KNL
1        7.8                            1.5                     1.4       1.6
10^16    13                             2.5                     2.3       1.1

Table 3: flop ratio vs. speedup for QDWH-SVD (with vectors) and n = 14,000.
QDWH-SVD: Haswell vs KNL
[Figure: time (s) vs. matrix size n (×1,000) for 2-stage-svd, 2-stage-sdd, and qdwh-svd. Four panels: Haswell and KNL, each for singular values only and for singular values and vectors.]
QDWH-SVD: Plasma vs MKL
[Figure: time (s) vs. matrix size n (×1,000) on Haswell and KNL. Singular values only: dgesvd (Plasma), dgesvd (Lapack), dgesvd (MKL) vs. qdwh-svd (Plasma). Singular values and vectors: dgesdd (Plasma), dgesdd (Lapack), dgesdd (MKL) vs. qdwh-svd (Plasma).]
Ongoing work & Perspectives
- StarPU implementation
- Applications
- Algorithms
References
- [...] ISSN 0196-5204. doi: 10.1137/0907079.
- [...] 1185–1201, 2000. doi: 10.1137/S0895479899356080.
- [...] 2018.
- Y. Nakatsukasa and R. W. Freund. [...] value decompositions. SIAM Rev., to appear, 2014.
- [...] doi: 10.1137/110857544.
- Y. Nakatsukasa and N. J. Higham. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD. SIAM J. Sci. Comput., 35(3):A1325–A1349, 2013. doi: 10.1137/120876605.
- [...] SIAM J. Matrix Anal. Appl., 31(5):2700–2720, 2010. doi: 10.1137/090774999.