Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement (PowerPoint PPT Presentation)

SLIDE 1

Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement

Tim Davis (Texas A&M University)
with Sanjay Ranka, Mohamed Gadou (University of Florida), and Nuri Yeralan (Microsoft)
NVIDIA GTC 2015, March 2015

SLIDE 2

Outline

Combinatorial scientific computing: math + CS + applications
Multifrontal methods for factorizing sparse matrices

fill-in creates cliques in the graph
cliques connected via the elimination tree
Sparse Cholesky: square cliques, assembled via addition
Sparse LU: rectangular cliques, assembled via addition
Sparse QR: rectangular cliques, assembled via concatenation

GPU kernel design for sparse QR
Bucket scheduler for single-GPU QR
Extended for multi-GPU QR
Performance results

SLIDE 3

Combinatorial Scientific Computing: math + CS + applications

SLIDE 4

Cliques in the graph: the multifrontal method

Cliques + elimination tree = sequence of frontal matrices
Dense factorization within a front; assemble data into the parent
Regular + irregular computation
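
As a rough sketch of this pattern (hypothetical structures fronts, parent, postorder, and npiv; not the actual SuiteSparseQR code), a MATLAB-style loop over the elimination tree might look like:

function keep = multifrontal_qr_sketch (fronts, parent, postorder, npiv)
% Hedged sketch of the multifrontal pattern above (hypothetical data
% structures and names; not the SuiteSparseQR implementation).
%   fronts{f}  - dense frontal matrix of front f (its rows of A, plus room
%                for the children's contribution blocks)
%   parent(f)  - parent of front f in the elimination tree (0 for a root)
%   postorder  - front indices ordered so that children precede parents
%   npiv(f)    - number of pivotal columns eliminated in front f
keep = cell (size (fronts)) ;
for f = postorder (:)'
    [~, R] = qr (fronts{f}) ;        % dense QR within the front: regular numerics
    k = npiv (f) ;
    keep{f} = R (1:k, :) ;           % finished rows of R for this front
    C = R (k+1:end, k+1:end) ;       % contribution block passed up the tree
    p = parent (f) ;
    if (p ~= 0 && ~isempty (C))
        % irregular data movement: for QR the block is concatenated onto the
        % parent front (for LU / Cholesky it would be summed into it);
        % column alignment between child and parent is ignored in this sketch
        fronts{p} = [fronts{p} ; C] ;
    end
end
end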

SLIDE 5

UMFPACK: unsymmetric multifrontal method

Frontal matrices become rectangular
Assemble data into ancestors, not just parents

SLIDE 6

UMFPACK: unsymmetric multifrontal method

Key results / impact
high performance via dense matrix kernels within each front
symbolic preordering and analysis, followed by revised local pivot search with approximate unsymmetric degree update
widely used:

sparse LU and x=A\b in MATLAB
Mathematica
IBM circuit simulation application
finite-element solvers: NASTRAN, FEniCS, ...
Berkeley Design Automation: circuit simulation
CVXOPT
...

SLIDE 7

SuiteSparseQR: multifrontal sparse QR factorization

Key results / impact
rectangular fronts like UMFPACK, but simpler frontal matrix assembly (concatenation, not summation) (Duff, Puglisi)
rank approximation (Heath, extended to the multifrontal case)
multicore parallelism

on a multicore CPU (70 Gflops theoretical peak): up to 30 Gflops

sparse QR in MATLAB, and x=A\b
today's talk: the GPU algorithm

novel “Bucket QR” scheduler and custom GPU kernels
up to 150 Gflops on one Kepler K20c, 286 Gflops on 4 Tesla C2070s
up to 28x speedup vs the CPU algorithm (10x typical for large problems)
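
Since the slide notes that SuiteSparseQR backs MATLAB's sparse qr and x=A\b, here is a minimal usage sketch (the random test matrix is made up; exact behavior depends on the MATLAB version):

% Minimal sketch: a sparse least-squares solve, which MATLAB routes through
% SuiteSparseQR for sparse rectangular A.
A = sprandn (1000, 400, 0.01) + ...
    sparse (1:400, 1:400, 1, 1000, 400) ;   % nudge A toward full column rank
b = randn (1000, 1) ;

x1 = A \ b ;                 % least-squares solve via sparse QR

[C, R, p] = qr (A, b, 0) ;   % Q-less economy QR: C = Q'*b, column permutation p
x2 = zeros (400, 1) ;
x2 (p) = R \ C ;             % same least-squares solution, assembled by hand
% x1 and x2 agree up to rounding (and the conditioning of A)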

SLIDE 8

SuiteSparseQR

A column elimination tree and its supernodes

SLIDE 9

SuiteSparseQR

Frontal matrix assembly

SLIDE 10

SuiteSparseQR

concatenation, resulting in a staircase matrix

SLIDE 11

SuiteSparseQR

SLIDE 12

Multifrontal factorization and assembly

Prior methods

one front at a time on the GPU

assembly on the CPU
panel factorization on the CPU, applied on the GPU

Our multifrontal QR

many fronts on the GPU (an entire subtree)
assembly on the GPU: data concatenation, not summation (sketched below)
entire dense QR of each front on the GPU
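
A toy MATLAB illustration (made-up numbers, not SuiteSparseQR code) of concatenation-based assembly versus summation-based assembly:

% Toy illustration: assembling two children's contribution blocks C1, C2 and
% the parent's own rows Ap into a parent front.
C1 = [1 2 ; 0 3] ;          % contribution block from child 1
C2 = [4 5 ; 0 6] ;          % contribution block from child 2
Ap = [7 8 ; 9 10] ;         % rows of A owned by the parent front

% Sparse QR: concatenation only; rows are stacked, with no numerical addition.
% After sorting rows by the column of their leading entry, this gives the
% "staircase" front of the earlier slide.
F_qr = [Ap ; C1 ; C2] ;

% Sparse LU / Cholesky: contribution blocks are summed into overlapping
% positions of the parent front (a scatter-add), shown here with full overlap.
F_lu = Ap + C1 + C2 ;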

SLIDE 13

Consider a subtree of frontal matrices on the GPU

SLIDE 14

Expanded to show GPU kernel launches

SLIDES 15-24

Bucket QR factorization

(Sequence of figures stepping through the Bucket QR scheduler; only the slide title is preserved here.)

SLIDE 25

GPU kernels for Bucket QR

Bucket QR requires two kernels on the GPU:

QR factorization of a t-by-1 tile, t = 1, 2, or 3
creates a block Householder
details on next slides

Apply a block Householder:

A = A − V (T′ (V′ A))
A is t-by-s, where s can be large
thread-block iterates over 2 column blocks at a time (details omitted)
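
In plain MATLAB terms the apply step is simply the following (a reference sketch of the arithmetic, not the GPU kernel):

function A = apply_block_householder (V, T, A)
% Reference sketch of the block-Householder apply from the slide:
% A = A - V*(T'*(V'*A)).
%   V - Householder vectors from one tile QR (unit lower trapezoidal)
%   T - small triangular factor of the block reflector
%   A - the t-by-s panel being updated (s can be large)
A = A - V * (T' * (V' * A)) ;
end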

SLIDE 26

GPU kernel: Block Householder QR

Block Householder: QR = A, where Q = (I − V T' V')' and R is upper triangular

[m n] = size (A)
for k = 1:n
    [tau, v] = house (A (k:m,k))
    A (k:m,k:n) = A (k:m,k:n) - v * (tau * (v' * A (k:m,k:n)))
    V (k:m,k) = v ; Tau (k) = tau
end
T = zeros (n)
for k = 1:n
    tau = Tau (k) ; v = V (k:m,k)
    z = - tau * v' * V (k:m,1:k-1)
    T (1:k-1,k) = T (1:k-1,1:k-1) * z'
    T (k,k) = tau
end
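
The house function called above is not shown on the slides; one common LAPACK-style construction (an assumption about the convention, with v(1) normalized to 1; the kernel's actual normalization may differ) is:

function [tau, v] = house (x)
% Hedged sketch of a Householder generator: returns tau and v with v(1) = 1
% such that (I - tau*v*v') * x = beta * e1.
v = x ;
sigma = norm (x (2:end))^2 ;
if (sigma == 0)
    tau = 0 ; v (1) = 1 ;                  % x is already [alpha; 0; ...]
else
    alpha = x (1) ;
    beta = -sign (alpha) * sqrt (alpha^2 + sigma) ;
    if (alpha == 0)
        beta = -sqrt (sigma) ;             % sign(0) is 0 in MATLAB
    end
    tau = (beta - alpha) / beta ;
    v = x / (alpha - beta) ;
    v (1) = 1 ;
end
end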

SLIDE 27

GPU kernel: Block Householder QR

Householder update

A (k:m,k:n) = A (k:m,k:n) - ...
    v * (tau * (v' * A (k:m,k:n)))

Construct T

z = - tau * v' * V (k:m,1:k-1)
T (1:k-1,k) = T (1:k-1,1:k-1) * z'
T (k,k) = tau

SLIDE 28

GPU kernel: Block Householder QR

Towards a GPU kernel

Overwrite tril(A,-1) with V, and fold in the construction of T.

[m n] = size (A)
T = zeros (n)
for k = 1:n
    [tau, v] = house (A (k:m,k))
    A (k:m,k:n) = A (k:m,k:n) - ...
        v * (tau * (v' * A (k:m,k:n)))
    V1 (k) = v (1)
    A (k+1:m,k) = v (2:end)
    z = - tau * v' * A (k:m,1:k-1)
    T (1:k-1,k) = T (1:k-1,1:k-1) * z'
    T (k,k) = tau
end

SLIDE 29

GPU kernel: Block Householder QR

The GPU kernel updates A and constructs T in parallel:

A (k:m,k:n) = A (k:m,k:n) - ...
    v * (tau * (v' * A (k:m,k:n)))
z = - tau * (v' * A (k:m,1:k-1))
T (1:k-1,k) = T (1:k-1,1:k-1) * z'
T (k,k) = tau

becomes

z = -tau * v' * A (k:m,:)
A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)
T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)'
T (k,k) = tau

SLIDE 30

GPU kernel: Block Householder QR

z = -tau * v' * A (k:m,:)
A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)
T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)'
T (k,k) = tau

SLIDE 31

GPU kernel: Block Householder QR

The GPU kernel: thread-level parallelism

z = -tau * v' * A (k:m,:)
A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)
T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)'
T (k,k) = tau

A is 96-by-32, held in registers during the factorization and written to global memory (as V and R) at the end
each thread owns an 8-by-1 “bitty block” of A
v is 96-by-1, in shared memory
z is 32-by-1, in shared memory; requires a 12-by-32 shared space for the v' * A (k:m,:) reduction
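
To make the reduction layout concrete, here is a hedged MATLAB emulation of how the 12-by-32 reduction space is filled and summed (each of the 96/8 = 12 thread rows contributes one partial dot product per column); this mimics the data flow only, not the CUDA code, and the test data is made up:

% Hedged emulation of the z = -tau * v' * A(k:m,:) reduction for a 96-by-32
% tile: thread (i,j) owns an 8-by-1 "bitty block" of column j and writes one
% partial dot product into a 12-by-32 reduction space, which is then summed
% (on the GPU, by warp zero).
m = 96 ; n = 32 ;
A = randn (m, n) ; v = randn (m, 1) ; tau = 0.5 ; k = 1 ;   % made-up data

partial = zeros (12, n) ;               % the 12-by-32 shared reduction space
for i = 1:12                            % one "thread row" of 8 entries each
    rows = (i-1)*8 + (1:8) ;
    rows = rows (rows >= k) ;           % only rows k:m take part at step k
    if (~isempty (rows))
        partial (i, :) = v (rows)' * A (rows, :) ;   % 8-by-1 dot per column
    end
end
z = -tau * sum (partial, 1) ;           % warp-zero style sum of the partials

% sanity check against the direct formula
assert (norm (z - (-tau * v (k:m)' * A (k:m, :))) < 1e-10)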

SLIDE 32

GPU kernel: Block Householder QR

Putting it all together. At the kth step:

threads that own column k write that column of A to shared memory
thread zero computes the Householder coefficients

z = -tau * v' * A (k:m,:)

each thread computes its 8-by-1 dot product in parallel
writes the scalar result into the 12-by-32 reduction space
warp zero sums the reduction space to get z

A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)

only done by threads that own columns k:n
threads that own column k+1 compute the norm of that column of A, for the next Householder coefficient, saving the result in a 12-by-1 reduction space

T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)'

only done by threads 1:k-1
thread zero sums up the reduction space for the norm of column k+1

SLIDE 33

GPU kernel: Block Householder QR

z = -tau * v' * A (k:m,:)

each thread computes its 8-by-1 dot product in parallel
writes the scalar result into the 12-by-32 reduction space
warp zero sums the reduction space to get z

SLIDE 34

GPU kernel: Block Householder QR

A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)

only done by threads that own columns k:n

threads that own column k+1 compute the norm of that column of A, for the next Householder coefficient, saving the result in a 12-by-1 reduction space

SLIDE 35

GPU kernel: Block Householder QR

T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)'

only done by threads 1:k-1
SLIDE 36

Single GPU performance results

Putting it all together ... performance results

                                  Fermi        K20          K40
GPU kernels:
  apply block Householder         183 Gflops   260 Gflops   360 Gflops
  factorize 3 tiles                27 Gflops    20 Gflops
dense QR for large front          107 Gflops   120 Gflops
  ('bare metal' flops)            154 Gflops   172 Gflops
sparse QR on GPU                   80 Gflops   150 Gflops
peak speedup over CPU              11x          20x
typical speedup over CPU            5x          10x

SLIDE 37

GPU Kernel pitfalls

What we'll do differently in our kernel design:

Householder block-apply using too much shared memory

uberkernel approach:
each thread block determines what to do from a task list (QR, apply, assemble)
pros: single large kernel launch, lots of parallelism
con: all tasks use the same thread geometry
con: QR of a panel needs higher occupancy to hide scalar work

to do:
block apply kernel needs to stage A = A − WYA by not keeping W and Y in shared memory
split the uberkernel so the QR panel can have higher occupancy

SLIDE 38

Single GPU performance on many matrices

SLIDE 39

Multiple GPUs: decomposing the tree

SLIDE 40

Multiple GPUs: Performance Results

Results on NVIDIA Tesla C2070 GPUs

problem        CPU     1 GPU   2 GPUs  4 GPUs  speedup  speedup
               GFlop   GFlop   GFlop   GFlop   vs CPU   vs 1 GPU
1500:2D         6.1     16.0    27.1    38.4     6.3      2.4
2000:2D         6.9     21.0    37.8    56.7     8.2      2.7
3000:2D         7.8     25.8    44.8    73.7     9.4      2.9
lp nug20       23.9     74.3    86.4    66.1     2.8      0.9
ch7-8-b3       25.3    104.0   111.3   173.7     6.9      1.7
ch7-8-b3:8     10.0     88.0   160.4   286.2    28.6      3.3

SLIDE 41

Multiple GPUs: for large fronts

SLIDES 42-69

Multiple GPUs: bucket scheduler on the large scale

(Sequence of figures stepping through the multi-GPU bucket scheduler; only the slide title is preserved here.)

SLIDE 70

Acknowledgements: National Science Foundation, NVIDIA, Texas A&M University

SLIDE 71

Summary: Sparse QR on GPUs

Fronts live and die on the GPU, which reduces CPU-GPU traffic
Bucket scheduler: extends the Communication-Avoiding QR method
Single GPU: speedup of 5x to 20x on one GPU
Multi-GPU prototype: speedup of over 3x on 4 GPUs
Code: SuiteSparse.com and developer.nvidia.com/cholmod
SuiteSparse logo, and music to art via math: NotesArtStudio.com