SLIDE 1
Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement
Tim Davis (Texas A&M University)
with Sanjay Ranka, Mohamed Gadou (University of Florida) and Nuri Yeralan (Microsoft)
NVIDIA GTC 2015, March 2015
Outline
SLIDE 2
SLIDE 3
Combinatorial Scientific Computing: math + CS + applications
SLIDE 4
Cliques in the graph: the multifrontal method
Cliques + elimination tree = sequence of frontal matrices
Dense factorization within a front; assemble data into the parent
regular + irregular computation
SLIDE 5
UMFPACK: unsymmetric multifrontal method
Frontal matrices become rectangular
Assemble data into ancestors, not just parents
SLIDE 6
UMFPACK: unsymmetric multifrontal method
Key results / impact:
  high performance via dense matrix kernels within each front
  symbolic preordering and analysis, followed by a revised local pivot search with approximate unsymmetric degree update
  widely used:
    sparse LU and x=A\b in MATLAB
    Mathematica
    IBM circuit simulation application
    finite-element solvers: NASTRAN, FEniCS, ...
    Berkeley Design Automation: circuit simulation
    CVXOPT
    ...
SLIDE 7
SuiteSparseQR: multifrontal sparse QR factorization
Key results / impact:
  rectangular fronts like UMFPACK, but simpler frontal matrix assembly (concatenation, not summation) (Duff, Puglisi)
  rank approximation (Heath, extended to the multifrontal case)
  multicore parallelism
    on a multicore CPU (70 Gflops theoretical peak): up to 30 Gflops
  sparse qr in MATLAB, and x=A\b (usage sketch below)
  today's talk: GPU algorithm
    novel "Bucket QR" scheduler and custom GPU kernels
    up to 150 Gflops on one Kepler K20c, 286 Gflops on 4 Tesla C2070's
    up to 28x speedup vs the CPU algorithm (10x typical for large problems)
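As a usage note (not on the slides): the MATLAB interfaces mentioned above can be exercised with a minimal script; the matrix size and density below are arbitrary illustration values.

  % minimal usage sketch (assumed example): sparse QR and least squares in MATLAB,
  % which dispatch to SuiteSparseQR for sparse input
  A = sprandn (1000, 200, 0.01) ;   % arbitrary sparse rectangular test matrix
  b = randn (1000, 1) ;
  x = A \ b ;                       % sparse least-squares solve via QR
  R = qr (A) ;                      % R factor of the sparse QR (Q is not formed)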
SLIDE 8
SuiteSparseQR
A column elimination tree and its supernodes
SLIDE 9
SuiteSparseQR
Frontal matrix assembly
SLIDE 10
SuiteSparseQR
concatenation, resulting in a staircase matrix
SLIDE 11
SuiteSparseQR
SLIDE 12
Multifrontal factorization and assembly
Prior methods
one front at a time on the GPU
assembly on CPU
panel factorization on the CPU, applied on the GPU
Our multifrontal QR
many fronts on the GPU (entire subtree)
assembly on GPU: data concatenation, not summation (see the sketch below)
entire dense QR of each front on the GPU
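A toy MATLAB sketch (assumed shapes only, not the real front data structures) of why QR assembly is pure data movement: child contribution blocks and new rows of A are stacked, never summed.

  % assumed illustration: a parent front assembled by concatenation
  R1 = triu (randn (4, 8)) ;        % hypothetical contribution block from child 1
  R2 = triu (randn (3, 8)) ;        % hypothetical contribution block from child 2
  S  = randn (5, 8) ;               % hypothetical new rows of A for this front
  F  = [R1 ; R2 ; S] ;              % concatenate: no numerical summation needed
  % dense QR of the (staircase) front F then yields the parent's R and its
  % contribution block for the next ancestor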
SLIDE 13
Consider a subtree of frontal matrices on the GPU
SLIDE 14
Expanded to show GPU kernel launches
SLIDE 15
Bucket QR factorization
SLIDE 16
Bucket QR factorization
SLIDE 17
Bucket QR factorization
SLIDE 18
Bucket QR factorization
SLIDE 19
Bucket QR factorization
SLIDE 20
Bucket QR factorization
SLIDE 21
Bucket QR factorization
SLIDE 22
Bucket QR factorization
SLIDE 23
Bucket QR factorization
SLIDE 24
SLIDE 25
GPU kernels for Bucket QR
Bucket QR requires two kernels on the GPU:

QR factorization of a t-by-1 tile, t = 1, 2, or 3
  creates a block Householder
  details on the next slides

Apply a block Householder:
  A = A - V * (T' * (V' * A))
  A is t-by-s, where s can be large
  thread-block iterates over 2 column blocks at a time (details omitted)
  (a sanity-check sketch follows below)
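A small sanity-check sketch in MATLAB (assumed sizes, not the GPU kernel itself): the block apply A = A - V*(T'*(V'*A)) produces the same result as applying the individual Householder reflectors one at a time.

  % sanity-check sketch (assumed small sizes)
  m = 12 ; n = 4 ; s = 20 ;
  A = randn (m, n) ;                      % panel to factorize
  C = randn (m, s) ;  C1 = C ;            % matrix to update
  V = zeros (m, n) ;  T = zeros (n) ;
  for k = 1:n
      x = A (k:m,k) ;
      v = x ;  v (1) = v (1) + sign (x (1)) * norm (x) ;   % assumes x(1) ~= 0
      tau = 2 / (v' * v) ;
      A (k:m,k:n) = A (k:m,k:n) - v * (tau * (v' * A (k:m,k:n))) ;
      C1 (k:m,:) = C1 (k:m,:) - v * (tau * (v' * C1 (k:m,:))) ;  % one reflector at a time
      V (k:m,k) = v ;
      z = - tau * v' * V (k:m,1:k-1) ;
      T (1:k-1,k) = T (1:k-1,1:k-1) * z' ;
      T (k,k) = tau ;
  end
  C2 = C - V * (T' * (V' * C)) ;          % the block apply, as on the slide
  disp (norm (C1 - C2))                   % ~ machine epsilon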
SLIDE 26
GPU kernel: Block Householder QR
Block Householder: A = Q*R, where Q = (I - V*T'*V')' and R is upper triangular

[m n] = size (A)
for k = 1:n
    [tau, v] = house (A (k:m,k))
    A (k:m,k:n) = A (k:m,k:n) - v * (tau * (v' * A (k:m,k:n)))
    V (k:m,k) = v ;  Tau (k) = tau
end
T = zeros (n)
for k = 1:n
    tau = Tau (k) ;  v = V (k:m,k)
    z = - tau * v' * V (k:m,1:k-1)
    T (1:k-1,k) = T (1:k-1,1:k-1) * z'
    T (k,k) = tau
end
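The pseudocode above calls house, which is not defined on the slide. A minimal sketch of one possible convention (consistent with the later slide that stores v(1) separately rather than normalizing it to 1):

  function [tau, v] = house (x)
  % hedged sketch of a Householder generator (assumed convention):
  % returns tau, v so that (I - tau*v*v') * x = -sign(x(1)) * norm(x) * e1
  v = x ;
  v (1) = v (1) + sign (x (1)) * norm (x) ;   % assumes x(1) ~= 0 for brevity
  tau = 2 / (v' * v) ;
  end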
SLIDE 27
GPU kernel: Block Householder QR
Householder update
A (k:m,k:n) = A (k:m,k:n) - ...
    v * (tau * (v' * A (k:m,k:n)))

Construct T

z = - tau * v' * V (k:m,1:k-1)
T (1:k-1,k) = T (1:k-1,1:k-1) * z'
T (k,k) = tau
SLIDE 28
GPU kernel: Block Householder QR
Towards a GPU kernel

Overwrite tril(A,-1) with V, and fold in the construction of T.

[m n] = size (A)
T = zeros (n)
for k = 1:n
    [tau, v] = house (A (k:m,k))
    A (k:m,k:n) = A (k:m,k:n) - ...
        v * (tau * (v' * A (k:m,k:n)))
    V1 (k) = v (1)
    A (k+1:m,k) = v (2:end)
    z = - tau * v' * A (k:m,1:k-1)
    T (1:k-1,k) = T (1:k-1,1:k-1) * z'
    T (k,k) = tau
end
SLIDE 29
GPU kernel: Block Householder QR
The GPU kernel updates A and constructs T in parallel:

A (k:m,k:n) = A (k:m,k:n) - ...
    v * (tau * (v' * A (k:m,k:n)))
z = - tau * (v' * A (k:m,1:k-1))
T (1:k-1,k) = T (1:k-1,1:k-1) * z'
T (k,k) = tau

becomes

z = -tau * v' * A (k:m,:)
A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)
T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)'
T (k,k) = tau
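A quick MATLAB check (assumed sizes) that the fused form is the same arithmetic: one z across all columns reproduces both the old z over columns 1:k-1 and the rank-1 update over columns k:n.

  % assumed-size sketch: separate form vs fused form at step k
  m = 12 ; n = 6 ; k = 4 ;
  A = randn (m, n) ;
  v = randn (m-k+1, 1) ;          % Householder vector for rows k:m
  tau = 2 / (v' * v) ;
  % separate form (previous slide):
  A1 = A ;
  A1 (k:m,k:n) = A1 (k:m,k:n) - v * (tau * (v' * A1 (k:m,k:n))) ;
  z1 = - tau * (v' * A (k:m,1:k-1)) ;
  % fused form (this slide): one row vector z across all columns
  z  = -tau * v' * A (k:m,:) ;
  A2 = A ;
  A2 (k:m,k:n) = A2 (k:m,k:n) + v * z (k:n) ;
  disp (norm (A1 - A2) + norm (z1 - z (1:k-1)))   % both differences ~0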
SLIDE 30
GPU kernel: Block Householder QR
z = -tau * v' * A (k:m,:)
A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)
T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)'
T (k,k) = tau
SLIDE 31
GPU kernel: Block Householder QR
The GPU kernel: thread-level parallelism

z = -tau * v' * A (k:m,:)
A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)
T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)'
T (k,k) = tau

A is 96-by-32, in registers during factorization, and finally written to global memory at the end (V and R)
each thread owns an 8-by-1 "bitty block" of A
v is 96-by-1, in shared memory
z is 32-by-1, in shared memory
requires a 12-by-32 shared space for the v' * A (k:m,:) reduction
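Back-of-the-envelope arithmetic for the stated geometry (the mapping to threads is assumed from the sizes above):

  % assumed check of the kernel geometry
  rows = 96 ; cols = 32 ; bitty = 8 ;          % 96-by-32 tile, 8-by-1 bitty blocks
  row_chunks = rows / bitty                     % = 12 row chunks per column
  threads_per_block = row_chunks * cols         % = 384 threads, one bitty block each
  reduction_entries = row_chunks * cols         % 12-by-32 scalars for the v'*A reduction
  shared_doubles = rows + cols + reduction_entries   % v (96) + z (32) + reduction space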
SLIDE 32
GPU kernel: Block Householder QR
Putting it all together. At the kth step:
  threads that own column k write A to shared
  thread zero computes the Householder coefficients

z = -tau * v' * A (k:m,:)
  each thread computes an 8-by-1 dot product in parallel
  and writes its scalar result into the 12-by-32 reduction space
  warp zero sums the reduction space to get z (reduction sketched below)

A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)
  only done by threads that own columns k:n
  threads that own column k+1 compute the norm of that column of A, for the next Householder coefficient, saving the result in a 12-by-1 reduction space

T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)'
  only done by threads 1:k-1
  thread zero sums up the reduction space for the norm of column k+1
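A sequential MATLAB emulation (assumed sizes, not CUDA) of the two-stage reduction for z: each thread contributes one 8-by-1 dot product to the 12-by-32 reduction space, then warp zero sums down the 12 rows.

  % sequential emulation (assumed sizes) of the z reduction at step k
  m = 96 ; n = 32 ; k = 5 ; bitty = 8 ; chunks = m / bitty ;   % 12 row chunks
  A = randn (m, n) ;
  v = [zeros(k-1,1) ; randn(m-k+1,1)] ;     % Householder vector, nonzero in rows k:m
  tau = 2 / (v' * v) ;
  red = zeros (chunks, n) ;                 % the 12-by-32 reduction space (shared memory)
  for i = 1:chunks                          % on the GPU, each entry red(i,j) is one thread
      r = (i-1)*bitty+1 : i*bitty ;         % that thread's 8 rows of A
      red (i,:) = v (r)' * A (r,:) ;        % an 8-by-1 dot product per column
  end
  z = -tau * sum (red, 1) ;                 % "warp zero" sums the reduction space
  disp (norm (z - (-tau * v' * A)))         % matches the direct computation (~0)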
SLIDE 33
GPU kernel: Block Householder QR
z = -tau * v' * A (k:m,:)
  each thread computes an 8-by-1 dot product in parallel
  and writes its scalar result into the 12-by-32 reduction space
  warp zero sums the reduction space to get z
SLIDE 34
GPU kernel: Block Householder QR
A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)
only done by threads that own columns k:n
threads that own column k+1 compute the norm of that column of A, for the next Householder coefficient, saving the result in a 12-by-1 reduction space
SLIDE 35
GPU kernel: Block Householder QR
T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)’
only done by threads 1:k-1
SLIDE 36
Single GPU performance results
Putting it all together: performance results

                              Fermi        K20          K40
GPU kernels:
  apply block Householder     183 Gflops   260 Gflops   360 Gflops
  factorize 3 tiles            27 Gflops    20 Gflops
dense QR for large front      107 Gflops   120 Gflops
  ('bare metal' flops)        154 Gflops   172 Gflops
sparse QR on GPU               80 Gflops   150 Gflops
peak speedup over CPU          11x          20x
typical speedup over CPU        5x          10x
SLIDE 37
GPU Kernel pitfalls
What we'll do differently in our kernel design:

Householder block-apply uses too much shared memory

uberkernel approach:
  each thread block determines what to do from a task list (QR, apply, assemble)
  pro: single large kernel launch, lots of parallelism
  con: all tasks use the same thread geometry
  con: QR of a panel needs higher occupancy to hide scalar work

to do:
  block-apply kernel needs to stage A = A - W*(Y'*A) by not keeping W and Y in shared memory (staged-apply sketch below)
  split the uberkernel so the QR panel kernel can have higher occupancy
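A rough MATLAB sketch of the to-do item above (assumed blocking, written in the usual WY convention A = A - W*(Y'*A)): stream column panels of A through shared memory instead of keeping W and Y resident.

  % hedged sketch: staged block apply, one column panel at a time
  m = 256 ; n = 512 ; r = 32 ; panel = 64 ;          % arbitrary illustration sizes
  A = randn (m, n) ;
  Y = tril (randn (m, r), -1) + eye (m, r) ;         % unit lower trapezoidal Y
  W = randn (m, r) ;                                 % e.g. W = Y*T' in compact WY form
  for j = 1:panel:n
      cols = j : min (j + panel - 1, n) ;            % this panel would live in shared memory
      A (:,cols) = A (:,cols) - W * (Y' * A (:,cols)) ;  % W, Y streamed, not kept in shared
  end
  % same result as the unstaged A = A - W*(Y'*A), but with a bounded per-panel footprint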
SLIDE 38
Single GPU performance on many matrices
SLIDE 39
Multiple GPUs: decomposing the tree
SLIDE 40
Multiple GPUs: Performance Results
Results on NVIDIA Tesla C2070 GPUs

problem      CPU      1 GPU    2 GPUs   4 GPUs   speedup   speedup
             Gflops   Gflops   Gflops   Gflops   vs CPU    vs 1 GPU
1500:2D      6.1      16.0     27.1     38.4     6.3       2.4
2000:2D      6.9      21.0     37.8     56.7     8.2       2.7
3000:2D      7.8      25.8     44.8     73.7     9.4       2.9
lp nug20     23.9     74.3     86.4     66.1     2.8       0.9
ch7-8-b3     25.3     104.0    111.3    173.7    6.9       1.7
ch7-8-b3:8   10.0     88.0     160.4    286.2    28.6      3.3
SLIDE 41
Multiple GPUs: for large fronts
SLIDE 42
Multiple GPUs: bucket scheduler on the large scale
SLIDE 43
Multiple GPUs: bucket scheduler on the large scale
SLIDE 44
Multiple GPUs: bucket scheduler on the large scale
SLIDE 45
Multiple GPUs: bucket scheduler on the large scale
SLIDE 46
Multiple GPUs: bucket scheduler on the large scale
SLIDE 47
Multiple GPUs: bucket scheduler on the large scale
SLIDE 48
Multiple GPUs: bucket scheduler on the large scale
SLIDE 49
Multiple GPUs: bucket scheduler on the large scale
SLIDE 50
Multiple GPUs: bucket scheduler on the large scale
SLIDE 51
Multiple GPUs: bucket scheduler on the large scale
SLIDE 52
Multiple GPUs: bucket scheduler on the large scale
SLIDE 53
Multiple GPUs: bucket scheduler on the large scale
SLIDE 54
Multiple GPUs: bucket scheduler on the large scale
SLIDE 55
Multiple GPUs: bucket scheduler on the large scale
SLIDE 56
Multiple GPUs: bucket scheduler on the large scale
SLIDE 57
Multiple GPUs: bucket scheduler on the large scale
SLIDE 58
Multiple GPUs: bucket scheduler on the large scale
SLIDE 59
Multiple GPUs: bucket scheduler on the large scale
SLIDE 60
Multiple GPUs: bucket scheduler on the large scale
SLIDE 61
Multiple GPUs: bucket scheduler on the large scale
SLIDE 62
Multiple GPUs: bucket scheduler on the large scale
SLIDE 63
Multiple GPUs: bucket scheduler on the large scale
SLIDE 64
Multiple GPUs: bucket scheduler on the large scale
SLIDE 65
Multiple GPUs: bucket scheduler on the large scale
SLIDE 66
Multiple GPUs: bucket scheduler on the large scale
SLIDE 67
SLIDE 68
SLIDE 69
SLIDE 70
Acknowledgements: National Science Foundation NVIDIA Texas A&M University
SLIDE 71