

SLIDE 1

Symmetric dense matrix tridiagonalization on a GPU cluster

Ichitaro Yamazaki, Tim Dong, Stan Tomov, Jack Dongarra
Innovative Computing Lab., University of Tennessee, Knoxville
Accelerators and Hybrid Exascale Systems (AsHES) Workshop, Boston, Massachusetts, 05/20/2013

SLIDE 2

Introduction

◮ Objective: reduce a matrix A into a tridiagonal matrix T, where A is dense (a_ij ≠ 0) and symmetric (A = A^T), through Q^T A Q = T, where Q is orthogonal.

◮ Motivation: often a bottleneck in solving the symmetric dense eigenvalue problem Av = λv or the generalized problem Av = λBv.

◮ These problems arise in many scientific and engineering simulations, e.g.,

  • electronic structure calculation, quantum physics, image processing, web analysis, etc.

SLIDE 3

MAGMA (Matrix Algebra on GPU and Multicore Architectures)

◮ multi-GPU SYTRD integrated in

  • dense and sparse eigensolvers

  • electronic structure calculation (e.g., Exciting, Elk, Abinit, Quantum Espresso)

◮ optimized GPU kernels with 1DBC on multiple GPUs

(The 8-GPU results below are from R. Solcà of ETH.)

[Figure: multi-GPU dense symmetric solver: time (s) of DSYTRD, DSTEDC, and DORMTR vs. number of GPUs (1-3).]

[Figure: multi-GPU dense symmetric generalized solver: time (s) of ZPOTRF, ZHEGST, ZHEEVD, and ZTRSMM vs. number of GPUs (1-8).]

[Figure: distributed sparse Lanczos solver: total solution time (minutes) of LAPACK vs. MAGMA on 2, 4, and 8 nodes.]

SLIDE 4

Extension to distributed GPUs

Motivation: solving larger-scale problems

◮ not easy to develop an out-of-GPU-memory version

◮ whole trailing submatrix is accessed to reduce each column

◮ problem size limited by GPU memory

◮ weak-scaling studies on tens of GPUs or nodes.

Our first step: ScaLAPACK with GPUs

◮ any number of MPI processes/GPUs per node, but one MPI process dispatches the GPU kernels.

  • larger GPU kernels and less communication

◮ 1DBC, with the MPI processes mapped to cores in a round-robin fashion among the nodes (see the sketch below).

  • our GPU kernels are reused

◮ same optimization techniques (e.g., static scheduling, overlapping CPU with GPU computation and MPI communication of the vectors).

[Figure: example mapping of MPI ranks 0-5 and GPUs 0,0-1,1 onto Node-0 and Node-1, with block columns assigned round-robin under 1DBC.]
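For concreteness, the ownership rule of a standard 1D block-cyclic (1DBC) layout over p processes can be sketched as follows (the helper names are illustrative, not part of ScaLAPACK or MAGMA):

    /* Which MPI rank owns global block column J, and where that block lands
       locally, under a 1D block-cyclic distribution over p processes.
       J is a block index (J = global_column / nb for block size nb). */
    int owner_1dbc(int J, int p)       { return J % p; }   /* owning process           */
    int local_block_1dbc(int J, int p) { return J / p; }   /* local block-column index */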

SLIDE 5

“Blocked” tridiagonalization algorithm: step 1, panel factorization; step 2, trailing submatrix update

[Figure: the columns q_1, q_2, ... of the panel are accumulated into a blocked update Q that is applied to the trailing submatrix.]

for each column in the panel

  • generate q_j to reduce the column

  • accumulate it into a blocked update Q

  • reduce communication by a blocked update (see the update formulas below)
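For reference, one standard way to write the per-column step and the blocked update is the LAPACK xSYTRD/xLATRD formulation; the following is a sketch of that formulation, not necessarily the exact MAGMA variant:

    H_j = I - \tau_j v_j v_j^T                       % Householder reflector for column j
    p_j = \tau_j A v_j                               % the SYMV with the trailing submatrix
    w_j = p_j - \tfrac{1}{2} \tau_j (p_j^T v_j) v_j  % correction that keeps the update symmetric
    A \leftarrow A - V W^T - W V^T                   % blocked update, V = [v_1,...,v_nb], W = [w_1,...,w_nb]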

SLIDE 6

computational kernels in tridiagonalization: total of (4/3)n^3 + O(n^2) flops

1. panel factorization: tridiagonal reduction of a panel

◮ SYMV (bandwidth-limited BLAS-2), about 50% of the flops: multiply with the whole trailing submatrix A to reduce each column, w_j := A v_j

  → ≈ 2n^2 flops over (n^2 + n)/2 + n data ≈ 4 flops per data (per call) → total of (2/3)n^3 flops

  → exploit the greater memory bandwidth of GPUs

2. trailing submatrix update: blocked orthogonal update of the trailing submatrix

◮ SYR2K (more data-parallel BLAS-3), about 50% of the flops: symmetric rank-k update with each panel, A := A − V W^T − W V^T

  → ≈ 2n^2 k flops over (n^2 + n)/2 + nk data ≈ 4k flops per data (per call, k = 32 or 64) → total of (2/3)n^3 flops

  → exploit the larger computational throughput of GPUs
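In non-distributed form the two kernels map directly onto standard BLAS routines. A minimal single-GPU sketch with cuBLAS (illustrative names and lower-triangular storage assumed; dA is an m x m trailing submatrix and dV, dW are m x k panels already resident on the GPU):

    #include <cublas_v2.h>

    /* w := A * v  (SYMV, BLAS-2) and  A := A - V W^T - W V^T  (SYR2K, BLAS-3). */
    void tridiag_kernels(cublasHandle_t handle, int m, int k,
                         double *dA, int ldda,
                         const double *dV, const double *dW, int lddv,
                         const double *dv, double *dw)
    {
        const double one = 1.0, zero = 0.0, minus_one = -1.0;

        /* bandwidth-limited kernel: multiply the trailing submatrix with one vector */
        cublasDsymv(handle, CUBLAS_FILL_MODE_LOWER, m,
                    &one, dA, ldda, dv, 1, &zero, dw, 1);

        /* compute-limited kernel: blocked symmetric rank-2k update of the trailing submatrix */
        cublasDsyr2k(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, m, k,
                     &minus_one, dV, lddv, dW, lddv, &one, dA, ldda);
    }

The following slides replace these single calls with custom multi-GPU versions of the same two operations.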

SLIDE 7

breakdown of reduction time using one GPU

Keeneland: (two 8-core Intel SandyBridge CPUs + three NVIDIA M2090 GPUs)/node

[Figure: fraction of the total reduction time vs. matrix size n (up to about 1.8 × 10^4), for dsytrd, dsymv+dsyr2k+dlarfg, dsymv+dsyr2k, and dsymv alone.]

◮ reduction time dominated by dsymv and dsyr2k (up to 90%), especially BLAS-2 dsymv (up to 80%).

SLIDE 8

multi-GPU kernel 1: BLAS-3 SYR2K symmetric rank-k updates

a sequence of “small” CUBLAS calls on streams to exploit data parallelism, where A is statically distributed among the GPUs in 1DBC

[Figure: 1DBC-distributed trailing submatrix; each block column receives one syr2k on its diagonal block and gemm's on its off-diagonal blocks of V and W.]

◮ each GPU updates the block columns of its local submatrix

◮ SYR2K(V_I, W_I) on the diagonal block, and GEMM(V_I^T, W_{I:N}) and GEMM(W_I^T, V_{I:N}) on the off-diagonal blocks

◮ multiple GPU streams are cycled over the block columns, as sketched below
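A simplified, single-process sketch of this kernel follows. It assumes lower-triangular, column-major storage with leading dimensions ldda/lddv, block size nb, and a pre-created array of CUDA streams; the actual MAGMA kernel additionally handles the 1DBC distribution of the block columns across GPUs:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    /* A := A - V W^T - W V^T on the lower triangle of an m x m trailing submatrix,
       issued as one syr2k (diagonal block) plus two gemm's (off-diagonal block)
       per block column, cycling over a set of CUDA streams. */
    void syr2k_update_streamed(cublasHandle_t handle, cudaStream_t *streams, int nstreams,
                               int m, int k, int nb,
                               const double *dV, const double *dW, int lddv,
                               double *dA, int ldda)
    {
        const double one = 1.0, minus_one = -1.0;
        for (int j = 0, s = 0; j < m; j += nb, s = (s + 1) % nstreams) {
            int jb = (j + nb <= m) ? nb : m - j;     /* width of this block column   */
            cublasSetStream(handle, streams[s]);      /* cycle streams over columns   */

            /* diagonal block: A(j:j+jb, j:j+jb) -= V_j W_j^T + W_j V_j^T */
            cublasDsyr2k(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, jb, k,
                         &minus_one, dV + j, lddv, dW + j, lddv,
                         &one, dA + j + (size_t)j * ldda, ldda);

            int mj = m - j - jb;                      /* rows below the diagonal block */
            if (mj > 0) {
                /* off-diagonal block: A(j+jb:m, j:j+jb) -= V(j+jb:m,:) W_j^T + W(j+jb:m,:) V_j^T */
                cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, mj, jb, k,
                            &minus_one, dV + j + jb, lddv, dW + j, lddv,
                            &one, dA + (j + jb) + (size_t)j * ldda, ldda);
                cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, mj, jb, k,
                            &minus_one, dW + j + jb, lddv, dV + j, lddv,
                            &one, dA + (j + jb) + (size_t)j * ldda, ldda);
            }
        }
    }

In the distributed setting, each GPU would run this loop only over the block columns it owns under 1DBC.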

SLIDE 9

performance of SYR2K on multiple GPUs (Keeneland: three NVIDIA M2090 GPUs/node).

[Figure: SYR2K performance (Gflop/s) vs. matrix size on 1-3 GPUs and with MKL, for double complex (left) and double real (right).]

high data-parallelism

◮ peak double-precision performance = 665 Gflop/s → 40% of the peak

◮ 430 and 380 Gflop/s by zgemm and dgemm → 75% of gemm

◮ only about 15% of the reduction time is spent here

SLIDE 10

multi-GPU kernel 2: BLAS-2 SYMV symmetric matrix-vector multiplication

a specialized CUDA kernel that minimizes data traffic for the multiplication with the local matrix

(an extension of our SC’11 paper).

[Figure: assignment of the blocks of A to threads 1-8, one block row per thread, for the multiplication y = A x.]

◮ each GPU thread processes a block row

◮ as each block A_ij is read, we multiply with both A_ij and A_ij^T

  • each thread writes its partial result to its own workspace

  • another kernel is launched to sum the partial results

  • A is read only once from GPU memory (see the simplified kernel sketch below)
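A greatly simplified sketch of this idea follows: one thread block per block row, one thread per row, lower-triangular column-major storage, launched with ceil(n/nb) blocks of nb threads each. For brevity it uses atomicAdd on doubles (compute capability 6.0+), whereas the actual MAGMA kernel uses shared-memory reductions and register blocking instead:

    #include <cuda_runtime.h>

    /* Partial SYMV on the lower triangle: thread block `ib` owns block row ib of A.
       Each stored entry A(row,col) contributes A(row,col)*x[col] to y[row] and, by
       symmetry, A(row,col)*x[row] to y[col]; all contributions go into this block
       row's slice of `work`, which must be zeroed beforehand (e.g., cudaMemset). */
    __global__ void symv_lower_partial(int n, int nb, const double *A, int lda,
                                       const double *x, double *work /* n * gridDim.x */)
    {
        int ib  = blockIdx.x;              /* block row handled by this thread block */
        int row = ib * nb + threadIdx.x;   /* global row handled by this thread      */
        if (row >= n) return;
        double *my_work = work + (size_t)ib * n;   /* this block row's partial result */
        double yrow = 0.0;
        for (int col = 0; col <= row; ++col) {     /* entries stored in the lower triangle */
            double a = A[row + (size_t)col * lda];
            yrow += a * x[col];                    /* contribution to y[row] */
            if (col < row)                         /* symmetric (transpose) contribution to y[col] */
                atomicAdd(&my_work[col], a * x[row]);
        }
        atomicAdd(&my_work[row], yrow);
    }

    /* Second kernel: y[i] = sum of the per-block-row partial results. */
    __global__ void symv_sum_partials(int n, int nparts, const double *work, double *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        double s = 0.0;
        for (int b = 0; b < nparts; ++b)
            s += work[(size_t)b * n + i];
        y[i] = s;
    }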

SLIDE 11

performance of SYMV on multiple GPUs (Keeneland: three NVIDIA M2090 GPUs/node).

[Figure: SYMV performance (Gflop/s) vs. matrix size n on 1-3 GPUs and with CUBLAS, for double complex (left) and double real (right).]

bandwidth-limited operation:

◮ zhemv performs twice as many operations → twice the Gflop/s

◮ 120 GB/s with ECC on → peaks are 120 and 60 Gflop/s for zhemv and dsymv → 65%-75% of this peak → 15%-20% of gemm

◮ up to 80% of the reduction time is spent here

SLIDE 12

putting them together: multi-GPU tridiagonalization

1. statically distribute A in 1DBC
2. for each panel
   2.1 panel factorization: for each column in the panel
       2.1.1 update a_j with the previous v and w on the CPUs
       2.1.2 compute the Householder reflector on the CPUs
       2.1.3 broadcast the reflector to all GPUs
       2.1.4 SYMV with the local submatrices on the GPUs in parallel
       2.1.5 copy the vector v_j back to the CPU
       2.1.6 compute w_j on the CPUs
   2.2 trailing submatrix update
       2.2.1 broadcast V and W to the GPUs
       2.2.2 SYR2K with the local submatrix on each GPU in parallel

for each GPU call, communicate:

◮ the local matrix ((n^2 + n)/(2 × #GPUs) data) from GPU memory

◮ vector(s) (n or nk data) between CPU and GPU (non-blocking MPI + GPU streams)

SLIDE 13

hybrid CPU-GPU computing with asynchronous MPI and GPU communication

◮ SYR2K: hide the communication of V or W behind the panel factorization using asynchronous MPI sends and a dedicated GPU stream

◮ SYMV: overlap CPU and GPU computation (and hide the CPU-GPU communication) when multiplying with the updated A (left-looking within a panel, right-looking between panels),

  (A − V W^T − W V^T) v_j,  where A is n × n and V, W are n × (j − 1):

  A v_j on the GPUs, while (V W^T + W V^T) v_j (and the update of the next column) on the CPUs (a sketch follows the figure below).

[Figure: a piece of an execution trace on 3 GPUs.]
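A sketch of this SYMV overlap on one GPU is given below (all names are illustrative, not the MAGMA interface; the GPU holds the lower triangle of the partially updated A, j > 1 is assumed, and y_gpu is preferably a pinned host buffer). The GPU computes A v_j on its own stream while the CPU applies the low-rank correction; the partial results are summed afterwards:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <cblas.h>

    /* Overlap y = (A - V W^T - W V^T) v_j across the CPU and the GPU. */
    void hybrid_symv(cublasHandle_t handle, cudaStream_t symv_stream,
                     int n, int j,
                     const double *dA, int ldda,        /* lower triangle of A on the GPU   */
                     const double *dv, double *dy,      /* v_j and A*v_j on the GPU         */
                     const double *V, const double *W,  /* V, W on the CPU, n x (j-1)       */
                     const double *v, double *y,        /* v_j and the result on the CPU    */
                     double *t1, double *t2, double *y_gpu) /* CPU work arrays              */
    {
        const double one = 1.0, zero = 0.0;

        /* GPU part: dy = A * v_j, launched asynchronously on its own stream */
        cublasSetStream(handle, symv_stream);
        cublasDsymv(handle, CUBLAS_FILL_MODE_LOWER, n, &one, dA, ldda, dv, 1, &zero, dy, 1);
        cudaMemcpyAsync(y_gpu, dy, n * sizeof(double), cudaMemcpyDeviceToHost, symv_stream);

        /* CPU part, overlapped with the GPU: y = -(V (W^T v_j) + W (V^T v_j)) */
        cblas_dgemv(CblasColMajor, CblasTrans,   n, j - 1,  1.0, W, n, v, 1, 0.0, t1, 1);
        cblas_dgemv(CblasColMajor, CblasTrans,   n, j - 1,  1.0, V, n, v, 1, 0.0, t2, 1);
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, j - 1, -1.0, V, n, t1, 1, 0.0, y, 1);
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, j - 1, -1.0, W, n, t2, 1, 1.0, y, 1);

        /* combine the two partial results once the GPU stream has drained */
        cudaStreamSynchronize(symv_stream);
        for (int i = 0; i < n; ++i) y[i] += y_gpu[i];
    }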

SLIDE 14

Performance on Keeneland: ScaLAPACK

◮ Keeneland: two 8-core 2.6 GHz Intel Xeon (SandyBridge) processors and three NVIDIA Tesla M2090 GPUs per node.

◮ weak-scaling studies: matrix size = 11,520 × (number of nodes)

[Figure: time (s) vs. number of nodes (1, 4, 16), broken down into PDSYR2K, PDSYMV, PDSYMV communication, and others.]

◮ hybrid programming performed well

  • 1 MPI process/socket obtained the best computation and communication performance

SLIDE 15

Performance on Keeneland: ScaLAPACK+1GPU/node

[Figure: time (s) vs. number of nodes (1, 4, 16) with 1 GPU/node, broken down into PDSYR2K, PDSYMV, PDSYMV communication, and others, with annotated speedups between 0.8 and 4.8.]

◮ GPU extension got reasonable speedups

  • significant speedups over 1 MPI/core or 1 MPI/node.

  • similar performance to 1 MPI/socket.

SLIDE 16

Performance on Keeneland: ScaLAPACK+3GPUs/node

[Figure: time (s) vs. number of nodes (1, 4, 16) with 3 GPUs/node, broken down into PDSYR2K, PDSYMV, PDSYMV communication, and others, with annotated speedups between 1.1 and 10.8.]

◮ GPU extension got reasonable speedups

◮ but it is difficult to scale to a large number of nodes:

  • non-blocking MPI-GPU communication to reduce the waiting time

    → MPI communication starts to dominate

  • GPU kernel to reduce the computation time

    → GPU efficiency goes down on many nodes

  both of which may be addressed by 2DBC.

SLIDE 17

Performance of SYMV in 2DBC vs. 1DBC

[Figure: SYMV performance (Gflop/s) vs. matrix size (up to about 3 × 10^4) for 1DBC and 2DBC layouts on 4, 16, and 64 GPUs (GPU grids 1×4, 2×2, 1×16, 4×4, 1×64, 8×8).]

◮ local submatrix is closer to square with 2DBC (see the example below)

  • GPU-kernel performs better on a square matrix

◮ it is being integrated into ScaLAPACK.
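As a purely illustrative example (numbers chosen for this note, not taken from the slides): for n = 30,720 on 64 GPUs, a 1DBC layout (1-by-64 grid) gives each GPU a 30,720 × 480 sliver of A, while an 8-by-8 2DBC layout gives a 3,840 × 3,840 local block, which is much closer to square and therefore friendlier to the SYMV kernel.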

SLIDE 18

Final remarks

◮ OK speedups on a small number of nodes

◮ challenging to scale to more nodes

  • MPI communication starts to dominate
  • GPU kernel does not scale well

[Figure: SYMV performance in 1DBC vs. 2DBC on 4, 16, and 64 GPUs (same plot as Slide 17).]

Current and future work:

◮ 2DBC (or any other suggestion?) to obtain higher performance

  • to reduce communication and to improve GPU utilization

  • larger-scale studies and distributed Kepler GPUs?

◮ a runtime system? a two-stage reduction on distributed GPUs?
