Efficiency of general Krylov methods on GPUs – An experimental study


6th AsHES workshop, May 26th, 2016, Chicago, USA. Efficiency of general Krylov methods on GPUs – An experimental study. H. Anzt, M. Kreutzer, M. Koehler, G. Wellein, J. Dongarra. Piotr Luszczek. Solving large sparse linear systems on GPUs.

SLIDE 1

Efficiency of general Krylov methods on GPUs – An experimental study

6th AsHES workshop May 26th, 2016, Chicago, USA

H. Anzt, M. Kreutzer, M. Koehler, G. Wellein, J. Dongarra

Piotr Luszczek

SLIDE 2

Solving large sparse linear systems on GPUs

Large variety of iterative methods
Krylov solvers work well for many problems
Efficiency depends on problem characteristics
eigenvalue distribution
diagonal dominance
definiteness
Black-box scenario: Problem characteristics are not known.

http://blog.heltontool.com/category/tools/

SLIDE 3

The Shotgun Approach

Barrett et al.: Algorithmic bombardment for the iterative solution of linear systems: A poly-iterative approach, Journal of Computational and Applied Mathematics 74, 1996.

Run multiple Krylov solvers simultaneously as poly-iterative method

Theoretical benefits

benefit from the fastest convergence
drop solvers that break down

Computational benefits

Runtime overhead small for solvers with similar structure
SpMM replaces SpMV to generate multiple Krylov subspaces
Interleaving global communication for low synchronization count
Enhanced fault-tolerance
Limitation: Solvers are required to have similar structure (SpMV/reduction)
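
The point about SpMM replacing SpMV can be illustrated with a minimal pure-Python sketch. It is not the GPU kernel used in the study; the CSR layout, the function names `spmv_csr` and `spmm_csr`, and the 3x3 matrix are assumptions made for this example. It shows that one pass over the nonzeros can serve the vectors of several solvers at once.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x for a CSR matrix A and a single dense vector x."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y


def spmm_csr(row_ptr, col_idx, vals, xs):
    """Y = A @ X, where xs holds one dense vector per solver.

    Each nonzero of A is read once and applied to every vector,
    instead of re-reading A once per solver.
    """
    n_rows = len(row_ptr) - 1
    ys = [[0.0] * n_rows for _ in xs]
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a, j = vals[k], col_idx[k]
            for y, x in zip(ys, xs):
                y[i] += a * x[j]
    return ys


# Hypothetical example: A = [[4, 1, 0], [0, 3, 0], [2, 0, 5]] in CSR form.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 1, 1, 0, 2]
vals = [4.0, 1.0, 3.0, 2.0, 5.0]

# Two vectors, standing in for the current vectors of two different solvers.
xs = [[1.0, 1.0, 1.0], [1.0, 0.0, -1.0]]
ys = spmm_csr(row_ptr, col_idx, vals, xs)
```

Each result vector in `ys` matches what a separate `spmv_csr` call would return, but the matrix data is traversed only once, which is where the bandwidth saving would come from on a memory-bound device.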

SLIDE 4

Contribution

[Figure: histograms of matrix count versus number of nonzeros (10^4 to 10^7) and versus matrix size (10^3 to 10^6)]

Run different Krylov methods on large number of test matrices

Analyze with different target metrics: Convergence, SpMV, Runtime

Non-symmetric test matrices from University of Florida Matrix Collection
1,000 < n < 5,000,000; nnz<100,000,000
At least one of the considered methods converges within 2n SpMV
94 non-symmetric test matrices in total
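
The selection criteria above can be written as a small filter. This is a sketch under stated assumptions: the `MatrixInfo` record, its field names, and the sample entries are hypothetical and do not come from the study; "within 2n SpMV" is read here as "at most 2n SpMV operations".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MatrixInfo:
    name: str
    n: int                           # matrix dimension
    nnz: int                         # number of nonzeros
    symmetric: bool
    best_spmv_count: Optional[int]   # fewest SpMVs of any converging method, None if none converged


def is_selected(m: MatrixInfo) -> bool:
    """Apply the size, nonzero, symmetry, and convergence criteria."""
    if m.symmetric:
        return False
    if not (1_000 < m.n < 5_000_000):
        return False
    if not (m.nnz < 100_000_000):
        return False
    # At least one method has to converge within 2n SpMV.
    return m.best_spmv_count is not None and m.best_spmv_count <= 2 * m.n


# Hypothetical candidates, for illustration only.
candidates = [
    MatrixInfo("a", 10_000, 50_000, False, 1_200),   # passes every criterion
    MatrixInfo("b", 500, 2_000, False, 100),         # too small
    MatrixInfo("c", 10_000, 50_000, True, 300),      # symmetric
    MatrixInfo("d", 10_000, 50_000, False, None),    # no method converged
    MatrixInfo("e", 10_000, 50_000, False, 25_000),  # needs more than 2n SpMV
]
selected = [m.name for m in candidates if is_selected(m)]
```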

SLIDE 5

Experiment setup

libufget

C interface to access matrices at UFMC
Max Planck Institute for Dynamics of Complex Technical Systems

MAGMA

Accelerator-focused linear algebra software library
Dense and sparse linear algebra routines, solvers, eigensolvers
We choose: BiCGSTAB, CGS, QMR, IDR(2), IDR(4), IDR(8)
University of Tennessee

NVIDIA K40 GPU

1,682 GFlop/s (double precision)
12 GB; 288 GB/s (theoretical), 193 GB/s (experimentally)
CUDA v. 7.5

Solver setting

Solve: A x = b for b ≡ 1, starting with x ≡ 0
Relative residual stopping criterion: 10^-10 |b|
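
The solver setting can be made concrete with a short check of the relative residual criterion. This is a minimal sketch, assuming the 2-norm and a dense list-of-lists matrix for brevity; the helper names and the 2x2 system are illustrative, not taken from the study.

```python
import math


def norm2(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(vi * vi for vi in v))


def residual(A, x, b):
    """r = b - A x for a dense matrix A stored as a list of rows."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(A, b)]


def has_converged(A, x, b, rel_tol=1e-10):
    """Relative residual test: |b - A x| <= rel_tol * |b|."""
    return norm2(residual(A, x, b)) <= rel_tol * norm2(b)


# Hypothetical 2x2 system with the setting from the slide: b = 1, x0 = 0.
A = [[2.0, 0.0], [0.0, 4.0]]
b = [1.0, 1.0]
x0 = [0.0, 0.0]
x_exact = [0.5, 0.25]

start_ok = has_converged(A, x0, b)      # initial guess: residual equals b
final_ok = has_converged(A, x_exact, b) # exact solution: residual is zero
```

With x starting at zero the initial residual equals b, so the relative residual starts at 1 and has to drop by ten orders of magnitude before a solver counts as converged.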

SLIDE 6


Solver Robustness – The Convergence Metric

[Figure: matrix count (0 to 100) per solver (BiCGSTAB, CGS, QMR, IDR(2), IDR(4), IDR(8)), split into "Convergence - fastest solver" and "Convergence - not fastest solver"]

SLIDE 7

The Shotgun Approach

Barrett et al.: Algorithmic bombardment for the iterative solution of linear systems: A poly-iterative approach, Journal of Computational and Applied Mathematics 74, 1996.

Run multiple Krylov solvers simultaneously as poly-iterative method

Original work: poly-iterative solver with BiCGSTAB, QMR, CGS

IDR(s) structurally different, hard to combine in simultaneous fashion

SLIDE 8

Solver Orthogonality w.r.t. Problem Suitability

http://www.icl.utk.edu/~hanzt/solver_ortho/

Which methods to include in a multi-iterative solver?

SLIDE 9

The Shotgun Approach

Barrett et al.: Algorithmic bombardment for the iterative solution of linear systems: A poly-iterative approach, Journal of Computational and Applied Mathematics 74, 1996.

Run multiple Krylov solvers simultaneously as poly-iterative method

Original work: poly-iterative solver with BiCGSTAB, QMR, CGS

IDR(s) structurally different, hard to combine in simultaneous fashion

poly-iterative solver converges in 63 of 94 test cases (67%)

IDR(2) converges for 60 of 94 test cases (64%)
IDR(4) converges for 67 of 94 test cases (71%)
IDR(8) converges for 91 of 94 test cases (96%)

SLIDE 10

Performance – SpMV count and Runtime

SpMV count indicative of performance when using preconditioners

IDR(8) wins most cases in SpMV metric

[Figure: percentage of test matrices (0 to 100%) per solver (IDR(8), IDR(4), IDR(2), QMR, CGS, BiCGSTAB) for the target metrics SpMV and Runtime]

SLIDE 11

The Price of Robustness

IDR(8) solves many systems – but often there is a faster solver

Normalize execution times for each matrix to fastest solver

[Figure: runtime overhead (log scale, 10^0 to 10^1) per test matrix for BiCGSTAB, CGS, QMR, IDR(2), IDR(4), IDR(8)]

SLIDE 12

The Price of Robustness

IDR(8) solves many systems – but often there is a faster solver

Normalize execution times for each matrix to fastest solver

Take average over all converging configurations

[Figure: runtime relative to fastest method (0.5 to 2.5) for BiCGSTAB, CGS, QMR, IDR(2), IDR(4), IDR(8)]
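
The normalization and averaging described on this slide can be sketched as follows. The function names and the timing values are hypothetical and are not the study's measurements; a runtime of `None` is assumed to mark a solver that did not converge for that matrix.

```python
def normalize_to_fastest(runtimes):
    """Scale the runtimes of one matrix so the fastest converging solver is 1.0.

    runtimes maps solver name -> seconds, or None if the solver did not converge.
    Non-converging solvers are left out of the result.
    """
    converged = {s: t for s, t in runtimes.items() if t is not None}
    if not converged:
        return {}
    fastest = min(converged.values())
    return {s: t / fastest for s, t in converged.items()}


def average_overhead(per_matrix_runtimes):
    """Average each solver's normalized runtime over the matrices it converged on."""
    totals, counts = {}, {}
    for runtimes in per_matrix_runtimes:
        for solver, ratio in normalize_to_fastest(runtimes).items():
            totals[solver] = totals.get(solver, 0.0) + ratio
            counts[solver] = counts.get(solver, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}


# Hypothetical timings in seconds for two test matrices.
measurements = [
    {"BiCGSTAB": 1.0, "IDR(8)": 2.0, "CGS": None},
    {"BiCGSTAB": None, "IDR(8)": 3.0, "CGS": 1.5},
]
avg = average_overhead(measurements)
```

In this made-up data IDR(8) converges on both matrices but is never the fastest, so its average is above 1.0, while each of the other solvers is averaged only over the single matrix it converged on.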

SLIDE 13

Summary

IDR(s) is a very robust solver.

Robustness increases with shadow space dimension s.

IDR(8) solves 91 of 94 test problems (96% success).

For converging combinations, CGS, QMR, or BiCGSTAB are often faster.

On average, IDR(8) is less than twice as slow as the fastest method.

Future work

Relate solver success to the problem origins.

Enhance solvers with preconditioning.

Target other architectures (Xeon Phi, low-power & embedded devices).

The authors would like to acknowledge support from the U.S. Department of Energy, the German Research Foundation (DFG) through the Priority Program 1648, and NVIDIA. The authors would also like to thank Daniel B. Szyld for sharing his knowledge of Krylov methods.
