Efficiency of general Krylov methods on GPUs – An experimental study




SLIDE 1

Efficiency of general Krylov methods on GPUs – An experimental study

6th AsHES workshop May 26th, 2016, Chicago, USA

  • H. Anzt, M. Kreutzer, M. Koehler, G. Wellein, J. Dongarra

Piotr Luszczek

SLIDE 2

Solving large sparse linear systems on GPUs

  • Large variety of iterative methods
  • Krylov solvers work well for many problems
  • Efficiency depends on problem characteristics: eigenvalue distribution, diagonal dominance, definiteness
  • Black-box scenario: problem characteristics are not known.

http://blog.heltontool.com/category/tools/

SLIDE 3

The Shotgun Approach

Barrett et al.: Algorithmic bombardment for the iterative solution of linear systems: A poly-iterative approach, Journal of Computational and Applied Mathematics 74, 1996.

Run multiple Krylov solvers simultaneously as a poly-iterative method

Theoretical benefits:

  • Benefit from the fastest convergence
  • Drop solvers that break down

Computational benefits:

  • Runtime overhead small for solvers with similar structure
  • SpMM replaces SpMV to generate multiple Krylov subspaces
  • Interleaving global communication for low synchronization count
  • Enhanced fault-tolerance

Limitation: Solvers are required to have similar structure (SpMV/reduction)
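
The "SpMM replaces SpMV" point can be illustrated with a minimal pure-Python sketch. It is not the GPU implementation used in the study; the CSR layout, the function names `spmv` and `spmm`, and the small example matrix are assumptions made for illustration. Multiplying the matrix by a block of vectors reads each nonzero once and gives the same results as one SpMV per solver.

```python
def spmv(row_ptr, col_idx, vals, x):
    """y = A @ x for a matrix A stored in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y


def spmm(row_ptr, col_idx, vals, X):
    """Y = A @ X, where X is a list of rows with one entry per solver."""
    n_rows = len(row_ptr) - 1
    n_vecs = len(X[0])
    Y = [[0.0] * n_vecs for _ in range(n_rows)]
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a, x_row = vals[k], X[col_idx[k]]  # each nonzero is read once ...
            for j in range(n_vecs):            # ... and reused for every vector
                Y[i][j] += a * x_row[j]
    return Y


# Hypothetical 3x3 matrix [[4, 1, 0], [0, 3, 2], [1, 0, 5]] in CSR form.
row_ptr = [0, 2, 4, 6]
col_idx = [0, 1, 1, 2, 0, 2]
vals = [4.0, 1.0, 3.0, 2.0, 1.0, 5.0]

# Two vectors, e.g. the current Krylov vectors of two different solvers.
v1 = [1.0, 2.0, 3.0]
v2 = [0.5, -1.0, 2.0]
X = [[v1[i], v2[i]] for i in range(3)]

Y = spmm(row_ptr, col_idx, vals, X)
y1 = spmv(row_ptr, col_idx, vals, v1)
y2 = spmv(row_ptr, col_idx, vals, v2)
```

Column `j` of `Y` equals the SpMV result for vector `j`, so the blocked product can feed several solvers from a single pass over the matrix.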

SLIDE 4

Contribution

[Figure: two histograms of the test matrices: matrix count by number of nonzeros (10^4 to 10^7) and matrix count by matrix size (10^3 to 10^6).]

  • Run different Krylov methods on a large number of test matrices
  • Analyze with different target metrics: Convergence, SpMV, Runtime
  • Non-symmetric test matrices from University of Florida Matrix Collection
  • 1,000 < n < 5,000,000; nnz < 100,000,000
  • At least one of the considered methods converges within 2n SpMV
  • 94 non-symmetric test matrices in total
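
The selection criteria above can be written down as a small filter. This is a sketch under stated assumptions: the `MatrixInfo` record, its field names, and the candidate entries are hypothetical, and "within 2n SpMV" is read as "at most 2n SpMV operations".

```python
from dataclasses import dataclass


@dataclass
class MatrixInfo:
    # Hypothetical record describing one candidate matrix.
    name: str
    n: int                   # number of rows/columns
    nnz: int                 # number of nonzeros
    symmetric: bool
    spmv_to_converge: dict   # solver name -> SpMV count, or None if not converged


def is_selected(m):
    """Apply the size, nonzero, symmetry, and convergence criteria to one matrix."""
    if m.symmetric:
        return False
    if not (1_000 < m.n < 5_000_000 and m.nnz < 100_000_000):
        return False
    budget = 2 * m.n  # assumption: "within 2n SpMV" means count <= 2n
    return any(c is not None and c <= budget
               for c in m.spmv_to_converge.values())


# Hypothetical candidates, not taken from the actual collection.
candidates = [
    MatrixInfo("a", 2_000, 10_000, False, {"BiCGSTAB": 350, "CGS": None}),
    MatrixInfo("b", 500, 3_000, False, {"BiCGSTAB": 40}),        # too small
    MatrixInfo("c", 2_000, 10_000, True, {"BiCGSTAB": 100}),     # symmetric
    MatrixInfo("d", 3_000, 20_000, False, {"BiCGSTAB": None, "QMR": 7_000}),  # over budget
]
selected = [m.name for m in candidates if is_selected(m)]
```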

SLIDE 5

Experiment setup

libufget

  • C interface to access matrices at UFMC
  • Max Planck Institute for Dynamics of Complex Technical Systems

MAGMA

  • Accelerator-focused linear algebra software library
  • Dense and sparse linear algebra routines, solvers, eigensolvers
  • We choose: BiCGSTAB, CGS, QMR, IDR(2), IDR(4), IDR(8)
  • University of Tennessee

NVIDIA K40 GPU

  • 1,682 GFlop/s (double precision)
  • 12 GB; 288 GB/s (theoretical), 193 GB/s (experimentally)
  • CUDA v. 7.5

Solver setting

  • Solve: A x = b for b ≡ 1 starting with x ≡ 0
  • Relative residual stopping criterion: 10^-10 |b|
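
The solver setting can be sketched as a convergence check. This is a minimal illustration, not MAGMA code: the helper names are hypothetical, the 2-norm is an assumption (the slide only writes |b|), and the diagonal test matrix is made up.

```python
import math


def norm2(v):
    """Euclidean norm (assumed here; the slide does not name the norm)."""
    return math.sqrt(sum(x * x for x in v))


def residual(matvec, x, b):
    """r = b - A x, with A given as a function computing A @ x."""
    return [bi - ai for bi, ai in zip(b, matvec(x))]


def has_converged(matvec, x, b, rel_tol=1e-10):
    """Relative residual criterion: |b - A x| <= rel_tol * |b|."""
    return norm2(residual(matvec, x, b)) <= rel_tol * norm2(b)


# Hypothetical test problem: A = diag(2, 4, 5), b = all ones, x0 = all zeros.
diag = [2.0, 4.0, 5.0]
matvec = lambda x: [d * xi for d, xi in zip(diag, x)]
b = [1.0] * 3
x0 = [0.0] * 3
x_exact = [1.0 / d for d in diag]

start_converged = has_converged(matvec, x0, b)       # r0 = b, so not converged
final_converged = has_converged(matvec, x_exact, b)  # residual is (near) zero
```

With the zero initial guess the initial residual equals b, so the criterion asks for a reduction of the residual norm by ten orders of magnitude.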

SLIDE 6

Solver Robustness – The Convergence Metric

[Figure: bar chart of matrix count (0 to 100) per solver (BiCGSTAB, CGS, QMR, IDR(2), IDR(4), IDR(8)). Legend: convergence as fastest solver; convergence, but not fastest solver.]

SLIDE 7

The Shotgun Approach

Barrett et al.: Algorithmic bombardment for the iterative solution of linear systems: A poly-iterative approach, Journal of Computational and Applied Mathematics 74, 1996.

Run multiple Krylov solvers simultaneously as a poly-iterative method

  • Original work: poly-iterative solver with BiCGSTAB, QMR, CGS
  • IDR(s) structurally different, hard to combine in simultaneous fashion

SLIDE 8

Solver Orthogonality w.r.t. Problem Suitability

http://www.icl.utk.edu/~hanzt/solver_ortho/

  • Which methods to include in Multi-Iterative solver?

SLIDE 9

The Shotgun Approach

Barrett et al.: Algorithmic bombardment for the iterative solution of linear systems: A poly-iterative approach, Journal of Computational and Applied Mathematics 74, 1996.

Run multiple Krylov solvers simultaneously as a poly-iterative method

  • Original work: poly-iterative solver with BiCGSTAB, QMR, CGS
  • IDR(s) structurally different, hard to combine in simultaneous fashion
  • Poly-iterative solver converges in 63 of 94 test cases (67%)
  • IDR(2) converges for 60 of 94 test cases (64%)
  • IDR(4) converges for 67 of 94 test cases (71%)
  • IDR(8) converges for 91 of 94 test cases (96%)
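
One way to reason about such success counts is to treat each solver's converging test cases as a set. The sketch below assumes that a poly-iterative solver succeeds whenever at least one of its member solvers converges, so its success set is the union of the members' sets. The matrix ids and per-solver sets are hypothetical and are not the study's data.

```python
def poly_iterative_set(per_solver, members):
    """Union of the converging cases of all member solvers (assumed model)."""
    covered = set()
    for solver in members:
        covered |= per_solver[solver]
    return covered


def success_rate(converged, total):
    """Percentage of test cases for which a solver converges."""
    return 100.0 * len(converged) / total


# Hypothetical convergence sets over 10 test matrices with ids 0..9.
total = 10
per_solver = {
    "BiCGSTAB": {0, 1, 2, 3},
    "CGS": {1, 2, 4},
    "QMR": {0, 5},
    "IDR(8)": {0, 1, 2, 3, 4, 5, 6, 7},
}

poly = poly_iterative_set(per_solver, ["BiCGSTAB", "CGS", "QMR"])
poly_rate = success_rate(poly, total)
idr8_rate = success_rate(per_solver["IDR(8)"], total)
```

In this made-up data the combined solver covers more cases than any single member, yet a single robust solver still covers more, which mirrors the kind of comparison listed above.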

SLIDE 10

Performance – SpMV count and Runtime

  • SpMV count indicative for performance when using preconditioners
  • IDR(8) wins most cases in SpMV metric

[Figure: percentage of test matrices (0 to 100) won by each solver (IDR(8), IDR(4), IDR(2), QMR, CGS, BiCGSTAB) for the two target metrics SpMV and Runtime.]

SLIDE 11

The Price of Robustness

  • IDR(8) solves many systems – but often there is a faster solver
  • Normalize execution times for each matrix to fastest solver

[Figure: runtime overhead (logarithmic scale, 10^0 to 10^1) per test matrix for BiCGSTAB, CGS, QMR, IDR(2), IDR(4), IDR(8).]

SLIDE 12

The Price of Robustness

  • IDR(8) solves many systems – but often there is a faster solver
  • Normalize execution times for each matrix to fastest solver
  • Take average over all converging configurations

[Figure: runtime relative to fastest method (0.5 to 2.5) for BiCGSTAB, CGS, QMR, IDR(2), IDR(4), IDR(8).]
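
The normalization and averaging steps can be sketched as follows. This is an illustration under assumptions: the function names and timings are hypothetical, non-converging runs are marked with `None`, and "average" is taken to be the arithmetic mean.

```python
def relative_runtimes(times):
    """Normalize each matrix's runtimes to its fastest converging solver.

    times: {matrix: {solver: seconds, or None if the solver did not converge}}
    """
    rel = {}
    for matrix, per_solver in times.items():
        ok = {s: t for s, t in per_solver.items() if t is not None}
        if not ok:
            continue  # no solver converged for this matrix
        best = min(ok.values())
        rel[matrix] = {s: t / best for s, t in ok.items()}
    return rel


def average_overhead(rel):
    """Arithmetic mean of the relative runtime over converging cases only."""
    sums, counts = {}, {}
    for per_solver in rel.values():
        for s, r in per_solver.items():
            sums[s] = sums.get(s, 0.0) + r
            counts[s] = counts.get(s, 0) + 1
    return {s: sums[s] / counts[s] for s in sums}


# Hypothetical timings in seconds, not measurements from the study.
times = {
    "m1": {"CGS": 1.0, "IDR(8)": 2.0},
    "m2": {"CGS": None, "IDR(8)": 3.0},
    "m3": {"CGS": 4.0, "IDR(8)": 2.0},
}
rel = relative_runtimes(times)
avg = average_overhead(rel)
```

A value of 1.0 marks the fastest solver for a matrix. Averaging only over converging cases means a solver that fails often is not penalized for its failures in this metric.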

SLIDE 13

Summary

  • IDR(s) is a very robust solver.
  • Robustness increases with shadow space dimension s.
  • IDR(8) solves 91 of 94 test problems (96% success).
  • For converging combinations, CGS, QMR, or BiCGSTAB are often faster.
  • On average, IDR(8) is less than twice as slow as the fastest method.

Future work

  • Relate solver success to the problem origins.
  • Enhance solvers with preconditioning.
  • Target other architectures (Xeon Phi, low-power & embedded devices).

The authors would like to acknowledge support from the U.S. Department of Energy, the German Research Foundation (DFG) through the Priority Program 1648, and NVIDIA. The authors would also like to thank Daniel B. Szyld for sharing his knowledge of Krylov methods.