SLIDE 1

Improving Performance of Iterative Methods by Lossy Checkpointing

Dingwen Tao (University of California, Riverside), Sheng Di (Argonne National Laboratory), Xin Liang (University of California, Riverside), Zizhong Chen (University of California, Riverside), Franck Cappello (Argonne National Laboratory)

June 2018

SLIDES 2-7

Outline

Ø Introduction
  • Why do we need to checkpoint iterative methods?

Ø Background
  • Traditional checkpointing for iterative methods
  • Performance model of traditional checkpointing

Ø Our Designs
  • Lossy checkpointing for iterative methods
  • Performance model of our new checkpointing

Ø Theoretical Analysis
  • Impact of lossy checkpointing for different methods
  • Expected fault tolerance overhead

Ø Experimental Evaluation

SLIDE 8

Why Do We Need to Checkpoint Iterative Methods?

Ø Iterative methods are used to solve large, sparse linear systems
  • "Gaia" mission by the European Space Agency (ESA)
  • Produces a 5-parameter astrometric catalogue at microarcsecond accuracy for 1 billion stars in our Galaxy
  • Results in a very large, sparse linear system of 72 billion equations
  • Scientists use the LSQR iterative algorithm
  • Takes more than 54 hours on 2,048 BlueGene/Q nodes

Ø Even a single solve can take hours at scale
  • Largest symmetric indefinite sparse matrix from the UFL sparse matrix collection (KKT240, with 28 million linear equations)
  • 2,048 cores / 64 nodes on the Bebop cluster at Argonne
  • GMRES solver implemented in PETSc
  • Relative convergence tolerance of 10⁻⁶; execution time > 1 hour
  • The MTBF of the Sunway TaihuLight supercomputer can be one hour or less

[Figure: execution time (seconds) and number of iterations vs. number of processes (256, 512, 1,024, 2,048)]

SLIDE 9

Importance of Improving Checkpointing Performance of Iterative Methods

Ø Scientific simulations involving PDEs
  • Solve linear systems within each timestep
  • The sparse linear systems include most of the variables
  • E.g., 3D CFD problems from the Navier-Stokes equations
  • Semi-Implicit Method for Pressure-Linked Equations (SIMPLE) algorithm
  • 5 out of 9 fluid-flow variables need to be checkpointed in the iterative method

Ø Significantly improving the checkpointing performance of iterative methods significantly improves application performance

SLIDE 10

State-of-the-Art: Failure-Stop Failures

Checkpoint/Restart model
  • Periodic checkpointing to the file system is expensive
  • Difficult to scale up due to the I/O bandwidth bottleneck

[Figure: compute processes 1…k write their process states P1…Pk to stable storage]

SLIDE 11

State-of-the-Art: Failure-Stop Failures

Diskless checkpointing (J. Plank)
  • More scalable (pro)
  • 2x or more memory overhead (con) → reduces the usable memory and problem size
  • Only able to tolerate partial failures, not a whole-system failure (con)
  • Requires spare nodes and dedicated processors (con)

Two steps (see the sketch below):
  1. Checkpoint the state of each application processor in memory
  2. Encode these in-memory checkpoints and store the encodings in the checkpointing processors

[Figure: compute processes 1…k keep local checkpoints C1…Ck in memory; a checkpoint process stores the XOR encoding C, with C1 + … + Cn = C]
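A minimal sketch of the XOR encoding step (illustrative only, not Plank's or FTI's actual implementation): with equal-sized in-memory checkpoints, the encoding is their bitwise XOR, and any single lost checkpoint can be rebuilt from the encoding plus the survivors.

```python
import numpy as np

def xor_encode(checkpoints):
    """Encode equal-sized checkpoints (byte arrays) as their bitwise XOR."""
    encoding = np.zeros_like(checkpoints[0])
    for c in checkpoints:
        encoding ^= c
    return encoding

def xor_recover(encoding, survivors):
    """Rebuild the single missing checkpoint from the encoding and survivors."""
    missing = encoding.copy()
    for c in survivors:
        missing ^= c
    return missing

# Toy usage: 4 processes with 16-byte checkpoints; process 2 fails.
rng = np.random.default_rng(0)
ckpts = [rng.integers(0, 256, 16, dtype=np.uint8) for _ in range(4)]
C = xor_encode(ckpts)
rebuilt = xor_recover(C, [c for i, c in enumerate(ckpts) if i != 2])
assert np.array_equal(rebuilt, ckpts[2])
```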

SLIDES 12-13

Failures and Checkpointing

Optimized techniques to improve the scalability of checkpointing
  • Diskless checkpointing
  • Multi-level checkpointing
  • Asynchronous checkpointing
  • Lossless-compressed checkpointing
  • …

Question: Can we use lossy compression to (1) reduce checkpointing size and overhead and (2) improve performance and scalability?

→ Lossy checkpointing

Two important questions:
  (1) What is the impact of the lossy checkpointing data on execution performance?
  (2) Can lossy checkpointing actually improve overall performance (including C/R and lossy compression) in the context of restarting with altered data?

SLIDE 14

Outline (repeated as a section divider)

SLIDES 15-18

Traditional Checkpointing for Iterative Methods

Ø Checkpoint
  1. Checkpoint static variables (e.g., A, M) at the beginning
  2. Checkpoint dynamic variables (e.g., i, ρ, p, x) every several iterations

Ø Recovery
  1. Recover a correct computational environment
  2. Recover static variables
  3. Recover dynamic variables
  4. Recover recomputed variables (e.g., r)

Ø C/R cost is dominated by the dynamic variables
  • Static variables are checkpointed at most once, not along the iterations
  • Static variables: linear system matrix A and preconditioner M
  • A usually has 1x to 10x as many nonzeros as the dynamic variables' size (i.e., the vector size)
  • M is much sparser than A (e.g., block Jacobi, ILU)
  • Checkpoint frequency is usually much higher than the failure rate
  • MTTI = 4 hrs, T_ckpt = 18 s ⇒ checkpoint interval (Young's formula) = 12 mins (checked in the sketch below)
  • Checkpoint frequency is ~30x higher than the recovery frequency

⇒ Focus on reducing the C/R overhead of the dynamic variables in iterative methods by using lossy compressors.
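A quick check of the interval quoted above (a sketch; Young's formula sets the near-optimal checkpoint interval to sqrt(2 · T_ckpt · MTTI)):

```python
import math

def young_interval(t_ckpt_s: float, mtti_s: float) -> float:
    """Young's formula: near-optimal checkpoint interval in seconds."""
    return math.sqrt(2.0 * t_ckpt_s * mtti_s)

# Slide's example: MTTI = 4 hours, T_ckpt = 18 s.
interval = young_interval(18.0, 4 * 3600.0)
print(f"checkpoint interval ≈ {interval / 60:.0f} min")  # prints 12 min
```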

SLIDE 19

Outline (repeated as a section divider)

SLIDES 20-25

Theoretical Analysis of Checkpointing Overhead for Iterative Methods

  • Overall execution time = iteration time + checkpoint time + recovery/rollback time
  • Based on Young's formula, the checkpoint interval is k·T_it = sqrt(2·T_ckpt/λ), and the expected mean time of a rollback is half the interval: T_rb = k·T_it/2, where k is the number of iterations between two checkpoints
  • The overall time can then be simplified, yielding the fault tolerance overhead and the fault tolerance overhead in percent (assuming T_rec ≈ T_ckpt)
  • Example (see the sketch below): MTTI = 1 hour (λ ≈ 2.7×10⁻⁴ per second), checkpointing x for GMRES on 64 nodes (2,048 cores) on Bebop at ANL
  • T_ckpt = 120 s ⇒ expected FT overhead ~ 40%
  • T_ckpt = 25 s ⇒ expected FT overhead ~ 14% (significantly reduced!)
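A sketch of one consistent reading of this model (my assumptions: overhead is the checkpoint fraction T_ckpt/T_oc plus the expected rollback and recovery cost per failure, expressed relative to failure-free run time, with T_rec ≈ T_ckpt); it reproduces the slide's ~40% and ~14% figures:

```python
import math

LAMBDA = 1.0 / 3600.0  # failure rate for MTTI = 1 hour (per second)

def ft_overhead(t_ckpt: float, t_rec: float, lam: float = LAMBDA) -> float:
    """Expected FT overhead relative to failure-free run time:
    checkpointing + expected rollback (half interval) + recovery per failure."""
    t_oc = math.sqrt(2.0 * t_ckpt / lam)            # Young's formula
    f = t_ckpt / t_oc + lam * (t_oc / 2.0 + t_rec)  # fraction of total time
    return f / (1.0 - f)                            # relative to useful time

for t_ckpt in (120.0, 25.0):
    print(f"T_ckpt = {t_ckpt:5.1f} s -> FT overhead ≈ "
          f"{ft_overhead(t_ckpt, t_ckpt):.0%}")     # ≈ 41% and ≈ 14%
```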
SLIDE 26

Outline (repeated as a section divider)

SLIDES 27-30

Lossy Checkpointing Scheme for Iterative Methods

Ø The lossy checkpointing scheme for iterative methods has two steps
  • Compress the dynamic variables with a lossy compressor before each checkpoint
  • Decompress the compressed dynamic variables after each recovery

Ø Orthogonality-dependent iterative methods
  • For example, CG maintains a series of orthogonality relations:
  • p(k) and Aq(j), r(k) and p(j), r(k) and r(j) for any j < k
  • CG's superlinear convergence relies on these orthogonality relations
  • CG after lossy checkpointing may lose its superlinear convergence

Ø Restarted scheme
  • Periodically treat the current approximate solution as a new initial guess
  • Advantages:
  • Lower time and space complexity, e.g., GMRES ~ O(N²), where N is the number of iteration steps
  • A restarted scheme may not delay, and may even accelerate, convergence (jumping out of a local search)

Ø Lossy checkpointing with the restarted scheme (sketched below)
  • Checkpoint only the approximate solution x_i
  • Use the lossy-decompressed x_i as the new initial guess
  • Reconstruct the orthogonality relations and superlinear convergence
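A minimal sketch of this scheme under stated assumptions: `compress`/`decompress` below are a stand-in error-bounded lossy compressor (simple log-domain quantization that honors a pointwise relative error bound), not SZ's actual API, and `step_fn` stands for one iteration of the solver.

```python
import numpy as np

def compress(x: np.ndarray, eb: float):
    """Stand-in error-bounded lossy compressor (illustrative only, not SZ):
    quantize log-magnitudes so that |x_i - x'_i| <= eb * |x_i| pointwise."""
    step = np.log1p(eb)                 # relative bound -> uniform log-space step
    sign = np.sign(x)
    q = np.zeros(x.shape, dtype=np.int64)
    nz = x != 0
    q[nz] = np.round(np.log(np.abs(x[nz])) / step)
    return sign, q, step

def decompress(sign, q, step):
    return sign * np.exp(q * step)      # zeros stay zero (sign == 0)

def run_with_lossy_ckpt(step_fn, x0, ckpt_every, n_iters, eb=1e-4):
    """Restarted scheme: only the approximate solution x is checkpointed
    (compressed); after a failure, the decompressed x is the new initial guess."""
    x = x0.copy()
    ckpt = compress(x, eb)
    for i in range(1, n_iters + 1):
        x = step_fn(x)                  # one iteration of the method
        if i % ckpt_every == 0:
            ckpt = compress(x, eb)      # lossy checkpoint of x only
    return x, ckpt

# After a failure, restart with: x0_new = decompress(*ckpt)
```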

SLIDE 31

Outline (repeated as a section divider)

SLIDES 32-33

Performance Model of Lossy Checkpointing

  • Overall execution time = iteration time + lossy checkpoint time + restart/rollback time, including the mean number of extra iterations to convergence caused by one lossy recovery (denoted N′)
  • Similarly, the overall time can be simplified
  • The fault tolerance overhead of lossy checkpointing follows from the simplified overall time

SLIDE 34

Theoretical Analysis of N′ for Performance Gain

Goal: make the lossy checkpointing overhead lower than that of traditional checkpointing, i.e., FT overhead (lossy) < FT overhead (traditional); Theorem 1 gives the resulting condition on N′.

How to use Theorem 1?
  • For example, MTTI is 1 hour (λ ≈ 2.7×10⁻⁴ per second)
  • Lossy compression reduces T_ckpt from 120 seconds to 25 seconds
  • T_it = 1.2 s for GMRES (7,160 s with 5,875 iterations)
  • Based on Theorem 1, lossy checkpointing is worthwhile if N′ ≤ 500 (reproduced in the sketch below)
  • If one lossy recovery causes 500 (~9% of the total iterations) or fewer extra iterations to converge, lossy checkpointing improves overall performance
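A sketch reproducing the N′ ≤ 500 break-even point under the same assumed overhead model as before (T_rec ≈ T_ckpt; the extra per-failure cost of a lossy recovery is taken as N′·T_it):

```python
import math

LAMBDA = 1.0 / 3600.0   # MTTI = 1 hour, failures per second
T_IT = 1.2              # seconds per GMRES iteration (7160 s / 5875 iterations)

def overhead_fraction(t_ckpt: float, t_rec: float, lam: float = LAMBDA) -> float:
    """Fraction of total run time spent on checkpoints, rollbacks, and recovery,
    with the checkpoint interval set by Young's formula."""
    t_oc = math.sqrt(2.0 * t_ckpt / lam)
    return t_ckpt / t_oc + lam * (t_oc / 2.0 + t_rec)

f_trad = overhead_fraction(120.0, 120.0)
f_lossy = overhead_fraction(25.0, 25.0)

# Lossy checkpointing additionally pays lambda * N' * T_it per unit time for
# the extra iterations after each lossy recovery; break even when equal.
n_max = (f_trad - f_lossy) / (LAMBDA * T_IT)
print(f"lossy checkpointing is worthwhile if N' <= {n_max:.0f}")  # ≈ 500
```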

SLIDE 35

Outline (repeated as a section divider)

SLIDES 36-37

Impact Analysis of Lossy Checkpointing on Iterative Methods

  • Stationary iterative methods (the most classic)
  • Conjugate Gradient (CG) method (the most popular for SPD systems)
  • Generalized Minimum Residual (GMRES) method (the most general: asymmetric, indefinite, …; robust)

SLIDES 38-42

Impact Analysis of Lossy Checkpointing on Iterative Methods — Stationary Iteration

  • Stationary iterative methods: x^(t) = G·x^(t−1) + b
  • ||x^(t) − x*|| ≈ R^t · ||x^(0) − x*||
  • R is the spectral radius of G (its largest eigenvalue magnitude), with R < 1
  • x* is the exact solution; x^(0) is the initial guess
  • If a stationary method encounters a failure and restarts at the t-th iteration:
  • Lossy compression introduces an error vector e with relative error bound eb: |x_i^(t) − x̃_i^(t)| ≤ eb · |x_i^(t)| for 1 ≤ i ≤ n
  • Computation restarts from the altered vector x′^(t) = x^(t) + e
  • After a series of derivations, the upper bound of N′ is t − log_R(R^t + eb) =: f_eb(t) (evaluated numerically in the sketch below)
  • The expected upper bound of N′ falls into [f_eb(E(t)), f_eb(N)] = [E(t) − log_R(R^E(t) + eb), N − log_R(R^N + eb)]
  • Since f_eb(t) is a monotonic function, E[f_eb(t)] ≤ f_eb(N)
  • Since f_eb(t) is a convex function, E[f_eb(t)] ≥ f_eb(E(t)) (by Jensen's inequality)
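A small numeric sketch of the bound f_eb(t); the values R = 0.9 and eb = 10⁻⁴ are illustrative assumptions, not the slide's:

```python
import math

def extra_iters_bound(t: int, R: float, eb: float) -> float:
    """Upper bound f_eb(t) = t - log_R(R^t + eb) on the extra iterations N'
    caused by restarting a stationary method from a lossy checkpoint at step t."""
    return t - math.log(R**t + eb, R)

R, eb = 0.9, 1e-4  # assumed spectral radius and relative error bound
for t in (10, 50, 100):
    # The bound grows as failures strike later, when R^t approaches eb.
    print(t, round(extra_iters_bound(t, R, eb), 2))
```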
SLIDES 43-47

Impact Analysis of Lossy Checkpointing on Iterative Methods — GMRES

  • N′ is not easy to analyze for nonstationary methods (unlike stationary methods)
  • GMRES can converge to the same accuracy with no delay, and can sometimes even exhibit an acceleration, if the restarted residual is close to the previous residual
  • J. Langou, Z. Chen, G. Bosilca, and J. Dongarra. Recovery patterns for iterative methods in a parallel unstable environment. SIAM Journal on Scientific Computing, 30(1):102–116, 2007.

Why an acceleration is possible:
1. GMRES easily stagnates in practice
2. Lossy recovered data can form a new approximate solution with different spectral properties
3. A failure that happens during stagnation may help GMRES jump out of stagnation

  • An adaptive error-bound scheme for GMRES (see the sketch below)
  • Based on Theorem 3: if eb is set to ||r^(t)|| / (‖A‖ · ‖x^(t)‖), the new residual norm is close to the previous residual
  • Error-bounded lossy compressors (such as SZ and ZFP) can control the distortion of the data within eb · ||x^(t)||

⇒ N′ = 0 for GMRES
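A small sketch of the adaptive error-bound computation as reconstructed above (the expression eb = ‖r^(t)‖ / (‖A‖ · ‖x^(t)‖) is my reading of the garbled slide, and `norm_A` is an assumed estimate of ‖A‖):

```python
import numpy as np

def adaptive_error_bound(norm_A: float, x: np.ndarray, r: np.ndarray) -> float:
    """Pointwise relative error bound for compressing x at a GMRES checkpoint.
    With ||e|| <= eb * ||x||, choosing eb = ||r|| / (||A|| * ||x||) keeps
    ||A e|| <= ||r||, so the restarted residual stays close to the current one."""
    return np.linalg.norm(r) / (norm_A * np.linalg.norm(x))
```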

SLIDES 48-50

Impact Analysis of Lossy Checkpointing on Iterative Methods — CG

  • The extra convergence steps N′ for CG exhibit randomness (even if the restarted residual is kept close to the previous one)
  • We adopt an empirical evaluation of N′ (protocol sketched below)
  • Randomly select one iteration per execution at which to compress and decompress x
  • The average N′ varies from 10% to 25% of the total iterations for different eb

[Figure: average extra iterations (%) vs. relative error bound (10⁻³ to 10⁻⁶)]

⇒ N′ = 25%·N for CG if eb = 10⁻⁴
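A minimal sketch of that empirical protocol, under assumptions: a toy SPD system stands in for the real problems, and the lossy compress/decompress round-trip is modeled as a pointwise relative perturbation of x (|δ_i| ≤ eb·|x_i|) at one randomly chosen iteration, followed by a restart of CG.

```python
import numpy as np

def cg_iters(A, b, tol=1e-7, perturb_at=None, eb=1e-4, rng=None, max_it=10000):
    """Iterations for CG to reach ||r|| <= tol * ||b||; optionally emulate one
    lossy checkpoint/recovery round-trip at iteration `perturb_at`."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    target = tol * np.linalg.norm(b)
    for k in range(max_it):
        if np.sqrt(rs) <= target:
            return k
        if k == perturb_at:
            x *= 1.0 + eb * rng.uniform(-1.0, 1.0, x.size)  # lossy-recovered x
            r = b - A @ x                                    # recompute residual
            p = r.copy()
            rs = r @ r
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return max_it

rng = np.random.default_rng(0)
n = 500
Q = rng.standard_normal((n, n))
A = Q @ Q.T / n + np.eye(n)   # assumed toy SPD system, moderate conditioning
b = rng.standard_normal(n)
base = cg_iters(A, b)
hit = cg_iters(A, b, perturb_at=int(rng.integers(1, base)), rng=rng)
print(f"N' = {hit - base} extra iterations ({(hit - base) / base:.0%} of {base})")
```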

SLIDE 51

Performance Evaluation

Ø Experimental platform
  • 2,048 cores from 64 nodes (each node with 2 Intel Xeon E5-2695 v4 processors + 128 GB memory) on the Bebop cluster at Argonne
  • I/O and storage are typical of high-end supercomputer facilities

Ø Implementation
  • FTI checkpointing library (v0.9.5)
  • MPI-IO mode to write checkpoint data to the PFS
  • SZ lossy compression library (v1.4.12)
  • SZ has better compression performance on 1D data
  • Iterative methods implemented in PETSc (v3.8)

Ø Experimental setup
  • Jacobi for stationary methods, CG, and GMRES(30)
  • Default preconditioner (block Jacobi with ILU/IC)
  • eb = 10⁻⁴ for Jacobi and CG; adaptive eb for GMRES
  • Relative convergence tolerance of 10⁻⁴, 10⁻⁶, and 10⁻⁷ for Jacobi, GMRES, and CG, respectively
SLIDE 52

Linear System Configuration

  • Linear system arising from a 3D Poisson problem
  • The 3D Poisson matrix allows the problem size to grow as the scale increases
  • Weak-scaling study
  • Choose the largest problem size that can be held in memory using 64 nodes for GMRES(30)
  • One vector (double precision) of size 2160³ (~10¹⁰ elements) is ~80 GB

SLIDES 53-55

Lossy Checkpointing Performance

  • Experiment on single checkpoint/recovery performance
  • Fixed C/R frequency
  • Average time and size measured over the entire execution
  • Average checkpointing size:
  • Lossless compression reduces the checkpoint size to 1/6 at best
  • Lossy compression reduces the checkpoint size to 1/20 ~ 1/60

[Figure: average checkpoint/recovery time for Jacobi, GMRES, and CG]

Lossy checkpointing can significantly reduce C/R time!

SLIDE 56

Outline (repeated as a section divider)

SLIDES 57-60

Theoretical Performance Analysis

We can analyze the expected fault tolerance overhead based on our lossy checkpointing performance model:
  • For Jacobi, based on Theorem 2, 5.2 ≤ E(N′) ≤ 5.5 → N′ = 6
  • For GMRES, N′ = 0
  • For CG, N′ = 594 (25% of the total iterations), based on the empirical evaluation

[Figure: expected fault tolerance overhead vs. number of processes for traditional, lossless, and lossy checkpointing]

Observations
  • GMRES and Jacobi: lossy checkpointing is always better than lossless and traditional checkpointing
  • CG: lossy checkpointing is better than lossless / traditional checkpointing when the number of processes exceeds 1,536 / 768, respectively
  • The curves for lossy checkpointing rise much more slowly than those of the other two solutions → our proposed lossy checkpointing is expected to achieve more performance gain as the scale increases

SLIDE 61

Outline (repeated as a section divider)

SLIDES 62-66

Experimental Evaluation with Failures

Ø Failure injection
  • MTTI = 1 hour
  • Failure intervals follow an exponential distribution

Ø Checkpoint intervals
  • T_ckpt^trad ≈ 120 s, T_ckpt^lossless ≈ 70 s, T_ckpt^lossy ≈ 20 s
  • Based on the checkpointing times and Young's formula:
  • Interval^trad = 16 mins, Interval^lossless = 12 mins, Interval^lossy = 7 mins

[Figure: number of convergence iterations with lossy checkpointing for Jacobi, GMRES, and CG]
  • CG shows a convergence delay of 24.8% on average
  • Jacobi shows no delay
  • GMRES shows an acceleration

Ø Fault tolerance overhead
  • Jacobi: FT overhead reduced by 59% compared with traditional checkpointing and by 24% compared with lossless checkpointing
  • GMRES: FT overhead reduced by 70% and 58%, respectively
  • CG: FT overhead reduced by 23% and 20%, respectively

The experimental results are very close to the theoretical analysis!

SLIDE 67

Conclusion

Ø Propose an efficient lossy checkpointing scheme to improve C/R performance for iterative methods
Ø Formulate a lossy checkpointing performance model
Ø Quantify the tradeoff between the reduced overhead and the extra number of iterations
Ø Analyze the impact of lossy checkpointing on multiple iterative methods (stationary, GMRES, CG)
Ø Evaluate lossy checkpointing in an HPC environment with 2,048 cores
Ø Experiments show that our lossy checkpointing can significantly reduce the fault tolerance overhead in the presence of failures
  • Reduced by 23%~70% compared with traditional checkpointing and by 20%~58% compared with lossless checkpointing

Ø Future work
  • Explore lossy checkpointing in other scientific computational components (such as AMG, AMR, FFT)
  • Evaluate lossy checkpointing in real HPC simulations
  • Evaluate lossy checkpointing in other I/O-intensive and error-resilient applications

SLIDE 68

Acknowledgements

This research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S. Department of Energy's Office of Science and National Nuclear Security Administration, responsible for delivering a capable exascale ecosystem, including software, applications, and hardware technology, to support the nation's exascale computing imperative. The material was also supported by the National Science Foundation under Grant No. 1305624, No. 1513201, and No. 1619253.
SLIDE 69

Thank you!

Any questions are welcome!


Contact: Dingwen Tao (dingwen.tao@ieee.org)