Algorithm-Based Fault Tolerance for Linear Algebra - Thomas Herault (PowerPoint Presentation)

SLIDE 1

Motivation Algorithm-Based Fault Tolerance Implementing ABFT on MPI Conclusion

Algorithm-Based Fault Tolerance for Linear Algebra

Thomas Herault University of Tennessee Knoxville

http://icl.utk.edu/~herault/slides/AER-2013.pdf

AES 2013, Eugene, OR

herault@icl.utk.edu ABFT for Linear Algebra 1/ 66

SLIDE 2

Thanks

UT Knoxville: George Bosilca, Aurélien Bouteiller, Jack Dongarra; PhD students: Wesley Bland, Peng Du
INRIA & ENS Lyon: Yves Robert, Frédéric Vivien; PhD students: Guillaume Aupy, Dounia Zaidouni
Others: Franck Cappello (UIUC-Inria joint lab), Henri Casanova (Univ. Hawai‘i), Amina Guermouche (UIUC-Inria joint lab)

SLIDE 3

Outline

1. Motivation: Large Scale & Failures; Checkpointing Approaches; Coordinated Checkpointing – Young/Daly's approximation; Hierarchical checkpointing; Evaluation of Checkpointing Approaches for Large Scale Platforms; Conclusion on Checkpointing Approaches

2. Algorithm-Based Fault Tolerance: Principle; Example: LU Factorization; Step by Step ABFT LU

3. Implementing ABFT on MPI: Fault-Tolerance & the MPI Standard; User-Level Failure Mitigation; Checkpoint on Failures

4. Conclusion

SLIDE 6

Exascale platforms (courtesy Jack Dongarra)

Potential System Architecture with a cap of $200M and 20MW

Systems                      2011 (K computer)   2019                  Difference Today & 2019
System peak                  10.5 Pflop/s        1 Eflop/s             O(100)
Power                        12.7 MW             ~20 MW
System memory                1.6 PB              32 - 64 PB            O(10)
Node performance             128 GF              1, 2 or 15 TF         O(10) – O(100)
Node memory BW               64 GB/s             2 - 4 TB/s            O(100)
Node concurrency             8                   O(1k) or 10k          O(100) – O(1000)
Total node interconnect BW   20 GB/s             200 - 400 GB/s        O(10)
System size (nodes)          88,124              O(100,000) or O(1M)   O(10) – O(100)
Total concurrency            705,024             O(billion)            O(1,000)
MTTI                         days                O(1 day)              O(10)

SLIDE 7

Exascale platforms

Hierarchical

  • 10^5 or 10^6 nodes
  • Each node equipped with 10^4 or 10^3 cores (respectively)

Failure-prone

MTBF – one node:                1 year   10 years   120 years
MTBF – platform (10^6 nodes):   30 sec   5 min      1 h

More nodes ⇒ Shorter MTBF (Mean Time Between Failures)
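The scaling rule above can be checked with a few lines of Python. This is a sketch, assuming independent, identically distributed node failures (the standard assumption behind the slide's table):

```python
# Sketch (not from the slides): with N independent nodes, the platform MTBF
# is the node MTBF divided by N, which reproduces the table above.

YEAR = 365 * 24 * 3600  # seconds

def platform_mtbf(node_mtbf: float, num_nodes: int) -> float:
    """MTBF of the whole platform, assuming i.i.d. node failures."""
    return node_mtbf / num_nodes

N = 10**6  # nodes
for node_mtbf, label in [(1 * YEAR, "1 year"),
                         (10 * YEAR, "10 years"),
                         (120 * YEAR, "120 years")]:
    print(f"node MTBF {label:>9} -> platform MTBF {platform_mtbf(node_mtbf, N):8.0f} s")
# ~31.5 s, ~315 s (5 min), and ~3784 s (about 1 h): the three columns of the table.
```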

SLIDE 8

Exascale platforms

Exascale = Petascale ×1000

SLIDE 10

Coordinated checkpointing protocols

Coordinated checkpoints over all processes; global restart after a failure.

[Figure: processes P0–P2 exchanging messages m1–m5, with a coordinated checkpoint line]

  • No risk of cascading rollbacks
  • No need to log messages
  • All processors need to roll back

SLIDE 11

Message logging protocols

Message payload logging: in sender memory. Event logging: in stable memory (replicated). Restart only the failed processes.

[Figure: processes P0–P2 exchanging messages m1–m5]

  • No cascading rollbacks
  • Small number of processes to roll back
  • Memory occupation (logged payloads)
  • Overhead

SLIDE 12

Hierarchical protocols

Clusters of processes: coordinated checkpointing within clusters, message logging between clusters. Only processors from group(s) with failed process(es) need to roll back.

[Figure: processes P0–P3 exchanging messages m1–m5, grouped into clusters]

Need to log inter-group message payloads

  • Slows down failure-free execution
  • Increases checkpoint size/time

Avoids logging intra-group message payloads
Faster re-execution with logged messages

SLIDE 13

Which checkpointing protocol to use?

Coordinated checkpointing

  • No risk of cascading rollbacks
  • No need to log messages
  • All processors need to roll back
  • Rumor: may not scale to very large platforms

Hierarchical checkpointing

Need to log inter-group messages

  • Slows down failure-free execution
  • Increases checkpoint size/time

Only processors from the failed group need to roll back
Faster re-execution with logged messages
Rumor: should scale to very large platforms

SLIDE 14

Periodical Checkpointing: what Period to use?

Intuition

Short period ⇒ small risk of losing work, but more overhead
Long period ⇒ low overhead, but high risk of losing work

Optimal Period Computation

Model the Waste as a function of T, µ_p, etc.; find the minimum of this function:

dWaste/dT = 0

SLIDE 16

Checkpointing cost

[Figure: timeline; the first chunk is computed, then checkpointed, while the second chunk is processed afterwards. Legend: time spent working, time spent checkpointing]

Blocking model: checkpointing blocks all computations

SLIDE 17

Checkpointing cost

[Figure: timeline; checkpointing the first chunk overlaps with processing the second chunk. Legend: time spent working, time spent checkpointing]

Non-blocking model: checkpointing has no impact on computations (e.g., first copy state to RAM, then copy RAM to disk)

SLIDE 18

Checkpointing cost

[Figure: timeline; computation continues at reduced speed while the first chunk is checkpointed. Legend: time spent working, time spent checkpointing, time spent working with slowdown]

General model: checkpointing slows computations down: during a checkpoint of duration C, the same amount of computation is done as during a time αC without checkpointing (0 ≤ α ≤ 1)

SLIDE 19

Waste for Coordinated Checkpointing

[Figure: timeline for processes P0–P3: period T, checkpoint time C, downtime D, recovery R, lost work T_lost, re-executed slowed-down work αC]

Re-Exec: Δ − T = T_lost + αC
Expectation: E[T_lost] = (T − C)/2
Time lost per failure: D + R + (T − C)/2 + αC

SLIDE 20

Waste due to failures

Failure in the computation phase (probability (T − C)/T):

  Re-Exec_coord-fail-in-work = αC + (T − C)/2

Failure in the checkpointing phase (probability C/T):

  Re-Exec_coord-fail-in-checkpoint = αC + (T − C) + C/2

Re-Exec_coord = ((T − C)/T) · ((T − C)/2 + αC) + (C/T) · ((T − C) + C/2 + αC) = αC + T/2
SLIDE 21

Total waste

Waste[FF] = (1 − α)C / T

Waste[fail] = (1/µ) · (D + R + αC + T/2)

Waste = 1 − (1 − Waste[FF]) · (1 − Waste[fail]) = Waste[FF] + Waste[fail] − Waste[FF] · Waste[fail]

Optimal period: T* = √(2(1 − α)(µ − (D + R))C)
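The closed form can be sanity-checked numerically. A sketch with illustrative parameter values (not taken from the slides): the first-order optimum T* should be within a hair of the best period found by brute-force scan.

```python
import math

# Sketch: total waste as a function of the period T, and a check that the
# closed-form T* is (approximately) where the waste is minimal.

def waste(T, C, D, R, mu, alpha):
    waste_ff = (1 - alpha) * C / T                  # failure-free waste
    waste_fail = (D + R + alpha * C + T / 2) / mu   # failure-induced waste
    return waste_ff + waste_fail - waste_ff * waste_fail

C, D, R, mu, alpha = 600.0, 60.0, 600.0, 86400.0, 0.3  # seconds (illustrative)
t_star = math.sqrt(2 * (1 - alpha) * (mu - (D + R)) * C)

# Scan a grid of periods: none should beat T* by more than a first-order error.
best = min((waste(T, C, D, R, mu, alpha), T) for T in range(int(C) + 1, int(mu)))
assert waste(t_star, C, D, R, mu, alpha) <= best[0] + 1e-4
print(f"T* = {t_star:.0f} s, waste(T*) = {waste(t_star, C, D, R, mu, alpha):.4f}")
```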

SLIDE 23

Validity of the approach

Technicalities

E(N_faults) = Time_final / µ and E(T_lost) = D + R + T/2, but the expectation of a product is not the product of expectations (the random variables are not independent here)

Many constraints to enforce (C ≤ T to get Waste[FF] ≤ 1; D + R ≤ µ and a bound on T to get Waste[fail] ≤ 1; but µ = µ_ind / p is too small for large p, regardless of µ_ind, ...)

Waste[fail] is accurate only when two or more faults do not take place within the same period

The optimal period √(2(µ − (D + R))C) may not belong to the admissible interval [C, γµ]

The approach is surprisingly robust, as shown by simulations

SLIDE 25

Hierarchical checkpointing

  • Processors partitioned into G groups
  • Each group includes q processors
  • Inside each group: coordinated checkpointing in time C(q)
  • Inter-group messages are logged

SLIDE 26

Waste (during computation) of Hierarchical Checkpointing

[Figure: timeline for groups G1–G5: period T, total checkpoint time G·C, downtime D, recovery R, lost work T_lost. Legend: working, working with slowdown, checkpointing, re-executing slowed-down work]

Expected Re-Exec for group g: (T − G·C)/2 + α(G − g + 1)C

Averaging over groups:

Re-Exec_comp = (1/G) · Σ_{g=1}^{G} [ (T − G·C(q))/2 + α(G − g + 1)·C(q) ] = (T − G·C(q))/2 + α·((G + 1)/2)·C(q)
SLIDE 27

Total waste

Waste[FF] = (T − Work)/T with Work = T − (1 − α)·G·C(q)

Waste[fail] = (1/µ) · (D(q) + R(q) + Re-Exec) with
Re-Exec = ((T − G·C(q))/T) · Re-Exec_comp + (G·C(q)/T) · Re-Exec_ckpt

Waste = Waste[FF] + Waste[fail] − Waste[FF] · Waste[fail]

Minimize Waste subject to: G·C(q) ≤ T (by construction) and T ≤ γµ (capping the period as before)

Gets complicated! Use computer algebra software
SLIDE 28

Accounting for message logging: impact on work

Logging messages slows down execution:
⇒ Work becomes λ·Work, where 0 < λ < 1; typical value: λ ≈ 0.98

Re-execution after a failure is faster:
⇒ Re-Exec becomes Re-Exec/ρ, where ρ ∈ [1, 2]; typical value: ρ ≈ 1.5

Waste[FF] = (T − λ·Work)/T

Waste[fail] = (1/µ) · (D(q) + R(q) + Re-Exec/ρ)

SLIDE 29

Accounting for message logging: Impact on checkpoint size

Inter-group messages are logged continuously ⇒ checkpoint size increases with the amount of work executed before a checkpoint

C0(q): checkpoint size of a group without message logging

C(q) = C0(q) · (1 + β·Work) ⇔ β = (C(q) − C0(q)) / (C0(q) · Work)

With Work = λ(T − (1 − α)·G·C(q)):

C(q) = C0(q) · (1 + βλT) / (1 + G·C0(q)·βλ(1 − α))

The constraint G·C(q) ≤ T translates into G·C0(q)·βλα ≤ 1 and T ≥ G·C0(q) / (1 − G·C0(q)·βλα)
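The closed form for C(q) can be verified against its defining fixed-point equation. A sketch with illustrative values (chosen to satisfy the feasibility constraint):

```python
# Sketch: C(q) is defined implicitly by C = C0*(1 + beta*Work) with
# Work = lambda*(T - (1-alpha)*G*C).  Check that the closed form solves it.

def c_of_q(C0, beta, lam, alpha, G, T):
    return C0 * (1 + beta * lam * T) / (1 + G * C0 * beta * lam * (1 - alpha))

C0, beta, lam, alpha, G, T = 64.0, 0.0001, 0.98, 0.3, 100, 30000.0
C = c_of_q(C0, beta, lam, alpha, G, T)

work = lam * (T - (1 - alpha) * G * C)
assert abs(C - C0 * (1 + beta * work)) < 1e-6   # the fixed point holds
assert G * C <= T                               # feasibility: G*C(q) <= T
print(f"C(q) = {C:.2f} s (vs C0 = {C0} s without message logging)")
```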

SLIDE 31

Three case studies

Coord-IO: coordinated approach; C = C_Mem = Mem / b_io, where Mem is the memory footprint of the application

Hierarch-IO: several (large) groups, I/O-saturated ⇒ groups checkpoint sequentially; C0(q) = C_Mem / G = Mem / (G·b_io)

Hierarch-Port: very large number of smaller groups, port-saturated ⇒ some groups checkpoint in parallel; groups of q_min processors, where q_min·b_port ≥ b_io
SLIDE 32

Computing β for Matrix Product

  • 3 matrices of size n × n partitioned across a p × p processor grid
  • Mem = 24n² (in bytes)
  • Each processor holds three matrix blocks of size b = n/p
  • At each iteration (Cannon's algorithm):
      • shift one block vertically and one horizontally
      • perform a matrix product
  • (Parallel) work for one iteration: Work = 2b³ / s_p

1. Hierarch-IO (one group per grid row): β = s_p / (6b³)
2. Hierarch-Port (groups of size q_min): β = s_p / (3b³)
SLIDE 33

Four platforms: basic characteristics

Name            Total cores     Processors (p_total)   Cores per processor   Memory per processor   I/O bandwidth b_io (Read / Write)   b_port per processor (Read/Write)
Titan           299,008         18,688                 16                    32 GB                  300 GB/s / 300 GB/s                 20 GB/s
K-Computer      705,024         88,128                 8                     16 GB                  150 GB/s / 96 GB/s                  20 GB/s
Exascale-Slim   1,000,000,000   1,000,000              1,000                 64 GB                  1 TB/s / 1 TB/s                     200 GB/s
Exascale-Fat    1,000,000,000   100,000                10,000                640 GB                 1 TB/s / 1 TB/s                     400 GB/s

SLIDE 34

Four platforms: Matrix-Product

Name            Scenario        G (C(q))             β for Matrix-Product
Titan           Coord-IO        1 (2,048 s)          /
                Hierarch-IO     136 (15 s)           0.0004280
                Hierarch-Port   1,246 (1.6 s)        0.0008561
K-Computer      Coord-IO        1 (14,688 s)         /
                Hierarch-IO     296 (50 s)           0.001113
                Hierarch-Port   17,626 (0.83 s)      0.002227
Exascale-Slim   Coord-IO        1 (64,000 s)         /
                Hierarch-IO     1,000 (64 s)         0.001013
                Hierarch-Port   200,000 (0.32 s)     0.002026
Exascale-Fat    Coord-IO        1 (64,000 s)         /
                Hierarch-IO     316 (217 s)          0.0003203
                Hierarch-Port   33,333 (1.92 s)      0.0006407

SLIDE 36

Plotting formulas

[Plot: Waste as a function of node MTBF µ in years (logscale), for Titan, K-Computer, and Exascale (Fat or Slim); at Exascale, Waste = 1 for all MTBF values!]

SLIDE 37

Plotting formulas – Exascale with better checkpointing technology

[Plot: Waste as a function of node MTBF µ in years (logscale), for Exascale-Slim and Exascale-Fat, with improved checkpoint times C = 1000 and C = 100]

SLIDE 39

Conclusion on Checkpointing Approaches

Checkpointing Approaches: positive

  • Generic technique, algorithm-agnostic
  • Used not only for FT (e.g., post-mortem simulation analysis, parameter sweeping during simulations)

Checkpointing Approaches: negative

  • Projections show that the I/O bottleneck becomes crucial
  • Calls for a O(100 − 1000) improvement in (process) checkpoint time, through a combination of:
      • Better I/O backbone
      • Distributed storage on nodes (RAM, SSD)
      • Incremental checkpointing (ckpt every 100 instr. ⇒ loss of optimal ckpt interval)
      • User-guided checkpointing (ckpt relevant data only ⇒ loss of transparency, especially at restart)
SLIDE 42

Principle of ABFT

[Figure: input data A with checksum C = Cksum(A); the operation produces result B with checksum C′ = Cksum(B)]

Principle of ABFT

  • Input data (A) and result (B) are distributed
  • The operation preserves the checksum properties
  • Apply the operation on data + checksum (A·C)
  • In case of failure, recover the missing data by inversion of the checksum

SLIDE 44

Example: LU Factorization

P = 2, Q = 3, M = 10 × mb, N = 9 × nb

[Figure: 2D block-cyclic distribution of the matrix over a 2 × 3 process grid]

Recursive block LU factorization (ScaLAPACK)

  • Want to solve Ax = b (hard)
  • Transform A into L·U
  • Solve Ly = Pb, then Ux = y

SLIDE 45

Example: LU Factorization

GETF2: factorize a column block; TRSM, GEMM: update the trailing matrix

[Figure: panel factorization and trailing-matrix update on the 2D block-cyclic distribution]

SLIDE 49

Single Failure ⇒ many losses

[Figure: 2D block-cyclic distribution; the blocks of a single failed process are scattered across the whole matrix]

Recursive block LU factorization (ScaLAPACK), 2D block-cyclic distribution: a single failure ⇒ many blocks of data lost

SLIDE 50

Checksum Block Columns

[Figure: matrix in 2D block-cyclic distribution with appended checksum block columns]

How to Recover Missing Data: Checksum Block Columns

  • Q blocks of the same block row are summed up: a new checksum block is created
  • ⌈N/Q⌉ extra block columns

Reverse Neighboring Checksum Storage: checksum blocks are stored in the 2D block-cyclic distribution
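In block terms, the recovery itself is a subtraction. A sketch with tiny 2×2 lists standing in for the mb×nb tiles (hypothetical data):

```python
# Sketch: Q data blocks of a block row are summed into one checksum block;
# a single lost block is recovered as checksum minus the surviving blocks.

Q = 3
blocks = [[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 0], [1, 2]]]  # hypothetical tiles

def block_sum(bs):
    return [[sum(b[i][j] for b in bs) for j in range(2)] for i in range(2)]

checksum = block_sum(blocks)            # the extra checksum block column

lost = 1                                # index of the block on the failed process
survivors = [b for k, b in enumerate(blocks) if k != lost]
partial = block_sum(survivors)
recovered = [[checksum[i][j] - partial[i][j] for j in range(2)] for i in range(2)]

assert recovered == blocks[lost]
print("recovered block:", recovered)
```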

SLIDE 51

Checksum Block Columns

[Figure: matrix with checksum block columns]

New issue: how to maintain the checksum block columns during the computation, and despite failures?

SLIDE 52

Update of the checksums

[Figure: trailing-matrix update applied to both the matrix and its checksum block columns]

Update of the Checksum Columns

  • Trailing matrix update operations (TRSM, GEMM) preserve the checksum property (affine transformation)
  • Apply the same parallel operation on a larger scope of data
  • Reduce the scope as the panel progresses
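Why the trailing update preserves the checksum: a GEMM update is an affine transformation of the columns, so applying it to the matrix extended with a checksum column keeps that column equal to the row sums. A sketch with a hypothetical rank-1 update standing in for the block operation:

```python
# Sketch: a trailing update A <- A - u * v^T, applied to the matrix *extended
# with a checksum column* (and to v extended with its own sum), keeps the
# checksum column valid, because the update is affine in the columns.
# (Hypothetical small data; the real algorithm works on blocks.)

rows, cols = 3, 4
A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 8, 7, 6]]
u = [1, 2, 3]            # column of the panel (L part)
v = [4, 3, 2, 1]         # row of the panel (U part)

# Extend A and v with checksums (sums across columns):
A_ext = [row + [sum(row)] for row in A]
v_ext = v + [sum(v)]

# Apply the same update to the extended data:
for i in range(rows):
    for j in range(cols + 1):
        A_ext[i][j] -= u[i] * v_ext[j]

# The checksum column is still the sum of the data columns:
for row in A_ext:
    assert row[-1] == sum(row[:-1])
print("checksum column still valid after the update")
```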

SLIDE 53

Update of the checksums

[Figure: the checksum blocks covering the current panel become stale]

Part of the checksum is corrupted: the GETF2 operation does not update the checksum

SLIDE 54

Update of the checksums

[Figure: updating the checksum after each panel factorization]

Updating the corrupted part of the checksum after every panel is too slow (only P processors work on the panel ⇒ (Q − 1)·P processors wait)

SLIDE 55

Update of the checksums

[Figure: transactional update of the checksum columns]

Work in a transaction: locally copy the block column of the panel; do Q panels and updates; only then update the checksum and discard the panel copies

herault@icl.utk.edu ABFT for Linear Algebra 42/ 66

slide-56
SLIDE 56

Motivation Algorithm-Based Fault Tolerance Implementing ABFT on MPI Conclusion

Update of the checksums

1 1 1 1 1 3 2 3 2 3 2 3 2 3 2 5 4 5 4 5 4 5 4 5 4 1 1 1 1 1 3 2 3 2 3 2 3 2 3 2 5 4 5 4 5 4 5 4 5 4 1 1 1 1 1 3 2 3 2 3 2 3 2 3 2 5 4 5 4 5 4 5 4 5 4 1 1 1 1 1 3 2 3 2 3 2 3 2 3 2 5 4 5 4 5 4 5 4 5 4 1 1 1 1 1 3 2 3 2 3 2 3 2 3 2 5 4 5 4 5 4 5 4 5 4 1 1 1 1 1 3 2 3 2 3 2 3 2 3 2 5 4 5 4 5 4 5 4 5 4 1 1 1 1 1 3 2 3 2 3 2 3 2 3 2 5 4 5 4 5 4 5 4 5 4 1 1 1 1 1 3 2 3 2 3 2 3 2 3 2 5 4 5 4 5 4 5 4 5 4

Update of the Checksum Columns Work in a transaction. Locally copy the block column of the panel, do Q panels and updates then only update the checksum and discard the panel copies.

herault@icl.utk.edu ABFT for Linear Algebra 42/ 66

slide-57
SLIDE 57

Motivation Algorithm-Based Fault Tolerance Implementing ABFT on MPI Conclusion

Update of the checksums

  • 1

1 1 1 1 3 2 3 2 3 2 3 2 3 2 5 4 5 4 5 4 5 4 5 4 1 1 1 1 1 3 2 3 2 3 2 3 2 3 2 5 4 5 4 5 4 5 4 5 4 1 1 1 1 1 3 2 3 2 3 2 3 2 3 2 5 4 5 4 5 4 5 4 5 4 1 1 1 1 1 3 2 3 2 3 2 3 2 3 2 5 4 5 4 5 4 5 4 5 4 1 1 1 1 1 3 2 3 2 3 2 3 2 3 2 5 4 5 4 5 4 5 4 5 4 1 1 1 1 1 3 2 3 2 3 2 3 2 3 2 5 4 5 4 5 4 5 4 5 4 1 1 1 1 1 3 2 3 2 3 2 3 2 3 2 5 4 5 4 5 4 5 4 5 4 1 1 1 1 1 3 2 3 2 3 2 3 2 3 2 5 4 5 4 5 4 5 4 5 4

Update of the Checksum Columns: work in a transaction. Locally copy the block columns of the panel, perform Q panels and their updates, and only then update the checksum and discard the panel copies.

SLIDE 58

Failure Handling

[Figure: block-cyclic matrix layout on the process grid, with checksum block columns]

First Issue: Checksum Integrity. If checksum blocks and the data blocks they cover fail together, the data is not recoverable.
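The issue can be illustrated with a toy sum checksum (a sketch where scalars stand in for the block columns of the talk; the function names are invented for the example):

```python
# Toy model of a sum checksum: one checksum "block" covers a row of data
# "blocks" (scalars stand in for matrix blocks).

def make_checksum(blocks):
    """Checksum block = sum of the covered data blocks."""
    return sum(blocks)

def recover_one(blocks, checksum, lost_index):
    """Recover a single lost data block from the survivors and the checksum."""
    survivors = [b for i, b in enumerate(blocks) if i != lost_index]
    return checksum - sum(survivors)

data = [3.0, 1.0, 4.0, 1.0, 5.0]
c = make_checksum(data)

# One data block lost: recoverable by inverting the checksum.
recovered = recover_one(data, c, lost_index=2)
assert recovered == data[2]

# But if a data block and the checksum fail together, the survivors give
# one equation with two unknowns: nothing can be inverted, the data is gone.
```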

SLIDE 59

Failure Handling

[Figure: block-cyclic matrix layout on the process grid, with duplicated checksum block columns]

First Issue: Duplication (affine combination) of the Checksum Columns. Groups of checksum columns are updated together; the scope of each update operation is reduced by the size of the group. This uses (F + 1)⌈N/Q⌉ additional block columns.
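Duplicated, affinely combined checksum columns act like extra independent equations: with two weighted checksums, any two simultaneous losses remain solvable. A minimal sketch with scalars standing in for blocks (all names and weights are invented for the illustration):

```python
# Two independent weighted checksums c1 = sum(w1[i]*x[i]), c2 = sum(w2[i]*x[i])
# allow the recovery of any two lost blocks by solving a 2x2 linear system.
from fractions import Fraction

def checksums(blocks, weights_rows):
    """One checksum per weight row (exact arithmetic via Fraction)."""
    return [sum(Fraction(w) * Fraction(b) for w, b in zip(ws, blocks))
            for ws in weights_rows]

def recover_two(blocks, weights_rows, sums, lost):
    """Solve for two lost blocks i, j given the surviving blocks."""
    i, j = lost
    rhs = []
    for ws, c in zip(weights_rows, sums):
        # Move the surviving terms to the right-hand side.
        known = sum(Fraction(ws[k]) * Fraction(blocks[k])
                    for k in range(len(blocks)) if k not in (i, j))
        rhs.append(c - known)
    a, b = Fraction(weights_rows[0][i]), Fraction(weights_rows[0][j])
    c2, d = Fraction(weights_rows[1][i]), Fraction(weights_rows[1][j])
    det = a * d - b * c2
    xi = (rhs[0] * d - b * rhs[1]) / det   # Cramer's rule
    xj = (a * rhs[1] - rhs[0] * c2) / det
    return xi, xj

data = [3, 1, 4, 1, 5]
weights = [[1, 1, 1, 1, 1],   # plain sum
           [1, 2, 3, 4, 5]]   # weighted sum, distinct weights
c = checksums(data, weights)
xi, xj = recover_two(data, weights, c, lost=(1, 3))
assert (xi, xj) == (data[1], data[3])
```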

SLIDE 60

Checksum Validity

[Figure: block-cyclic matrix layout on the process grid, with checksum block columns]

Second Issue: Checksum Validity. Failures happen at the worst times: in the middle of a Q-panel, the checksum is lost for the panel blocks; in the middle of an update, the checksum blocks are not yet updated.
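The validity problem is simply a broken invariant: between the moment an update hits the data and the moment the checksum column is updated, the checksum no longer covers the data. A minimal sketch (scalars stand in for blocks):

```python
def checksum_valid(blocks, checksum):
    """The ABFT invariant: the checksum equals the sum of the covered blocks."""
    return sum(blocks) == checksum

data = [3.0, 1.0, 4.0]
c = sum(data)                        # invariant holds: c covers the data
assert checksum_valid(data, c)

data[1] += 10.0                      # an update is applied to the data...
assert not checksum_valid(data, c)   # ...but not yet to the checksum: a
                                     # failure here cannot be repaired from c

c += 10.0                            # deferred checksum update (end of Q-panel)
assert checksum_valid(data, c)
```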

SLIDE 61

Outline

1. Motivation: Large Scale & Failures; Checkpointing Approaches; Coordinated Checkpointing – Young/Daly’s approximation; Hierarchical checkpointing; Evaluation of Checkpointing Approaches for Large Scale Platforms; Conclusion on Checkpointing Approaches
2. Algorithm-Based Fault Tolerance: Principle; Example: LU Factorization; Step by Step ABFT LU
3. Implementing ABFT on MPI: Fault-Tolerance & the MPI Standard; User-Level Failure Mitigation; Checkpoint on Failures
4. Conclusion

SLIDE 62

Step by Step ABFT-LU

[Figure: block-cyclic matrix layout with checksum block columns (state at this step)]

Step 0: Begin Transaction

SLIDE 63

Step by Step ABFT-LU

[Figure: block-cyclic matrix layout with checksum block columns (state at this step)]

Step 1: Create a local copy of the Q-panel and the corresponding checksum blocks.

SLIDE 64

Step by Step ABFT-LU

[Figure: block-cyclic matrix layout with checksum block columns (state at this step)]

Step 2: Apply the panels, one after the other.

SLIDE 65

Step by Step ABFT-LU

[Figure: block-cyclic matrix layout with checksum block columns (state at this step)]

Step i: Failure Happens. A failure happens during a Q-panel, after some updates have been applied.

SLIDE 66

Step by Step ABFT-LU

[Figure: block-cyclic matrix layout with checksum block columns (state at this step)]

Step i+1: Recover Missing Data. Using the redundancy in the checksum, recover the lost checksum blocks.

SLIDE 67

Step by Step ABFT-LU

[Figure: block-cyclic matrix layout with checksum block columns (state at this step)]

Step i+2: Recover Missing Data. Using the valid part of the checksums, recover the “stable” data by inverting the checksum.

SLIDE 68

Step by Step ABFT-LU

[Figure: block-cyclic matrix layout with checksum block columns (state at this step)]

Step i+2’: Recover Missing Data. Using the valid part of the checksums, recover the “stable” data by inverting the checksum, in particular for the current Q-panel in the transaction copy.

SLIDE 69

Step by Step ABFT-LU

[Figure: block-cyclic matrix layout with checksum block columns (state at this step)]

Step i+3: Roll Back the Q-panel. Restore the Q-panel from the local copy.

SLIDE 70

Step by Step ABFT-LU

[Figure: block-cyclic matrix layout with checksum block columns (state at this step)]

Step i+4: Redo the Q-panels, up to the panel that was subject to the failure, applying only the new updates, both to the data and to the spanning checksum columns.

SLIDE 71

Step by Step ABFT-LU

[Figure: block-cyclic matrix layout with checksum block columns (state at this step)]

Step i+5: Redo the Q-panels. This restores the checksum property.

SLIDE 72

Step by Step ABFT-LU

[Figure: block-cyclic matrix layout with checksum block columns (state at this step)]

Step i+6: Compute the Missing Data. Invert the checksum on the remaining blocks.

SLIDE 73

Step by Step ABFT-LU

[Figure: block-cyclic matrix layout with checksum block columns (state at this step)]

Step i+6+...: Complete the Q-panel Transaction. Continue the algorithm until the end of the Q-panel transaction.

SLIDE 74

Step by Step ABFT-LU

[Figure: block-cyclic matrix layout with checksum block columns (state at this step)]

Step i+6+...: Complete the Q-panel Transaction. Continue the algorithm until the end of the Q-panel transaction.

SLIDE 75

Step by Step ABFT-LU

[Figure: block-cyclic matrix layout with checksum block columns (state at this step)]

Step i+6+...: Complete the Q-panel Transaction. Compute and set the checksum of the finished Q-panels.

SLIDE 76

Step by Step ABFT-LU

[Figure: block-cyclic matrix layout with checksum block columns (state at this step)]

Step i+6+...: Start the Next Q-panel Transaction. Discard the previous local copies, then locally copy the current Q-panel and its checksums.
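The steps above can be condensed into one schematic loop (a plain-Python sketch, not the distributed implementation; the additive `panels` are hypothetical stand-ins for a panel factorization plus its trailing update):

```python
def q_panel_transaction(data, panels, fail_at=None):
    """Schematic Q-panel transaction: snapshot the panel data, apply the Q
    panels, roll back and redo on failure, commit the checksum at the end."""
    snapshot = list(data)                 # Step 1: local copy of the Q-panel
    applied = 0
    while applied < len(panels):
        if fail_at == applied:            # Step i: a failure interrupts us
            data = list(snapshot)         # Step i+3: roll back from the copy
            applied = 0
            fail_at = None                # Steps i+4/i+5: redo the panels
            continue
        for i in range(len(data)):        # Step 2: apply one panel update
            data[i] += panels[applied][i]
        applied += 1
    checksum = sum(data)                  # Commit: recompute the checksum
    return data, checksum

# The result is identical with and without a mid-transaction failure.
clean = q_panel_transaction([1.0, 2.0], [[1, 1], [2, 2]])
failed = q_panel_transaction([1.0, 2.0], [[1, 1], [2, 2]], fail_at=1)
assert clean == failed == ([4.0, 5.0], 9.0)
```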

SLIDE 77

Outline

1. Motivation: Large Scale & Failures; Checkpointing Approaches; Coordinated Checkpointing – Young/Daly’s approximation; Hierarchical checkpointing; Evaluation of Checkpointing Approaches for Large Scale Platforms; Conclusion on Checkpointing Approaches
2. Algorithm-Based Fault Tolerance: Principle; Example: LU Factorization; Step by Step ABFT LU
3. Implementing ABFT on MPI: Fault-Tolerance & the MPI Standard; User-Level Failure Mitigation; Checkpoint on Failures
4. Conclusion

SLIDE 78

Outline

1. Motivation: Large Scale & Failures; Checkpointing Approaches; Coordinated Checkpointing – Young/Daly’s approximation; Hierarchical checkpointing; Evaluation of Checkpointing Approaches for Large Scale Platforms; Conclusion on Checkpointing Approaches
2. Algorithm-Based Fault Tolerance: Principle; Example: LU Factorization; Step by Step ABFT LU
3. Implementing ABFT on MPI: Fault-Tolerance & the MPI Standard; User-Level Failure Mitigation; Checkpoint on Failures
4. Conclusion

SLIDE 79

MPI-2 and 3: Fault Tolerance Support

Errors are fatal: the application is aborted at the first failure. Errors return: in the best case, an error is returned to the user, and the state of MPI is undefined.

MPI Standard: “[...] it is the job of the implementor of the MPI subsystem to insulate the user from this unreliability, or to reflect unrecoverable errors as failures. Whenever possible, such failures will be reflected as errors in the relevant communication call. Similarly, MPI itself provides no mechanisms for handling processor failures.” – MPI Standard 3.0, p. 20, l. 36:39

SLIDE 80

MPI-2 and 3: Fault Tolerance Support

Errors are fatal: the application is aborted at the first failure. Errors return: in the best case, an error is returned to the user, and the state of MPI is undefined.

MPI Standard: “This document does not specify the state of a computation after an erroneous MPI call has occurred.” – MPI Standard 3.0, p. 21, l. 24:25

SLIDE 81

Non Standard Extensions for Fault-Tolerant MPI

FT-MPI: a fault-tolerant MPI implementation (designed around the 2000s). A proven research tool.

Effective support of some ABFT algorithms

But... It supports only TCP, not modern networks. It is not installed by default and may be hard to compile and use on recent HPC systems. And users are reluctant to adopt non-standard MPI middleware.

SLIDE 82

Outline

1. Motivation: Large Scale & Failures; Checkpointing Approaches; Coordinated Checkpointing – Young/Daly’s approximation; Hierarchical checkpointing; Evaluation of Checkpointing Approaches for Large Scale Platforms; Conclusion on Checkpointing Approaches
2. Algorithm-Based Fault Tolerance: Principle; Example: LU Factorization; Step by Step ABFT LU
3. Implementing ABFT on MPI: Fault-Tolerance & the MPI Standard; User-Level Failure Mitigation; Checkpoint on Failures
4. Conclusion

SLIDE 83

Fault-Tolerant MPI, in the MPI-4 Standard?

User-Level Failure Mitigation Proposed MPI API changes for Fault-Tolerant Applications

Operations that cannot complete return MPI_ERR_PROC_FAILED; operations that can complete return MPI_SUCCESS. Success indicates local success only: it does not guarantee success on other ranks.

New constructs: Revoke, Agree, Shrink

[Figure: a Bcast where one rank observes ERR_PROC_FAILED; after Revoke and Shrink, the Bcast is retried successfully on the surviving ranks]
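The revoke/shrink/retry pattern can be mimicked with a toy simulation (the `Comm` class and its methods are invented stand-ins for an MPI communicator and the proposed MPIX_Comm_revoke / MPIX_Comm_shrink calls; this is not real MPI code):

```python
class ProcFailed(Exception):
    """Stands in for MPI_ERR_PROC_FAILED."""

class Comm:
    """Toy communicator: fails collectives while dead ranks are present."""
    def __init__(self, ranks, dead=()):
        self.ranks, self.dead, self.revoked = list(ranks), set(dead), False

    def bcast(self, value):
        if self.revoked or self.dead:
            raise ProcFailed              # the operation cannot complete
        return {r: value for r in self.ranks}

    def revoke(self):                     # ~ MPIX_Comm_revoke: interrupt all
        self.revoked = True

    def shrink(self):                     # ~ MPIX_Comm_shrink: survivors only
        return Comm([r for r in self.ranks if r not in self.dead])

def resilient_bcast(comm, value):
    """Retry a broadcast across failures: revoke, shrink, try again."""
    while True:
        try:
            return comm.bcast(value), comm
        except ProcFailed:
            comm.revoke()
            comm = comm.shrink()

comm = Comm(ranks=[0, 1, 2], dead=[1])    # rank 1 has failed
result, comm = resilient_bcast(comm, 42)
assert result == {0: 42, 2: 42}           # the survivors complete the Bcast
```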

SLIDE 84

Performance Impact of ULFM on Open MPI (lack of)

Micro-benchmarks: no performance difference measured. Point-to-point latency and bandwidth are unchanged (including shared memory); collective communications are unchanged; the overhead is below the standard deviation.

1-byte latency (microseconds, cache hot):

  Interconnect     Vanilla   Std. Dev.    Enabled   Std. Dev.   Difference
  Shared Memory     0.8008      0.0093     0.8016      0.0161       0.0008
  TCP              10.2564      0.0946    10.2776      0.1065       0.0212
  OpenIB            4.9637      0.0018     4.9650      0.0022       0.0013

Bandwidth (Mbps, cache hot):

  Interconnect     Vanilla   Std. Dev.    Enabled   Std. Dev.   Difference
  Shared Memory  10,625.92       23.46  10,602.68       30.73       -23.24
  TCP             6,311.38       14.42   6,302.75       10.72        -8.63
  OpenIB          9,688.85        3.29   9,689.13        3.77         0.28

[Figure: % difference between ULFM-enabled and vanilla Open MPI on IMB benchmarks (AllReduce, AlltoAll, Bcast, Reduce, SendRecv, PingPing, PingPong at 4B and 4MB, plus Barrier); 48 cores, shared memory]

SLIDE 85

Performance Impact of ULFM on Open MPI (lack of)

Application: Sequoia AMG, a multigrid solver for unstructured meshes. A non-fault-tolerant application: no performance impact measured.

[Figure: cumulated time (s) of the SStruct, Setup, and Solve phases, FT vs. no-FT, for 8 to 512 processes]

SLIDE 86

ABFT LU Performance on ULFM

SLIDE 87

ABFT QR Performance on ULFM

SLIDE 88

Outline

1. Motivation: Large Scale & Failures; Checkpointing Approaches; Coordinated Checkpointing – Young/Daly’s approximation; Hierarchical checkpointing; Evaluation of Checkpointing Approaches for Large Scale Platforms; Conclusion on Checkpointing Approaches
2. Algorithm-Based Fault Tolerance: Principle; Example: LU Factorization; Step by Step ABFT LU
3. Implementing ABFT on MPI: Fault-Tolerance & the MPI Standard; User-Level Failure Mitigation; Checkpoint on Failures
4. Conclusion

SLIDE 89

Checkpoint on Failures: ABFT with rudimentary support

Algorithm 1 The Checkpoint-on-Failure Protocol

1. MPI returns an error on surviving processes
2. Surviving processes checkpoint
3. Surviving processes exit
4. A new MPI application is started
5. Processes load from checkpoint (if any)
6. Processes enter ABFT dataset recovery
7. Application resumes

[Figure: timeline of the Checkpoint-on-Failure protocol, annotated with steps 1–7]

Checkpoint on Failures: a hybrid approach between traditional ABFT and checkpoint/restart methods. Optimal number of checkpoints (by definition: one checkpoint-rollback per fault). No lost work. Does not require continued service from MPI after a failure.
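Under CoF the MPI job itself is disposable; the outer driver can be sketched as a restart loop (a schematic simulation with scalar "blocks" held by integer ranks; `cof_run` and the sum checksum are invented for the illustration):

```python
# Schematic Checkpoint-on-Failure run: survivors checkpoint and exit, a
# fresh "MPI job" reloads the checkpoints, and ABFT recovers the lost part.
def cof_run(data, checksum, lose_rank):
    store = {}
    # --- first MPI run: a failure removes one rank ---
    survivors = {r: v for r, v in data.items() if r != lose_rank}
    for r, v in survivors.items():      # steps 1-3: checkpoint, then exit
        store[r] = v
    # --- second MPI run: restart, load, ABFT recovery ---
    restarted = dict(store)             # steps 4-5: load the checkpoints
    missing = checksum - sum(restarted.values())  # step 6: invert the checksum
    restarted[lose_rank] = missing
    return restarted                    # step 7: the application resumes

data = {0: 3.0, 1: 1.0, 2: 4.0}
recovered = cof_run(data, checksum=8.0, lose_rank=1)
assert recovered == data                # no lost work: the dataset is whole
```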

SLIDE 90

CoF Requirements

MPI Requirements. Returning control after a failure: the user-defined error handler is called, at least when doing direct point-to-point communication with a dead process. Termination after checkpoint: after the error handler is called, MPI is no longer functional, but a process may still checkpoint and exit. These limited requirements are realistic for most MPI-2/3 implementations today.

SLIDE 91

CoF Performance

[Figure: ABFT QR on Kraken (24×24 process grid, block size 100): performance (Tflop/s) vs. matrix size N for ScaLAPACK, ABFT QR without failure, and ABFT QR with one CoF recovery; and the share of application time (%) spent loading and dumping checkpoints and in ABFT recovery, for N = 20k to 50k]

SLIDE 92

Outline

1. Motivation: Large Scale & Failures; Checkpointing Approaches; Coordinated Checkpointing – Young/Daly’s approximation; Hierarchical checkpointing; Evaluation of Checkpointing Approaches for Large Scale Platforms; Conclusion on Checkpointing Approaches
2. Algorithm-Based Fault Tolerance: Principle; Example: LU Factorization; Step by Step ABFT LU
3. Implementing ABFT on MPI: Fault-Tolerance & the MPI Standard; User-Level Failure Mitigation; Checkpoint on Failures
4. Conclusion

SLIDE 93

Conclusion

Legacy Checkpoint/Restart requires tremendous improvements in checkpoint technology, storage, and transfer.

(“Doing nothing” about fault tolerance is also a dangerous route.)

Algorithm-Specific Fault Tolerance is to be considered: many applications are likely to benefit. ABFT methods have great scalability, but they are not a panacea: each algorithm/application has to take care of its own fault tolerance, and the HPC middleware has to adapt. FT support was postponed to MPI-Next. The last chance (for MPI)?

SLIDE 94

Bibliography

Algorithm Based Fault Tolerance

[DBB+11] P. Du, A. Bouteiller, G. Bosilca, T. Herault, and J. Dongarra. Algorithm-based fault tolerance for dense matrix factorizations. Technical Report UT-CS-11-676, University of Tennessee Computer Science, Knoxville, TN, August 2011. http://icl.cs.utk.edu/news_pub/submissions/lawn253.pdf

[DBB+12] P. Du, A. Bouteiller, G. Bosilca, T. Herault, and J. Dongarra. Algorithm-based fault tolerance for dense matrix factorizations. In J. Ramanujam and P. Sadayappan, editors, Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2012), pages 225–234, New Orleans, LA, February 2012.

[HA84] Kuang-Hua Huang and J. A. Abraham. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, C-33(6):518–528, 1984.

SLIDE 95

Bibliography

Modeling of System-Level Checkpointing

[BBB+12b] G. Bosilca, A. Bouteiller, E. Brunet, F. Cappello, J. Dongarra, A. Guermouche, T. Herault, Y. Robert, F. Vivien, and D. Zaidouni. Unified model for assessing checkpointing protocols at extreme-scale. Technical report, Innovative Computing Laboratory, University of Tennessee, June 2012.

[BCD+13] A. Bouteiller, F. Cappello, J. Dongarra, A. Guermouche, T. Herault, and Y. Robert. Multi-criteria checkpointing strategies: optimizing response-time versus resource utilization. Technical report, Innovative Computing Laboratory, University of Tennessee, February 2013.

Checkpoint on Failure

[BDB+12] W. Bland, P. Du, A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra. A checkpoint-on-failure protocol for algorithm-based recovery in standard MPI. August 2012.

SLIDE 96

Bibliography

User-Level Failure Mitigation

[BBB+12a] W. Bland, G. Bosilca, A. Bouteiller, T. Herault, and J. Dongarra. A proposal for user-level failure mitigation in the MPI-3 standard. Technical report, Innovative Computing Laboratory, University of Tennessee, February 2012.

[BBH+13] W. Bland, A. Bouteiller, T. Herault, J. Hursey, G. Bosilca, and J. J. Dongarra. An evaluation of user-level failure mitigation support in MPI. Computing, pages 1–14, May 2013.

SLIDE 97

Bibliography

Algorithm Based Fault Tolerance (cont. soft errors)

[DLD11] P. Du, P. Luszczek, and J. Dongarra. High performance dense linear system solver with soft error resilience. In IEEE Cluster 2011, September 2011.

[DLD12] P. Du, P. Luszczek, and J. Dongarra. High performance dense linear system solver with resilience to multiple soft errors. In ICCS 2012, June 2012.

[DLSD11] P. Du, P. Luszczek, S. Tomov, and J. Dongarra. Soft error resilient QR factorization for hybrid system with GPGPU. Journal of Computational Science, November 2011.
