Checkpointing HPC Applications
Thomas Ropars thomas.ropars@imag.fr
Universit´ e Grenoble Alpes
2016
Fault tolerance is a serious problem
◮ Due to hardware
◮ Due to software
Analysis of the failures in Blue Waters¹
¹ "Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters". DSN'14.
A failure occurs when an error reaches the service interface and alters the service.
Characterization of faults/errors: persistence
◮ Transient: occurs once and disappears
◮ E.g., bit-flip due to high-energy particles
◮ Permanent: occurs and does not go away
◮ E.g., a dead power supply
The failure model
The correctness of a fault tolerance technique has to be validated against a failure model.
We seek solutions that ensure the correct termination of parallel applications despite crash failures.
◮ The basic problem
◮ Checkpoint-based protocols
◮ Log-based protocols
◮ Recent contributions
◮ Alternatives to rollback-recovery
[Figure: eight processes P0–P7: tightly coupled applications]
A message-passing application
◮ MPI application
An asynchronous distributed system
◮ A communication channel between each pair of processes
◮ Reliable, FIFO (e.g., TCP, MPI)
◮ Unknown bound on message transmission delays
◮ No order between messages on different channels
Crash failures
Fault tolerance
How to ensure the correct execution of the application in the presence of faults?
◮ The execution should terminate
◮ It should provide the correct result
Rollback-recovery (also called backward error recovery, BER)
◮ Restore the application to a previously saved state when a failure is detected
◮ The state is saved during the failure-free execution
◮ Transient (soft) error: restart on the same resources
◮ Use spare resources to replace faulty ones in case of hard error
[Figure: application timeline with periodic checkpoints ckpt 1, ckpt 2, ckpt 3, ckpt 4]
Checkpoint data is saved to reliable storage.
[Figure: processes p0–p3 exchanging messages m0–m6]
◮ Ensuring a consistent state after the failure is the role of the rollback-recovery protocol
Lamport's happened-before relation¹
The execution of a process is a sequence of events: message sendings, message receptions, and internal events.
◮ If e, e′ ∈ H(p), then e → e′ or e′ → e
◮ If e = send(m) and e′ = recv(m), then e → e′
◮ If e → e′ and e′ → e′′, then e → e′′
¹ L. Lamport. "Time, Clocks, and the Ordering of Events in a Distributed System". Communications of the ACM (1978).
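The relation can be sketched in code. The helper below (hypothetical, not from the slides) builds happened-before as the transitive closure of per-process program order plus the send→receive edges:

```python
from itertools import product

def happened_before(program_order, messages):
    """Lamport's happened-before relation as a set of (e, e') pairs.

    program_order: dict process -> list of its events, in local order
    messages:      list of (send_event, recv_event) pairs
    """
    edges = set(messages)
    for seq in program_order.values():
        edges |= set(zip(seq, seq[1:]))
    closure = set(edges)
    # Transitive closure by fixpoint iteration
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), list(closure)):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure

# p0 executes send s1 then internal event e1; p1 receives m (r1) then does e2
hb = happened_before({"p0": ["s1", "e1"], "p1": ["r1", "e2"]},
                     [("s1", "r1")])
assert ("s1", "e2") in hb       # transitivity across the message
assert ("e1", "r1") not in hb   # e1 and r1 are concurrent
```

Events not related by the closure are concurrent, which is exactly what makes finding a consistent recovery line non-trivial.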
[Figure: processes p0–p3, messages m0–m6, illustrating happened-before relations]
A rollback-recovery protocol should restore the application to a consistent global state after a failure, i.e., a state that could have been reached during a failure-free execution.
Definition
A cut C is consistent iff for all events e and e′: e′ ∈ C and e → e′ ⟹ e ∈ C
◮ If the state of a process reflects a message reception, then the state of the corresponding sender should reflect the sending of that message
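The definition translates directly into a check over the happened-before pairs; a minimal sketch (names hypothetical):

```python
def is_consistent(cut, hb):
    """A cut C is consistent iff for all e -> e', e' in C implies e in C."""
    return all(e in cut for (e, e2) in hb if e2 in cut)

# Single message m: send_m -> recv_m
hb = {("send_m", "recv_m")}
assert is_consistent({"send_m", "recv_m"}, hb)  # both events in the cut
assert is_consistent({"send_m"}, hb)            # m in transit: still consistent
assert not is_consistent({"recv_m"}, hb)        # orphan message m
```

Note the asymmetry: a message in transit (sent but not yet received) is fine, while a reception without its sending makes the cut inconsistent.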
[Figure: processes p0–p3, messages m0–m6: an inconsistent recovery line]
Incremental checkpointing
◮ Only save the data modified since the last checkpoint
◮ Save storage space
◮ Reduce checkpoint time
Garbage collection = deleting checkpoints that are no longer useful
Application-level checkpointing
The programmer provides the code to save the process state
System-level checkpointing
The process state is saved by an external tool (ex: BLCR)
The checkpointing frequency is a trade-off:
◮ Checkpointing too often penalizes failure-free progress
◮ Checkpointing too rarely increases the amount of work lost in the event of a failure
Optimal checkpoint frequency depends on:
◮ The time needed to save a checkpoint
◮ The failure rate of the system
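The slides do not give a formula, but the classical first-order answer to this trade-off is Young's approximation, T_opt = sqrt(2 · C · MTBF), where C is the checkpoint cost; a sketch:

```python
import math

def young_interval(ckpt_cost, mtbf):
    """Young's approximation of the optimal time between two checkpoints.

    ckpt_cost: time to save one checkpoint (seconds)
    mtbf:      mean time between failures (seconds)
    """
    return math.sqrt(2 * ckpt_cost * mtbf)

# 5-minute checkpoints, one failure per day on average:
# checkpoint every 2 hours
assert young_interval(300, 24 * 3600) == 7200.0
```

As the MTBF shrinks at scale while checkpoint cost grows, the optimal interval shrinks, which is exactly why checkpointing overhead becomes critical on large machines.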
Three categories of techniques
◮ Uncoordinated checkpointing
◮ Coordinated checkpointing
◮ Communication-induced checkpointing (not suited to HPC workloads¹)
Idea
Save checkpoints of each process independently.
[Figure: processes p0–p2, messages m0–m6]
Problem
How to find a consistent global state after a failure?
◮ Cascading rollbacks on all processes (unbounded): the domino effect
◮ If process p1 fails, the only consistent state we can find is the initial state
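The domino effect can be illustrated with a small simulation (hypothetical helper; checkpoint lists include the initial state at time 0): rolling back the failed process creates orphan messages that force other processes to roll back, until only the initial state is consistent.

```python
def recovery_line(checkpoints, messages, failed):
    """Latest consistent restart point per process after `failed` crashes.

    checkpoints: dict process -> sorted checkpoint times (0 = initial state)
    messages:    list of (sender, send_time, receiver, recv_time)
    """
    line = {p: float("inf") for p in checkpoints}  # inf = no rollback needed
    line[failed] = checkpoints[failed][-1]         # failed process lost its state
    changed = True
    while changed:
        changed = False
        for snd, st, rcv, rt in messages:
            # Orphan: the reception is kept but the sending was rolled back
            if rt <= line[rcv] and st > line[snd]:
                line[rcv] = max(c for c in checkpoints[rcv] if c < rt)
                changed = True
    return line

# p1 fails and rolls back to t=3; its message sent at t=4 becomes orphan,
# forcing p0 back to t=0, which in turn orphans p0's message to p1.
line = recovery_line({"p0": [0, 5], "p1": [0, 3]},
                     [("p1", 4, "p0", 4.5), ("p0", 2, "p1", 2.5)],
                     failed="p1")
assert line == {"p0": 0, "p1": 0}   # cascades back to the initial state
```

With unluckily placed messages, every saved checkpoint is invalidated in turn, which is the unbounded cascade the slide describes.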
Implementation
◮ Dependencies between checkpoints are recorded
◮ Data piggybacked on messages and saved in the checkpoints
◮ After a failure, the dependencies are used to compute the recovery line
◮ [Bhargava and Lian, 1988], [Wang, 1993]
Other comments
◮ Hard to decide when a checkpoint is not useful anymore
◮ Many checkpoints may have to be stored
Idea
Coordinate the processes at checkpoint time to ensure that the global state that is saved is consistent.
[Figure: processes p0–p2, messages m0–m6]
Recovery after a failure
◮ Even the non-failed processes have to roll back
◮ Optimization: roll back only the processes that depend on the failed process¹
◮ In HPC apps: transitive dependencies between all processes
¹ "… Systems". ACM Fall Joint Computer Conference. 1986.
Other comments
◮ Only the last checkpoint should be kept
◮ What happens when one wants to save the state of all
processes at the same time?
Idea: Take advantage of the structure of the code
◮ Checkpoint at global synchronization points
◮ MPI collective operations
Idea
◮ Use (loosely) synchronized clocks to make all processes checkpoint at about the same time, without exchanging messages¹
¹ "Coordinated Checkpointing Without Direct Coordination". IPDS'98.
To ensure consistency
◮ A message sent after the sender's checkpoint could be received before the destination saved its checkpoint
◮ Before sending, the process waits for a delay corresponding to the effective deviation
◮ The effective deviation is computed based on the clock drift and the message transmission delay
[Figure: p0 delays the sending of message m to p1 by ED]
ED = t(clock drift) − minimum transmission delay
Blocking coordinated checkpointing¹
◮ The initiator broadcasts a checkpoint request to all processes
◮ Upon reception, each process stops the application, saves a checkpoint, and sends an ack to the initiator
◮ When it has received all the acks, the initiator notifies the processes; each process finalizes its checkpoint and resumes execution of the application
[Figure: p0 (initiator) sends a checkpoint request to p1 and p2, which answer with acks]
¹ "… Checkpoints".
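A minimal sequential sketch of the two-phase exchange (class and function names hypothetical). Because every process stops the application between the request and the ok, no application message can cross the checkpoint and become orphan:

```python
class Process:
    def __init__(self, name):
        self.name = name
        self.stopped = False
        self.checkpoint = None

    def on_request(self, state):
        self.stopped = True       # stop the application
        self.checkpoint = state   # save the process state
        return "ack"

    def on_ok(self):
        self.stopped = False      # finalize the checkpoint, resume the app

def coordinated_checkpoint(processes, states):
    # Phase 1: the initiator broadcasts the request and collects every ack
    acks = [p.on_request(states[p.name]) for p in processes]
    assert all(a == "ack" for a in acks)
    # Phase 2: all acks received, broadcast ok so the processes resume
    for p in processes:
        p.on_ok()

procs = [Process(f"p{i}") for i in range(3)]
coordinated_checkpoint(procs, {"p0": 10, "p1": 20, "p2": 30})
assert all(p.checkpoint is not None and not p.stopped for p in procs)
```

The price of this simplicity is that the whole application is blocked for the duration of the two phases.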
Correctness
Does the global checkpoint correspond to a consistent state, i.e., a state with no orphan message?
Proof sketch (by contradiction)
◮ Assume there is an orphan message m from pi to pj, such that: send(m) ∉ C and recv(m) ∈ C
◮ recv(m) ∈ C: m was received before the checkpoint request was processed by pj
◮ send(m) ∉ C: m was sent after pi received ok from the initiator
◮ Hence: recv(m) → recvj(ckpt) → recvi(ok) → send(m), i.e., recv(m) → send(m), contradicting send(m) → recv(m)
Non-blocking coordinated checkpointing¹
[Figure: the initiator and processes p0–p2; message m is in transit while the checkpoints are taken]
◮ Problem: message m could be received by p2 after the sender's checkpoint but before p2 takes its own, making the saved state inconsistent
◮ Send a marker to force p2 to save a checkpoint before delivering m
¹ K. M. Chandy and L. Lamport. "Distributed Snapshots: Determining Global States of Distributed Systems". ACM Transactions on Computer Systems (1985).
Assuming FIFO channels:
◮ The initiator takes its checkpoint and broadcasts a checkpoint request to all processes
◮ Upon receiving a request for the first time, a process (i) saves its checkpoint, and (ii) broadcasts a checkpoint-request to all. No event can occur between (i) and (ii).
◮ Once it has received a request from every other process, a process deletes its old checkpoint
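The steps above can be sketched as a small simulation over FIFO channels (all names hypothetical). The sketch checkpoints process states only; the full algorithm also records in-transit channel messages, which is omitted here:

```python
from collections import deque

class Proc:
    def __init__(self, pid, n):
        self.pid, self.n = pid, n
        self.state = 0       # application state (a counter)
        self.saved = None    # local checkpoint
        self.seen = set()    # processes whose request was received

    def take_checkpoint(self, net):
        if self.saved is None:
            self.saved = self.state            # (i) save the state
            for q in range(self.n):            # (ii) broadcast the request
                if q != self.pid:
                    net[self.pid][q].append("REQ")

    def on_message(self, src, msg, net):
        if msg == "REQ":
            self.take_checkpoint(net)  # first request triggers the checkpoint
            self.seen.add(src)
        else:
            self.state += msg          # application message

def drain(net, procs):
    """Deliver messages from the FIFO channels until quiescence."""
    progress = True
    while progress:
        progress = False
        for s in range(len(procs)):
            for d in range(len(procs)):
                if s != d and net[s][d]:
                    procs[d].on_message(s, net[s][d].popleft(), net)
                    progress = True

n = 3
net = [[deque() for _ in range(n)] for _ in range(n)]
procs = [Proc(i, n) for i in range(n)]
net[1][2].append(5)              # application message in transit, p1 to p2
procs[0].take_checkpoint(net)    # p0 initiates
drain(net, procs)
assert all(p.saved == 0 for p in procs)          # consistent recovery line
assert all(len(p.seen) == n - 1 for p in procs)  # checkpoints are stable
```

FIFO channels are what make this correct: a request sent after an application message cannot overtake it, so no message can become orphan.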
Idea: Log the messages exchanged during the failure-free execution to be able to replay them in the same order after a failure
3 families of protocols: pessimistic, optimistic, and causal
The execution of a process is a set of deterministic state intervals, each started by a non-deterministic event, typically a message reception.
[Figure: a process timeline divided into state intervals i−1, i, i+1, i+2, delimited by message receptions]
From a given initial state, playing the same sequence of messages will always lead to the same final state.
Basic idea
◮ Messages are saved in a log during the failure-free execution
◮ After a failure, the failed process restarts from its last checkpoint and replays the messages from the log
Consistent state
◮ The failed process follows the same execution path after the failure
◮ Other processes do not roll back. They wait for the failed process to catch up
What is logged? The determinant of each message reception:
◮ Sender id
◮ Sender sequence number
◮ Receiver id
◮ Receiver sequence number
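A sketch of how determinants plus sender-side payload logs allow replaying receptions in their original delivery order (names hypothetical):

```python
from typing import NamedTuple

class Determinant(NamedTuple):
    sender: str
    sender_seq: int     # sequence number assigned by the sender
    receiver: str
    receiver_seq: int   # delivery order at the receiver

# Payloads kept in the senders' memory, keyed by (sender, sender_seq)
payload_log = {("p0", 1): "m0", ("p2", 1): "m1"}
determinants = [Determinant("p0", 1, "p1", 1),
                Determinant("p2", 1, "p1", 2)]

def replay(receiver, determinants, payload_log):
    """Re-deliver the logged messages to `receiver` in the original order."""
    dets = sorted((d for d in determinants if d.receiver == receiver),
                  key=lambda d: d.receiver_seq)
    return [payload_log[(d.sender, d.sender_seq)] for d in dets]

assert replay("p1", determinants, payload_log) == ["m0", "m1"]
```

The determinant carries only ordering metadata; the payload itself can live elsewhere, which is exactly what sender-based logging exploits.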
Sender-based message logging¹
◮ The message payload is saved in the memory of the sender
◮ It is re-sent to the failed process during recovery
Event logging
◮ The determinants of the message receptions have to be logged reliably
¹ D. Johnson and W. Zwaenepoel. "Sender-Based Message Logging". The 17th Annual International Symposium on Fault-Tolerant Computing. 1987.
Important
◮ Logging determinants reliably requires remote synchronization
◮ The 3 protocol families correspond to different ways of managing determinants
An orphan message is a message that is seen as received, but whose sending state interval cannot be recovered.¹
[Figure: processes p0–p3, messages m0–m2]
If the determinants of messages m0 and m1 have not been saved, then message m2 is orphan.
¹ L. Alvisi and K. Marzullo. "Message Logging: Pessimistic, Optimistic, Causal, and Optimal". IEEE Transactions on Software Engineering (1998).
◮ Depend(e): the set of processes whose state causally depends on event e
◮ Log(e): the set of processes that keep a copy of the determinant of e in their memory
◮ Stable(e): true if the determinant of e is logged on a reliable storage
To avoid orphans: ∀e : ¬Stable(e) ⇒ Depend(e) ⊆ Log(e)
Pessimistic message logging
Failure-free protocol
◮ Determinants are logged synchronously on a reliable storage: ∀e : ¬Stable(e) ⇒ |Depend(e)| = 1
◮ The sending of new messages is delayed until pending determinants are acknowledged by the event logger (sending delay)
[Figure: process p and event logger EL; det, ack, sending delay]
Recovery
◮ Only the failed process has to roll back
Optimistic message logging
Failure-free protocol
◮ Determinants are logged asynchronously (periodically) on a reliable storage
[Figure: process p and event logger EL; det, ack (risk of orphan)]
Recovery
◮ Orphan processes may have to rollback
◮ Good performance during the failure-free execution
Causal message logging
Failure-free protocol
◮ Implements the "always-no-orphan" condition
◮ Determinants are piggybacked on application messages until they are saved on a reliable storage
[Figure: determinants [det] piggybacked on messages]
Recovery
◮ Only the failed process has to roll back
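A sketch of the piggybacking mechanism (all names hypothetical): a determinant travels on every outgoing message until the event logger confirms it is stable.

```python
class CausalProc:
    """Sketch of a process in a causal message logging protocol."""

    def __init__(self):
        self.unstable = []   # determinants not yet on reliable storage

    def deliver(self, det, piggybacked):
        # Record the reception's own determinant plus any piggybacked ones
        for d in [det] + piggybacked:
            if d not in self.unstable:
                self.unstable.append(d)

    def send(self, payload):
        # Piggyback every unstable determinant on the outgoing message
        return (payload, list(self.unstable))

    def on_logger_ack(self, stable):
        # Determinants confirmed by the event logger stop being piggybacked
        self.unstable = [d for d in self.unstable if d not in stable]

p = CausalProc()
p.deliver("d1", piggybacked=[])
assert p.send("x") == ("x", ["d1"])   # d1 travels with the message
p.on_logger_ack(["d1"])
assert p.send("y") == ("y", [])       # d1 is stable, no longer piggybacked
```

This way any process that causally depends on an event also holds its determinant, which is precisely the ∀e : ¬Stable(e) ⇒ Depend(e) ⊆ Log(e) condition.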
Failure-free performance
◮ Logging messages and determinants is costly
Recovery performance
◮ Recovery can be long and complex
Message logging is combined with checkpointing
Which checkpointing protocol?
◮ Uncoordinated checkpointing can be used: no risk of domino effect
Coordinated checkpointing
◮ All processes checkpoint/restart at the same time: contention on the reliable storage
◮ More than 50% of wasted time?¹
◮ Solution: see multi-level checkpointing
◮ Restarting all processes after a single failure is a big waste of resources
¹ "… Next-Generation Systems". MSST 2007.
Message logging
◮ Running a climate simulation (CM1) on 512 processes generates > 1 GB/s of logs¹
◮ Frequent synchronization with a reliable storage has a high cost
◮ Piggybacking information on messages penalizes communication performance
¹ "… Applications for Scalable Checkpointing". SuperComputing 2013.
Optimistic ML and coordinated checkpointing are combined¹
Optimistic message logging
◮ Only the failed process has to restart
Coordinated checkpointing
◮ All processes roll back to the last checkpoint in the case of the failure of an event logger
◮ No complex recovery protocol
¹ SuperComputing 2012.
Idea
◮ In MPI applications, most communication events are deterministic; few receptions are truly non-deterministic
◮ Discriminating deterministic communication events will improve event logging efficiency
Impact¹
¹ "… Performance". Concurrency and Computation: Practice and Experience (2010).
[Figure: P1 calls MPI_Isend(m, req1) then MPI_Wait(req1); P2 calls MPI_Irecv(req2) then MPI_Wait(req2); the MPI libraries transfer the message as packets 1…n; on P2's side, the reception is decomposed into post(req2), match(req2, m), and complete(req2)]
New execution model
2 events associated with each message reception:
◮ Match: the message is matched with a reception request. Non-deterministic only if ANY SOURCE is used
◮ Complete: the message data is available in the user buffer. Non-deterministic only for wait any/some and test functions
The application processes are grouped in logical clusters¹
Failure-free execution
◮ Coordinated checkpoints are taken inside clusters periodically
◮ Inter-cluster messages are logged
Recovery
◮ Only the processes of the failed cluster restart from the last checkpoint
◮ They replay the inter-cluster messages from the logs
[Figure: processes P grouped into logical clusters]
¹ "… Logging Protocols". Euro-Par'11.
Advantages
◮ Only inter-cluster messages have to be logged
◮ But the determinant of all messages should be logged¹
◮ Failure containment²
¹ "… Logging Protocols". Euro-Par'11.
² "Containment Domains: A Scalable, Efficient, and Flexible Resilience Scheme for Exascale Systems". SuperComputing 2012.
[Figure: communication matrix (sender rank × receiver rank, amount of data in bytes) for MiniFE, 64 processes, problem size 200×200×200]
Good applicability to most HPC workloads¹
¹ "… Improve Fault Tolerance for MPI HPC Applications". Euro-Par'11.
Non-deterministic algorithm
◮ The behavior of the processes can be influenced by non-deterministic events
Send-deterministic algorithm¹
◮ For a given set of input parameters and for any process p, the sequence of send events on p is the same in any valid execution of A
¹ "… Applications". ICCCN 2010.
The relative order of the messages received by a process has no impact on its execution.
[Figure: processes p0–p2, messages m1–m3]
It is possible to design an uncoordinated checkpointing protocol that has no risk of domino effect¹.
¹ "… Effect for Send-Deterministic Message Passing Applications". IPDPS 2011.
For send-deterministic MPI applications that do not include ANY SOURCE receptions:¹
For applications including ANY SOURCE receptions:¹
¹ "… Applications for Scalable Checkpointing". SuperComputing 2013.
Idea¹
◮ Try to predict failures
◮ Coverage of 50% and precision of 90% can be achieved
◮ Take proactive actions when a failure is predicted
◮ Save a checkpoint before the failure occurs
¹ "… Systems Using a Combination of Proactive and Preventive Checkpointing". IPDPS'13.
Replication¹
[Figure: ranks 0–3, each executed by two replicas P0 and P1; the replicas of a rank synchronize with each other on every message exchange]
¹ "Evaluating the Viability of Process Replication Reliability for Exascale Systems". SuperComputing 2011.
In the crash failure model
◮ It is actually possible to do better than replication for most applications
◮ Replication could be of interest to deal with silent errors
Idea
◮ Encode redundant information into the application data
◮ Maintain the redundancy during the computation
◮ After a failure, recover the lost data from the redundant information
◮ Minimal amount of replicated data
◮ No rollback
Context: Evolution of the MPI standard (fault tolerance working group)
Idea
◮ The application continues to run after a crash
◮ Tools are provided at the MPI level after a failure:
◮ Failure notifications
◮ Checking the status of components
◮ Reconfiguring the application
◮ Reference: survey by Elnozahy et al.¹
◮ Adapted to extreme-scale supercomputers and applications
¹ E. N. Elnozahy et al. "A Survey of Rollback-Recovery Protocols in Message-Passing Systems". ACM Computing Surveys 34.3 (2002).