
slide-1
SLIDE 1

On the Combination of Silent Error Detection and Checkpointing

Guillaume Aupy, Anne Benoit, Thomas Hérault, Yves Robert, Frédéric Vivien & Dounia Zaidouni

PRDC 2013

slide-2
SLIDE 2

Silent error detection

  • G. Aupy

Introduction, motivation Optimal Checkpointing strategy

Exponential distribution Arbitrary distribution

Limited resources Incorporating detection

k checkpoints

for 1 verification

k verifications

for 1 checkpoint

Conclusion, future work Announcement

1.0

1 Introduction, motivation
2 Optimal Checkpointing strategy
  • Exponential distribution
  • Arbitrary distribution
3 Limited resources
4 Incorporating detection
  • k checkpoints for 1 verification
  • k verifications for 1 checkpoint
5 Conclusion, future work
6 Announcement

slide-4
SLIDE 4

A few definitions

  • Many types of faults: software error, hardware malfunction, memory corruption
  • Many possible behaviors: transient, unrecoverable, silent
  • Restrict to silent errors
  • This includes some software faults, some hardware errors (soft errors in L1 cache), double bit flip
  • Silent error detected when corrupt data is activated
  • Silent errors are the black swans of errors (Marc Snir)
slide-5
SLIDE 5

Error sources (courtesy Franck Cappello)

  • Analysis of error and failure logs
  • In 2005 (Ph.D. of Charng-Da Lu): “Software halts account for the most number of outages (59-84 percent), and take the shortest time to repair (0.6-1.5 hours). Hardware problems, albeit rarer, need 6.3-100.7 hours to solve.”
  • In 2007 (Garth Gibson, ICPP Keynote)
  • In 2008 (Oliner and J. Stearley, DSN Conf.)

[Charts for 2007 and 2008 not preserved; the only surviving labels are “50%” and “Hardware”.]

Conclusion: both hardware and software failures have to be considered.

Software errors: applications, OS bugs (kernel panic), communication libraries, file system errors and others.

Hardware errors: disks, processors, memory, network.

slide-7
SLIDE 7

Figure: Error and detection latency. [Timeline not preserved; labels: Time, Xe, Xd, Error, Detection.]

  • Xe: inter-arrival time between errors; mean time µe
  • Xd: error detection time; mean time µd
  • Assume Xd and Xe are independent
slide-8
SLIDE 8

Notations

  • C checkpointing time
  • R recovery time
  • W total work
  • w some piece of work
slide-12
SLIDE 12

For one chunk

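When Xe follows an Exponential law of parameter $\lambda_e = 1/\mu_e$, in order to execute a total work of w + C, we need:

\[
E(T(w)) = e^{-\lambda_e (w+C)}\,(w+C) + \left(1 - e^{-\lambda_e (w+C)}\right)\left(E(T_{lost}) + E(X_d) + E(T_{rec}) + E(T(w))\right)
\]

  • $e^{-\lambda_e (w+C)}$: probability of execution without error
  • $1 - e^{-\lambda_e (w+C)}$: probability of error during w + C
  • $E(T_{lost}) + E(X_d) + E(T_{rec}) + E(T(w))$: execution time with an error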
slide-13
SLIDE 13

Let us focus on the time lost due to an error: E(Tlost) + E(Xd) + E(Trec)

slide-14
SLIDE 14

E(Tlost) is the time elapsed between the completion of the last checkpoint and the error:

\[
E(T_{lost}) = \int_0^{\infty} x\,P(X = x \mid X < w + C)\,dx = \frac{1}{P(X < w + C)} \int_0^{w+C} x\,\lambda_e e^{-\lambda_e x}\,dx = \frac{1}{\lambda_e} - \frac{w + C}{e^{\lambda_e (w+C)} - 1}
\]
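The closed form for E(Tlost) can be sanity-checked numerically. The following is a minimal sketch, not part of the talk: the function names and the sample values (a 10-hour mean time between errors, w = 5400 s, C = 600 s) are assumptions chosen only for illustration. It compares the closed form with a midpoint-rule estimate of the conditional mean of an Exponential variable truncated at w + C.

```python
import math


def expected_time_lost(lam_e: float, w: float, C: float) -> float:
    """Closed form: 1/lam_e - (w + C) / (exp(lam_e * (w + C)) - 1)."""
    span = w + C
    return 1.0 / lam_e - span / math.expm1(lam_e * span)


def expected_time_lost_numeric(lam_e: float, w: float, C: float,
                               steps: int = 100_000) -> float:
    """Midpoint-rule estimate of E[X | X < w + C] for X ~ Exponential(lam_e)."""
    span = w + C
    h = span / steps
    integral = sum(
        (i + 0.5) * h * lam_e * math.exp(-lam_e * (i + 0.5) * h) * h
        for i in range(steps)
    )
    # Divide by P(X < w + C) to condition on the error striking the chunk.
    return integral / (1.0 - math.exp(-lam_e * span))


if __name__ == "__main__":
    # Hypothetical values, in seconds.
    lam_e, w, C = 1.0 / 36_000.0, 5_400.0, 600.0
    print(expected_time_lost(lam_e, w, C))
    print(expected_time_lost_numeric(lam_e, w, C))
```

Because the Exponential density is decreasing, the conditional mean should come out below half the chunk length (w + C)/2, which gives a quick plausibility check on either function.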

slide-15
SLIDE 15

E(Xd) is the time needed for error detection: $E(X_d) = \mu_d$

slide-18
SLIDE 18

E(Trec) is the time to recover from the error (there can be a fault during recovery):

\[
E(T_{rec}) = e^{-\lambda_e R}\,R + \left(1 - e^{-\lambda_e R}\right)\left(E(R_{lost}) + E(X_d) + E(T_{rec})\right)
\]

Similarly to E(Tlost), we have:

\[
E(R_{lost}) = \frac{1}{\lambda_e} - \frac{R}{e^{\lambda_e R} - 1}
\]

So finally,

\[
E(T_{rec}) = \left(e^{\lambda_e R} - 1\right)(\mu_e + \mu_d)
\]

slide-19
SLIDE 19

At the end of the day,

\[
E(T(w)) = e^{\lambda_e R}\,(\mu_e + \mu_d)\left(e^{\lambda_e (w+C)} - 1\right)
\]

This is the exact solution!
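As a cross-check of the algebra, the closed form can be compared with the value obtained by solving the recursive equation for E(T(w)) directly. This is a sketch under assumptions: the function names and the numeric values are hypothetical, and the expressions for E(Tlost) and E(Trec) are the ones stated in the preceding slides.

```python
import math


def expected_chunk_time(w: float, C: float, R: float,
                        mu_e: float, mu_d: float) -> float:
    """Closed form: exp(lam*R) * (mu_e + mu_d) * (exp(lam*(w + C)) - 1)."""
    lam = 1.0 / mu_e
    return math.exp(lam * R) * (mu_e + mu_d) * math.expm1(lam * (w + C))


def expected_chunk_time_from_recursion(w: float, C: float, R: float,
                                       mu_e: float, mu_d: float) -> float:
    """Solve E = p*(w+C) + (1-p)*(E_lost + mu_d + E_rec + E) for E."""
    lam = 1.0 / mu_e
    span = w + C
    p_ok = math.exp(-lam * span)                  # no error during w + C
    e_lost = mu_e - span / math.expm1(lam * span) # E(T_lost)
    e_rec = math.expm1(lam * R) * (mu_e + mu_d)   # E(T_rec)
    return (p_ok * span + (1.0 - p_ok) * (e_lost + mu_d + e_rec)) / p_ok


if __name__ == "__main__":
    # Hypothetical values, in seconds.
    args = (5_400.0, 600.0, 600.0, 36_000.0, 1_200.0)
    print(expected_chunk_time(*args))
    print(expected_chunk_time_from_recursion(*args))
```

The two functions should agree up to floating-point rounding for any positive inputs, since the second one is only the first one before simplification.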

slide-21
SLIDE 21

For multiple chunks

Using n chunks of size $w_i$ (with $\sum_{i=1}^{n} w_i = W$), we have:

\[
E(T(W)) = K \sum_{i=1}^{n} \left(e^{\lambda_e (w_i + C)} - 1\right)
\]

with K constant.

  • The optimal chunk sizes are independent of µd!
  • Minimum when all the $w_i$'s are equal to w = W/n.
  • Optimal n can be found by differentiation.
  • A good approximation is $w = \sqrt{2 \mu_e C}$ (Young's formula).
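To illustrate the choice of n, here is a small sketch that searches for the best number of equal-size chunks by brute force and computes Young's approximation for comparison. The function names, the search bound `n_max`, and the sample values are assumptions; the constant K is taken to be $e^{\lambda_e R}(\mu_e + \mu_d)$, as in the single-chunk result.

```python
import math


def expected_total_time(W: float, n: int, C: float, R: float,
                        mu_e: float, mu_d: float) -> float:
    """E(T(W)) for n equal chunks of size W / n."""
    lam = 1.0 / mu_e
    K = math.exp(lam * R) * (mu_e + mu_d)
    return n * K * math.expm1(lam * (W / n + C))


def best_chunk_count(W: float, C: float, R: float, mu_e: float,
                     mu_d: float, n_max: int = 100_000) -> int:
    """Brute-force search over n = 1..n_max (hypothetical helper)."""
    return min(range(1, n_max + 1),
               key=lambda n: expected_total_time(W, n, C, R, mu_e, mu_d))


def young_chunk(mu_e: float, C: float) -> float:
    """Young's first-order approximation of the chunk size."""
    return math.sqrt(2.0 * mu_e * C)


if __name__ == "__main__":
    # Hypothetical values, in seconds: 10 days of work.
    W, C, R, mu_e, mu_d = 864_000.0, 600.0, 600.0, 36_000.0, 1_200.0
    n = best_chunk_count(W, C, R, mu_e, mu_d, n_max=2_000)
    print(n, W / n, young_chunk(mu_e, C))
```

Since µd only enters through the multiplicative constant K, changing `mu_d` scales the expected time but should leave the selected n unchanged.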

slide-22
SLIDE 22

Arbitrary distributions

Extend results when Xe follows an arbitrary distribution of mean µe

slide-23
SLIDE 23

Framework

Waste: fraction of time not spent for useful computations

slide-24
SLIDE 24

Waste

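  • Timebase: application base time
  • TimeFF: with periodic checkpoints but failure-free
  • TimeFinal: expectation of time with failures

[Diagram not preserved; labels: Timebase, TimeFF, TimeFinal, TimeFF × WasteFF, TimeFinal × WasteFail.]

\[
(1 - \mathrm{Waste}_{FF})\,\mathrm{Time}_{FF} = \mathrm{Time}_{base}, \qquad (1 - \mathrm{Waste}_{Fail})\,\mathrm{Time}_{Final} = \mathrm{Time}_{FF}
\]

\[
\mathrm{Waste} = \frac{\mathrm{Time}_{Final} - \mathrm{Time}_{base}}{\mathrm{Time}_{Final}} = 1 - (1 - \mathrm{Waste}_{FF})(1 - \mathrm{Waste}_{Fail})
\]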
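The product form follows from chaining the two definitions: Timebase/TimeFinal = (Timebase/TimeFF)(TimeFF/TimeFinal). A short sketch, with hypothetical function names and sample values, checks that the two ways of computing the waste agree.

```python
def total_waste(waste_ff: float, waste_fail: float) -> float:
    """Waste = 1 - (1 - WasteFF) * (1 - WasteFail)."""
    return 1.0 - (1.0 - waste_ff) * (1.0 - waste_fail)


def waste_from_times(time_base: float, time_final: float) -> float:
    """Waste = (TimeFinal - Timebase) / TimeFinal."""
    return (time_final - time_base) / time_final


if __name__ == "__main__":
    # Hypothetical values: 10% failure-free waste, 20% failure-induced waste.
    time_base = 100.0
    time_ff = time_base / (1.0 - 0.1)      # from (1 - WasteFF) * TimeFF = Timebase
    time_final = time_ff / (1.0 - 0.2)     # from (1 - WasteFail) * TimeFinal = TimeFF
    print(waste_from_times(time_base, time_final), total_waste(0.1, 0.2))
```

Note that the two wastes do not simply add: the combined value is slightly smaller than their sum because the failure-induced waste applies to a time that already includes the checkpoint overhead.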

slide-27
SLIDE 27

Back to our model

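We can show that

\[
\mathrm{Waste}_{FF} = \frac{C}{T}, \qquad \mathrm{Waste}_{Fail} = \frac{\frac{T}{2} + R + \mu_d}{\mu_e}
\]

Only valid if $\frac{T}{2} + R + \mu_d \ll \mu_e$.

Then the waste is minimized for

\[
T_{opt} = \sqrt{2\,(\mu_e - (R + \mu_d))\,C} \approx \sqrt{2 \mu_e C}
\]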
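The optimal period can be explored with a few lines of code. This is a sketch only: the function names and sample values are assumptions, and the waste is the first-order expression above, so it is meaningful only when T/2 + R + µd is small compared to µe.

```python
import math


def waste(T: float, C: float, R: float, mu_e: float, mu_d: float) -> float:
    """First-order waste for a checkpointing period T."""
    waste_ff = C / T
    waste_fail = (T / 2.0 + R + mu_d) / mu_e
    return 1.0 - (1.0 - waste_ff) * (1.0 - waste_fail)


def optimal_period(C: float, R: float, mu_e: float, mu_d: float) -> float:
    """T_opt = sqrt(2 * (mu_e - (R + mu_d)) * C)."""
    return math.sqrt(2.0 * (mu_e - (R + mu_d)) * C)


def young_period(C: float, mu_e: float) -> float:
    """Young's approximation sqrt(2 * mu_e * C)."""
    return math.sqrt(2.0 * mu_e * C)


if __name__ == "__main__":
    # Hypothetical values, in seconds.
    C, R, mu_e, mu_d = 60.0, 60.0, 36_000.0, 1_200.0
    t_opt = optimal_period(C, R, mu_e, mu_d)
    print(t_opt, waste(t_opt, C, R, mu_e, mu_d))
    print(young_period(C, mu_e), waste(young_period(C, mu_e), C, R, mu_e, mu_d))
```

Evaluating `waste` slightly below and slightly above `optimal_period(...)` is a quick way to confirm that the formula is indeed a minimizer of this expression.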
slide-28
SLIDE 28

Summary

Theorem

  • Best period is $T_{opt} \approx \sqrt{2 \mu_e C}$
  • Independent of Xd
slide-29
SLIDE 29

Limitation of this model

Analytical optimal solutions, valid for arbitrary distributions, without any knowledge of Xd except its mean.

However, if Xd can be arbitrarily large:

  • Do not know how far to roll back in time
  • Need to store all checkpoints taken during execution
slide-31
SLIDE 31

The case with limited resources

Assume that we can only save the last k checkpoints

Definition (Critical failure)

Error detected when all checkpoints contain corrupted data. Happens with probability Prisk during whole execution.

slide-32
SLIDE 32

The case with limited resources

  • Prisk decreases when T increases (when Xd is fixed). Hence, Prisk ≤ ε leads to a lower bound Tmin on T.
  • We have derived an analytical form for Prisk when Xd follows an Exponential law. We use it as a good(?) approximation for arbitrary laws.

slide-33
SLIDE 33

Figure: k = 3, $\lambda_e = \frac{10^5}{100\,\mathrm{y}}$, $\lambda_d = 30 \lambda_e$, w = 10d, C = R = 600 s.

  • Topt ≈ 100 min, Prisk(Topt) ≈ $38 \cdot 10^{-5}$, for a waste of 23.45%.
  • To reduce Prisk to $10^{-4}$, a Tmin of 8000 seconds is sufficient, increasing the waste by only 0.6%.
  • In this case, the benefit of fixing the period to max(Topt, Tmin) is obvious.
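The period selection max(Topt, Tmin) can be sketched with the parameters of this scenario. Several things here are assumptions: a year is taken as 365.25 days, Tmin = 8000 s is simply the value quoted in the text (no formula for Prisk is reproduced), and the waste uses the first-order expressions for WasteFF and WasteFail, so the printed values need not match the figures of the talk exactly.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: Julian year

MU_E = 100 * SECONDS_PER_YEAR / 1e5    # 1 / lambda_e, with lambda_e = 10^5 / (100 y)
MU_D = MU_E / 30                       # lambda_d = 30 * lambda_e
C = R = 600.0                          # seconds
T_MIN = 8000.0                         # bound quoted in the text for P_risk <= 1e-4


def waste(T: float) -> float:
    """First-order waste 1 - (1 - C/T) * (1 - (T/2 + R + mu_d) / mu_e)."""
    waste_ff = C / T
    waste_fail = (T / 2.0 + R + MU_D) / MU_E
    return 1.0 - (1.0 - waste_ff) * (1.0 - waste_fail)


T_OPT = math.sqrt(2.0 * (MU_E - (R + MU_D)) * C)
PERIOD = max(T_OPT, T_MIN)  # respect the risk bound when it is the larger one

if __name__ == "__main__":
    print(f"T_opt = {T_OPT / 60:.1f} min, waste = {waste(T_OPT):.4f}")
    print(f"period = {PERIOD:.0f} s, waste = {waste(PERIOD):.4f}")
```

The point of the sketch is qualitative: moving from Topt to the larger Tmin costs only a small amount of additional waste in this scenario, because the waste curve is flat around its minimum.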

slide-34
SLIDE 34

More optimistic technological scenario (smaller C and R):

  • Topt is largely reduced (down to less than 35 minutes), but Prisk(Topt) climbs to 1/2, an unacceptable value.
  • To reduce Prisk to $10^{-4}$, it becomes necessary to consider a Tmin of 6650 seconds.
  • The waste increases to 15%, significantly higher than the optimal one, which is below 10%.

Figure: k = 3, $\lambda_e = \frac{10^5}{100\,\mathrm{y}}$, $\lambda_d = 30 \lambda_e$, w = 10d, C = R = 60 s.

slide-36
SLIDE 36

Limitation of the model

  • It is not clear how one can detect when the error occurred (and hence identify the last valid checkpoint).
  • Need a verification mechanism to check the correctness of the checkpoints. This has a cost!

Possible solution: add verifications; use a periodic mechanism to verify that there were no silent errors in previous computations.

slide-38
SLIDE 38

With detection

Assume there are no errors during checkpoints (fewer error sources when doing I/O).

Simple approach: perform a verification before each checkpoint to eliminate the risk of corrupted data.

slide-40
SLIDE 40

Motivational Examples

With R = 0:

\[
\mathrm{Waste}_{FF} = \frac{V + C}{w + V + C}, \qquad \mathrm{Waste}_{Fail} = \frac{w}{\mu_e}
\]

Timeline (Time →): w V C w V C w V C w V C

When V is large compared to w, WasteFF is large. Can we improve that? Is this better?

Timeline (Time →): w C w V C w C w V C w C

slide-42
SLIDE 42

Motivational Examples

With R = 0:

\[
\mathrm{Waste}_{FF} = \frac{V + C}{w + V + C}, \qquad \mathrm{Waste}_{Fail} = \frac{w}{\mu_e}
\]

Timeline (Time →): w V C w V C w V C w V C

When V is small compared to w, WasteFail is large. Can we improve that? Is this better?

Timeline (Time →): w/2 V w/2 V C w/2 V w/2 V C w/2 V w/2 V C w/2 V

slide-46
SLIDE 46

k checkpoints for 1 verification

Timeline (Time →): V C w C w C w C w C w V C

With multiple checkpoints, the problem is to find when the error occurred.

Timeline with an error (Time →): V C w C w C w C w C w V R V R V R V

slide-47
SLIDE 47

k checkpoints for 1 verification

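Timeline (Time →): V C w C w C w C w C w V C

\[
\mathrm{Waste}_{FF} = \frac{kC + V}{k(w + C) + V}, \qquad \mathrm{Waste}_{Fail} = \frac{\frac{1}{k} \sum_{i=1}^{k} T_{lost}(i)}{\mu_e}
\]

where $T_{lost}(i)$ is the time lost if the error occurred in the i-th segment.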

slide-54
SLIDE 54

k checkpoints for 1 verification

Timeline (Time →): V C w C w C w C w C w V C

Timeline with an error (Time →): V C w C w C w C w C w V R V w V C

\[
T_{lost}(k) = R + V + w + V
\]

\[
T_{lost}(i) = (k - i + 1)(R + V + w) + (k - i)\,C + V
\]

\[
T_{lost}(1) = k(R + V + w) - V + (k - 1)\,C + V
\]

And this leads us to the optimal solution . . .
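The formulas for this pattern can be turned into a small evaluator. This is a sketch under assumptions, not the authors' optimization procedure: the function names are hypothetical, the special formula for the first segment is given precedence whenever i = 1 (including k = 1), and the total waste combines WasteFF = (kC + V)/(k(w + C) + V) and WasteFail (the mean of Tlost(i) over the k segments, divided by µe) through Waste = 1 − (1 − WasteFF)(1 − WasteFail).

```python
def time_lost(i: int, k: int, w: float, C: float, R: float, V: float) -> float:
    """Time lost when the error strikes segment i (1-based) out of k."""
    if not 1 <= i <= k:
        raise ValueError("segment index must satisfy 1 <= i <= k")
    if i == 1:
        # Special case stated for the first segment (assumed to take precedence).
        return k * (R + V + w) - V + (k - 1) * C + V
    return (k - i + 1) * (R + V + w) + (k - i) * C + V


def waste(k: int, w: float, C: float, R: float, V: float, mu_e: float) -> float:
    """First-order waste for k checkpoints per verification."""
    waste_ff = (k * C + V) / (k * (w + C) + V)
    mean_lost = sum(time_lost(i, k, w, C, R, V) for i in range(1, k + 1)) / k
    waste_fail = mean_lost / mu_e
    return 1.0 - (1.0 - waste_ff) * (1.0 - waste_fail)


if __name__ == "__main__":
    # Hypothetical chunk size w; V, C, R in seconds, mu_e roughly 10 years / 10^5.
    for k in (1, 2, 3, 4):
        print(k, waste(k, w=200.0, C=6.0, R=6.0, V=100.0, mu_e=3155.76))
```

Scanning w for each k gives one way to compare values of k. The outcome depends on the parameters, and this first-order evaluator need not reproduce the optimal k or the improvement percentages reported in the talk.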

slide-55
SLIDE 55

Figure: V = 100 s, C = R = 6 s, and $\mu = \frac{10\,\mathrm{y}}{10^5}$.

C = 6 s ≪ V. When V = 100 seconds, a verification is optimally done only every k = 3 checkpoints ⇒ 10% improvement compared to k = 1.

slide-56
SLIDE 56

C = 60 s is no longer negligible compared to V (V ≈ 5C). The waste is dominated by the cost of verification, and little improvement can be achieved by taking the optimal value for k.

Figure: V = 300 s, C = R = 60 s, and $\mu = \frac{10\,\mathrm{y}}{10^5}$.

slide-58
SLIDE 58

k verifications for 1 checkpoint

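Timeline (Time →): V C w V w V w V w V w V C

Very similarly, we obtain:

\[
\mathrm{Waste}_{FF} = \frac{kV + C}{k(w + V) + C}, \qquad \mathrm{Waste}_{Fail} = \frac{\frac{1}{k} \sum_{i=1}^{k} T_{lost}(i)}{\mu_e}
\]

\[
T_{lost}(i) = R + i(V + w)
\]

where $T_{lost}(i)$ is the time lost if the error occurred in the i-th segment.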
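For this pattern the average of Tlost(i) = R + i(V + w) has a simple closed form, R + (k + 1)(V + w)/2, obtained by summing i from 1 to k. The sketch below uses hypothetical function names and sample values, with WasteFF = (kV + C)/(k(w + V) + C) and the total waste taken as 1 − (1 − WasteFF)(1 − WasteFail); it evaluates the waste for a given k and w.

```python
def time_lost(i: int, w: float, R: float, V: float) -> float:
    """Time lost when the error strikes segment i (1-based)."""
    return R + i * (V + w)


def mean_time_lost(k: int, w: float, R: float, V: float) -> float:
    """Average of time_lost over the k segments, each assumed equally likely."""
    return sum(time_lost(i, w, R, V) for i in range(1, k + 1)) / k


def waste(k: int, w: float, C: float, R: float, V: float, mu_e: float) -> float:
    """First-order waste for k verifications per checkpoint."""
    waste_ff = (k * V + C) / (k * (w + V) + C)
    waste_fail = mean_time_lost(k, w, R, V) / mu_e
    return 1.0 - (1.0 - waste_ff) * (1.0 - waste_fail)


if __name__ == "__main__":
    # Hypothetical chunk size w; V, C, R in seconds, mu_e roughly 10 years / 10^5.
    for k in (1, 3, 5, 7):
        print(k, waste(k, w=230.0, C=600.0, R=600.0, V=20.0, mu_e=3155.76))
```

As with the previous pattern, a fair comparison between values of k requires optimizing w for each k, and this first-order evaluator need not match the percentages reported in the talk.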

slide-59
SLIDE 59

Figure: V = 20 s, C = R = 600 s, and $\mu = \frac{10\,\mathrm{y}}{10^5}$.

V = 20 s ≪ C. When C = 600 seconds, 5 verifications are optimally done for every checkpoint ⇒ 14% improvement compared to k = 1.

slide-60
SLIDE 60

V = 2 s ≪ C. When C = 60 seconds, 5 verifications are optimally done for every checkpoint ⇒ 18% improvement compared to k = 1.

Figure: V = 2 s, C = R = 60 s, and $\mu = \frac{10\,\mathrm{y}}{10^5}$.

slide-63
SLIDE 63

Conclusion

  • Study of optimal checkpointing strategy in presence of silent errors
  • Analytical solution for the different probability distributions
  • Study in presence of verification mechanisms
slide-64
SLIDE 64

Future work

  • Without verification: When we keep k checkpoints in memory, we do not have to keep the k last checkpoints: new strategies (Fibonacci, binary, . . . )?
  • With verification: We focused on an integer number of checkpoints per verification (or conversely): extensions?

slide-66
SLIDE 66

Hockey

“The local junior hockey team, the Vancouver Giants, offer a cheaper but no less exciting experience. They play out of Pacific Coliseum in East Van.” – WikiVoyage, Vancouver. There is a game Wednesday at 7pm (PRDC finishes at noon). I will try to go, tickets are 19.25$ or 23.50$. Contact me if you are interested :).