SLIDE 1

Resilient Distributed Concurrent Collections

Cédric Bassem
Promotor: Prof. Dr. Wolfgang De Meuter
Advisor: Dr. Yves Vandriessche

SLIDE 2

Evolution of Performance in High Performance Computing

(source: http://www.top500.org/statistics/perfdevel/)

Petascale = 10^15 Flop/s, Exascale = 10^18 Flop/s

SLIDE 3

Evolution of Failures in HPC

Main Source: Hardware Faults (~ 50%)

Source: Franck Cappello (2009)

At exascale: SMTTI < 30 min
(SMTTI = System Mean Time To Interrupt)

SLIDE 4

Resilience

“The collection of techniques for keeping applications running to a correct solution in a timely and efficient manner despite underlying system faults”

Snir et al. (2014)

Resilience = Fault Tolerance

Avizienis et al. (2004)

SLIDE 5

Coordinated Checkpoint/Restart

SLIDE 6

Asynchronous Checkpoint/Restart

SLIDE 7

Requirements for Asynchronous Checkpoint/Restart

  • Reasoning about state: self-aware, execution frontier
  • Safe restart: deterministic computation
  • Data race free: monotonically increasing state

SLIDE 8

Resilience in CnC

Vrvilo, N. (2014). Asynchronous Checkpoint/Restart for the Concurrent Collections Model (Unpublished master’s thesis). Rice University, Houston, Texas USA.

CnC Properties:

  • Dependency graph
  • Provable deterministic computation
  • Single assignment data

Focused on shared-memory CnC runtimes

SLIDE 9–16

The Concurrent Collections Model

[Animated diagram sequence: a Fibonacci example with a tag collection (Tags), item collections (Fibs, Results), and the environment (env). Frame by frame, the environment posts tags 1 and 2; steps put the items 0:0 and 1:1 into Fibs and 2:1 into Results, and 2:1 finally flows back to the environment. Each tag and item put is mirrored into the checkpoint, which grows from {1, 2} to {1, 2, 0:0, 1:1, 2:1}.]

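The walkthrough above is the classic CnC Fibonacci example. Below is a minimal sketch of that graph in Intel(R) Concurrent Collections for C++, modelled on the structure of Intel's public fib sample; the names fib_step, fib_context and the m_* members are illustrative, not taken from this thesis.

#include <cnc/cnc.h>

struct fib_context;

// Step that computes fib(tag); prescribed by the tag collection.
struct fib_step {
    int execute( const int & tag, fib_context & ctxt ) const;
};

struct fib_context : public CnC::context< fib_context > {
    CnC::step_collection< fib_step >           m_steps;
    CnC::tag_collection< int >                 m_tags;  // "Tags"
    CnC::item_collection< int, unsigned long > m_fibs;  // "Fibs" / "Results"

    fib_context()
        : m_steps( *this ), m_tags( *this ), m_fibs( *this )
    {
        m_tags.prescribes( m_steps, *this );  // every tag spawns one step
    }
};

int fib_step::execute( const int & tag, fib_context & ctxt ) const {
    if ( tag < 2 ) {
        ctxt.m_fibs.put( tag, tag );      // fib(0) = 0, fib(1) = 1
    } else {
        unsigned long f1, f2;
        ctxt.m_fibs.get( tag - 1, f1 );   // data dependencies: the runtime replays
        ctxt.m_fibs.get( tag - 2, f2 );   // the step if an item is not yet available
        ctxt.m_fibs.put( tag, f1 + f2 );  // single-assignment put
    }
    return CnC::CNC_Success;
}

// The environment drives the graph by putting tags and reading the result.
unsigned long fib( int n ) {
    fib_context ctxt;
    for ( int i = 0; i <= n; ++i ) ctxt.m_tags.put( i );
    ctxt.wait();                          // block until the graph quiesces
    unsigned long result;
    ctxt.m_fibs.get( n, result );
    return result;
}

The single-assignment items and the deterministic step function are exactly the properties listed on slide 8 that make an asynchronously written checkpoint a safe restart point.
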
SLIDE 17

Proof of Concept Implementation

Goal: assess the viability of asynchronous C/R in distributed-memory CnC runtimes

Resilience Flavour:

  • Dedicated checkpoint node
  • Fine-grained updates
  • Uncoordinated restart

Runtime: Intel(R) Concurrent Collections for C++ (Architect: Frank Schlimbach)

SLIDE 18

Dedicated Checkpoint Node & Fine grained Updates

[Diagram: four compute nodes stream fine-grained updates to a dedicated checkpoint node.]

Updates contain:

  • data instances consumed
  • data instances produced
  • control instances produced
  • producers
  • consumers
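
A hypothetical sketch of such an update message; the type and field names (CkptUpdate, StepId, ItemKey) are invented for illustration, and the PoC's real encoding may differ.

#include <cstdint>
#include <vector>

using StepId  = std::uint64_t;  // hypothetical flattened (step collection, tag) id
using ItemKey = std::uint64_t;  // hypothetical flattened (item collection, tag) id

// One fine-grained update streamed from a compute node to the checkpoint node.
struct CkptUpdate {
    std::vector<ItemKey> itemsConsumed;    // data instances consumed
    std::vector<ItemKey> itemsProduced;    // data instances produced
    std::vector<StepId>  controlProduced;  // control instances (tags) produced
    std::vector<StepId>  producers;        // producing steps
    std::vector<StepId>  consumers;        // consuming steps
};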

SLIDE 19

Restart

[Diagram: four numbered nodes restarting from the checkpoint.]

Restart simulation ➜ no fault-tolerant MPI available
Uncoordinated restart ➜ step duplication

SLIDE 20

Memory Management in CnC

Non-trivial: data is accessed by dynamically spawned steps
One solution: the get-counting method

// Get count for fib item t: the number of steps that will read it,
// after which the runtime may free it. fib(t) is read by the steps
// for tags t+1 and t+2; fib(0) is only read by the step for tag 2.
int getCountFib( FibTag t ) {
    if ( t > 0 ) {
        return 2;
    } else {
        return 1;
    }
}
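
In Intel CnC for C++, get counts are normally supplied through an item-collection tuner. A minimal sketch, assuming an int tag in place of FibTag:

#include <cnc/cnc.h>

// Tuner telling the runtime how many gets each fib item will receive,
// so the item can be freed once its last consumer has read it.
struct fib_tuner : public CnC::hashmap_tuner {
    int get_count( const int & tag ) const {
        return tag > 0 ? 2 : 1;  // same rule as getCountFib above
    }
};

// Attached as the third template argument of the item collection:
//   CnC::item_collection< int, unsigned long, fib_tuner > m_fibs;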

SLIDE 21

Solution

Extra bookkeeping in the checkpoint:
➢ Consider steps only once when lowering get counts
  ○ Hashmap of considered steps
➢ Never re-add removed data instances
  ○ Marking data as removed
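
A hypothetical sketch of that bookkeeping on the checkpoint node, reusing the invented StepId/ItemKey identifiers from the update sketch above (a hash set stands in for the slide's hashmap):

#include <cstdint>
#include <unordered_set>

using StepId  = std::uint64_t;  // hypothetical step identifier
using ItemKey = std::uint64_t;  // hypothetical item identifier

// Extra bookkeeping so duplicated steps and late updates can neither
// corrupt the get counts nor resurrect already freed data.
struct CkptBookkeeping {
    std::unordered_set<StepId>  consideredSteps;  // steps counted once
    std::unordered_set<ItemKey> removedItems;     // freed, never re-added

    // Lower the get counts of a step's inputs exactly once:
    // returns true only the first time this step is seen.
    bool considerStep( StepId s ) {
        return consideredSteps.insert( s ).second;
    }

    // A data instance may be (re-)added only if it was never removed.
    bool mayAdd( ItemKey k ) const {
        return removedItems.count( k ) == 0;
    }

    // Record that an item's get count reached zero and it was freed.
    void markRemoved( ItemKey k ) {
        removedItems.insert( k );
    }
};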

SLIDE 22

Modelling Overhead (Tw/Ts)

Coordinated Checkpoint/Restart (Daly, 2006) vs. Asynchronous Checkpoint/Restart
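
For the coordinated case, Daly (2006) derives the optimum checkpoint interval. In its widely used first-order form (a standard result, not copied from the slide), with checkpoint cost \delta and system mean time to interrupt M:

\tau_{opt} \approx \sqrt{2 \delta M} - \delta

As M falls toward the sub-30-minute SMTTI of slide 3, this interval shrinks and coordinated checkpointing spends an ever larger share of the wall-clock time Tw writing checkpoints, which motivates comparing it against the asynchronous model.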

SLIDE 23

Evaluating Asynchronous Checkpoint/Restart

SLIDE 24

Benchmarks - Goals

Assessing the overhead factor (φ): OK if high
  Method: measure without resilience = solve time (Ts); measure with resilience = wall-clock time (Tw); overhead factor φ = Tw/Ts
Assessing the restart time (Tr): should be low
  Method: measure the time needed to calculate the restart set

SLIDE 25

Number of Steps

[Plots: overhead factor vs. number of steps for the Fibonacci and Mandelbrot benchmarks.]

Overhead factor (φ): Increases with number of steps

SLIDE 26

Restart Time

[Plot: restart time for the Fibonacci benchmark.]

Restart time (Tr): low
Optimization: shifting some of the complexity to the overhead factor

SLIDE 27

Future Work

Distributed checkpoint:
➢ Overhead high but constant
➢ Restart time?

Tag-only logging:
➢ Less communication
➢ Complex restart

SLIDE 28

Conclusion

Asynchronous C/R for a distributed-memory CnC runtime:
➢ Analysis of the different cases
➢ Proof-of-concept implementation

Asynchronous C/R is viable for systems with a low SMTTI:
➢ Model
➢ Proof-of-concept implementation

SLIDE 29

References

Daly, J. T. (2006). A Higher Order Estimate of the Optimum Checkpoint Interval for Restart Dumps. Future Generation Computer Systems, 22(3), 303–312.

Avizienis, A., Laprie, J., Randell, B., & Landwehr, C. E. (2004). Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing, 1(1), 11–33.

Snir, M., Wisniewski, R. W., Abraham, J. A., Adve, S. V., Bagchi, S., Balaji, P., … Hensbergen, E. V. (2014). Addressing Failures in Exascale Computing. International Journal of High Performance Computing Applications, 28(2), 129–173.

Cappello, F. (2009). Fault Tolerance in Petascale/Exascale Systems: Current Knowledge. International Journal of High Performance Computing Applications, 23(1), 212–226.

Vrvilo, N. (2014). Asynchronous Checkpoint/Restart for the Concurrent Collections Model (Unpublished master's thesis). Rice University, Houston, Texas, USA.
