Automatic and Coordinated Job Recovery for High Performance Computing


SLIDE 1

Automatic and Coordinated Job Recovery for High Performance Computing

Wei Tang¹, Zhiling Lan¹, Narayan Desai², and Daniel Buettner²

¹Illinois Institute of Technology and ²Argonne National Laboratory

Nov 15, 2010

Wei Tang, Zhiling Lan, Narayan Desai, and Daniel Buettner (Illinois Institute of Technology and Argonne National Laboratory) Nov 15, 2010 1 / 24

SLIDE 2

Outline

- Motivation
- System Design
- Implementation
- Evaluations


SLIDE 3

System Failure and Fault Tolerance

- System failures are increasingly common as the scale of supercomputers grows.
- Fault tolerance schemes continue to be proposed:
  - Redundancy and replication
  - Checkpoint/restart
  - Failure prediction + process migration
  - Failure prediction + fault-aware job scheduling
- Most existing fault tolerance schemes focus on pre-failure avoidance, although post-failure handling is equally important.


SLIDE 4

Resource management system

Functionality:

- Manages the processing load
- Prevents jobs from competing with each other for limited compute resources

Two parts:

- Resource manager: maintains resources, e.g., job queues and computing nodes.
- Job scheduler: makes scheduling decisions, i.e., when and where to run a job.

Examples: PBS (Altair), Moab (Adaptive Computing), LSF (Platform), LoadLeveler (IBM), Cobalt (ANL)


SLIDE 5

Motivation

Fault-tolerance aspect:

- Precautionary fault avoidance does not suffice, because failures are inevitable.
- Post-failure recovery is important, but little existing work addresses it.

Resource management aspect:

- Resource managers assume that jobs will run to completion, so they offer little support for post-failure handling.
- Because resources are limited, failed jobs should be treated differently according to their importance or priority.


SLIDE 6

Our approach

- AuCoRe: Automatic and Coordinated job Recovery
- Extends the resource management system to support post-failure handling
- AuCoRe automatically resubmits failed jobs in a systematic manner:
  - treating failed jobs according to their recovery priority
  - coordinating failed-job recovery with the queuing of regular jobs


SLIDE 7

Design diagram

Figure: Diagram of AuCoRe. Users are allowed to specify their job recovery options in job submission scripts or commands. Jobs are maintained in three groups, namely the waiting job queue, the running job list, and the failed job queue. A recovery manager enables automatic and coordinated job recovery and supports an incentive management mechanism.


SLIDE 8

Recovery Options

- Users specify a recovery option in the submission script.
- Suggested options:
  - Option A: notify only
  - Option B: resubmit to the rear of the queue
  - Option C: restart the job on its original nodes once they are repaired
  - Option D: insert the job in the middle of the queue
  - Option E: resubmit to the head of the queue


SLIDE 9

Coordinated recovery

Figure: Treatment of failed jobs with different recovery options. Option-A jobs are set aside to await manual resubmission; option-C jobs are suspended until their computing nodes are recovered; jobs with options B, D, and E are resubmitted to different parts of the waiting job queue.
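
The per-option treatment of a failed job can be sketched in a few lines of Python. This is a minimal illustration only, not the Cobalt or AuCoRe code: the class `RecoveryManager`, its attribute names, and the choice of "middle" as half the current queue length are all assumptions made for the example.

```python
from collections import deque
from enum import Enum


class RecoveryOption(Enum):
    A = "notify only"
    B = "resubmit to rear of queue"
    C = "restart on original nodes once repaired"
    D = "insert in middle of queue"
    E = "resubmit to head of queue"


class RecoveryManager:
    """Hypothetical sketch of coordinated recovery (names are assumptions)."""

    def __init__(self, waiting_jobs):
        self.waiting = deque(waiting_jobs)  # waiting job queue, head at index 0
        self.held_for_repair = []           # option-C jobs waiting for their nodes
        self.notified = []                  # option-A jobs left for manual resubmission

    def handle_failure(self, job, option):
        if option is RecoveryOption.A:
            self.notified.append(job)
        elif option is RecoveryOption.B:
            self.waiting.append(job)
        elif option is RecoveryOption.C:
            self.held_for_repair.append(job)
        elif option is RecoveryOption.D:
            # Assumption: "middle" means half the current queue length.
            self.waiting.insert(len(self.waiting) // 2, job)
        elif option is RecoveryOption.E:
            self.waiting.appendleft(job)

    def nodes_repaired(self, job):
        # Release an option-C job so it can restart on its original nodes.
        self.held_for_repair.remove(job)
        return job
```

Regular jobs keep arriving at the rear of `waiting`, so options B, D, and E differ only in where the failed job re-enters the same queue.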


SLIDE 10

Incentive management

User behavior is hard to manage:

- Ignoring the recovery option
- Gaming the system by always specifying high options

Incentive mechanism

- Users pay for each recovery option with (virtual) credits at job submission.
- A higher recovery priority costs more credits.
- Credits are prepaid and are not returned even if no failure occurs (like insurance).
- Jobs default to the lowest option if none is specified.


SLIDE 11

Incentive mechanism

Pricing: C = α_i × T × N

- C: the cost of a job with recovery option i
- α_i: the unit price of recovery option i
- T: the job's running time (in hours)
- N: the number of computing nodes the job uses

User Recovery Account: S = β × T × N

- S: the credits assigned to a user each time the user submits a job
- β: a parameter set by the system owner, usually the median unit price (P_m)

Charging: B = (α_i − β) × T × N

- B: the actual charge for a job, i.e., B = C − S
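
A short worked example of the three formulas above, as a sketch. The unit prices in `UNIT_PRICE` are invented for illustration (the slides give no values), α_i is treated as a per-option unit price, and β is set to the median unit price as the slide suggests.

```python
import statistics

# Hypothetical unit prices per recovery option; not taken from the slides.
UNIT_PRICE = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0, "E": 5.0}


def cost(option, hours, nodes, prices=UNIT_PRICE):
    """C = alpha_i * T * N"""
    return prices[option] * hours * nodes


def allowance(hours, nodes, beta):
    """S = beta * T * N, the credits granted at submission."""
    return beta * hours * nodes


def charge(option, hours, nodes, beta, prices=UNIT_PRICE):
    """B = (alpha_i - beta) * T * N, which equals C - S."""
    return (prices[option] - beta) * hours * nodes


beta = statistics.median(UNIT_PRICE.values())  # median unit price
for opt in UNIT_PRICE:
    # A 2-hour job on 512 nodes under each option.
    print(opt, cost(opt, 2, 512), allowance(2, 512, beta), charge(opt, 2, 512, beta))
```

With these assumed prices, options priced below the median yield a negative B (the job costs less than the credits granted), the median-priced option yields zero, and options above the median yield a positive B.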


SLIDE 12

Implementation

Figure: AuCoRe Implementation with Cobalt, a production resource management system developed by Argonne National Laboratory.


SLIDE 13

Evaluation

- Event-driven simulation using Qsim, a job scheduling simulator that accompanies the Cobalt resource manager
- Uses a real job trace from the Blue Gene/P system at Argonne National Laboratory
- Uses synthetic failure events that follow a Weibull distribution
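
One way to generate synthetic failure events of this kind is to draw the gaps between consecutive failures from a Weibull distribution. The sketch below is not Qsim code; the function name, the choice to model inter-failure gaps (rather than some other quantity), and the shape and scale values are all assumptions, since the slides do not state them.

```python
import random


def synthetic_failure_times(horizon_hours, scale_hours, shape, seed=0):
    """Return failure timestamps (in hours) up to horizon_hours.

    Gaps between failures are Weibull-distributed; scale_hours and
    shape are hypothetical parameters chosen by the caller.
    """
    rng = random.Random(seed)  # seeded so a simulation run is repeatable
    t = 0.0
    times = []
    while True:
        # random.weibullvariate(alpha, beta): alpha is scale, beta is shape.
        t += rng.weibullvariate(scale_hours, shape)
        if t > horizon_hours:
            return times
        times.append(t)


# Example: failures over a 1000-hour window with an assumed scale and shape.
events = synthetic_failure_times(1000.0, scale_hours=50.0, shape=0.7, seed=1)
print(len(events), "failure events generated")
```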


SLIDE 14

Simulation cases

Cases                    Label    Description
W/O AuCoRe               FF       failure-free
W/O AuCoRe               MR       failure-present, manual resubmit
W/ AuCoRe (multi-opt)    Even     option proportion is 1:1:1:1:1
W/ AuCoRe (multi-opt)    Normal   option proportion is 1:2:4:2:1
W/ AuCoRe (single-opt)   All-B    all jobs with option B
W/ AuCoRe (single-opt)   All-C    all jobs with option C
W/ AuCoRe (single-opt)   All-D    all jobs with option D
W/ AuCoRe (single-opt)   All-E    all jobs with option E
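
For the multi-option cases, recovery options have to be assigned to the jobs of the trace in the stated proportions. A minimal sketch follows; it assumes the proportions are listed in the order A:B:C:D:E and that options are drawn at random with those weights, neither of which the slides confirm.

```python
import random

OPTIONS = ["A", "B", "C", "D", "E"]

# Assumption: the proportions are ordered A:B:C:D:E.
PROPORTIONS = {
    "Even": [1, 1, 1, 1, 1],
    "Normal": [1, 2, 4, 2, 1],
}


def assign_options(job_ids, case, seed=0):
    """Map each job id to a recovery option, weighted by the chosen case."""
    rng = random.Random(seed)
    picks = rng.choices(OPTIONS, weights=PROPORTIONS[case], k=len(job_ids))
    return dict(zip(job_ids, picks))


assignment = assign_options(range(1000), "Normal", seed=42)
```

Random weighted draws only approximate the target proportions on a finite trace; an exact split would need a deterministic assignment instead.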


SLIDE 15

Evaluation metrics

Response time (RESP)

- A job's response time is the time from its submission to its completion.
- Averaged over all jobs.

Failure slowdown (FSD)

- The ratio of the time delay caused by failure to the failure-free job execution time.
- Averaged over failed jobs.
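
Both metrics reduce to simple averages once per-job records are available. The sketch below assumes a hypothetical `JobRecord` with the failure delay already measured per job; how that delay is obtained in the actual simulator is not described in the slides.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JobRecord:
    submit: float                 # submission time
    end: float                    # completion time
    failure_free_runtime: float   # execution time had no failure occurred
    failure_delay: float = 0.0    # extra time attributed to failure (assumed given)
    failed: bool = False


def resp(jobs):
    """Average response time over all jobs."""
    return mean(j.end - j.submit for j in jobs)


def fsd(jobs):
    """Average failure slowdown over failed jobs (0.0 if none failed)."""
    failed = [j for j in jobs if j.failed]
    if not failed:
        return 0.0
    return mean(j.failure_delay / j.failure_free_runtime for j in failed)
```

For instance, a failed job with a failure-free run time of 10 hours and a failure-induced delay of 5 hours contributes a slowdown of 0.5 to FSD.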


SLIDE 16

Baseline simulations

Figure: Baseline


SLIDE 17

Comparison

Figure: Comparing multi-option cases with single-option ones. The X-axis represents the job groups categorized by their recovery options.


SLIDE 18

Multi-option vs single-option

Figure: Comparing multi-option cases with single-option ones. The X-axis represents the job groups categorized by their recovery options.


SLIDE 19

Performance under different MTTR

Figure: Performance under different MTTR.


SLIDE 20

Performance under different system MTBF

Figure: Performance under different system MTBF.


SLIDE 21

Performance under different job arrival rates

Figure: Performance under different job arrival rates.


SLIDE 22

Results Summary

- AuCoRe can significantly improve the performance of failed jobs and the overall system performance.
- In the multi-option cases, higher-priority recovery options yield larger performance gains than lower-priority options, especially on FSD. That is, recovery option diversity benefits the jobs that are considered truly important.
- The recovery performance is sensitive to MTTR. Therefore, MTTR should be considered when setting the relative unit price of option C.
- AuCoRe is effective under different system failure rates and job arrival rates.


SLIDE 23

Conclusion

- Proposed AuCoRe, an automatic and coordinated job recovery framework.
- Implemented it on top of the Cobalt resource manager and conducted simulation-based experiments.
- This is the first step in our “Recovery Aware Parallel Computing System (RAPS)” project.
- As a next step, we will enhance post-failure handling with more run-time information from the system perspective.


SLIDE 24

Acknowledgement

- The work at Illinois Institute of Technology is supported by NSF grants CNS-0834514, CNS-0720549, and CCF-0702737.
- The work at Argonne National Laboratory is supported by DOE Contract DE-AC02-06CH11357.

Thank you for listening!
