MCEM: Multi-Level Cooperative Exception Model for HPC Workflows



SLIDE 1

LLNL-PRES-779103

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

MCEM: Multi-Level Cooperative Exception Model for HPC Workflows

Stephen Herbein, David Domyancic, Paul Minner, Ignacio Laguna, Rafael Ferreira da Silva, Dong H. Ahn

June 2019

SLIDE 2

Fault-Tolerance in HPC

Fault tolerance is becoming increasingly important

§ The MTBF of our systems is shrinking
§ Checkpoint/restart is becoming prohibitively expensive
§ The problem will only get worse with the inclusion of GPUs and node-local SSDs

[1] R. Riesen, K. Ferreira and J. Stearley, "See applications run and throughput jump: The case for redundant computing in HPC," 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W), Chicago, IL, 2010, pp. 29-34

[1]

SLIDE 3

Fault-Tolerance Primitives

§ Detection

— the observation of a fault, error, or degradation

§ Isolation/Diagnosis

— the identification of the root cause of the detected fault

§ Recovery

— the remediation of the fault by affected components

SLIDE 4

Fault Tolerance: State of the Practice

§ Existing state-of-the-practice fault tolerance techniques are entirely uncoordinated
§ System components each act independently to detect, diagnose, and recover from faults

[Figure: system stack (Scheduler, Resource Manager, User’s Workflow Manager, Parallel Job, Node) responding to a SegFault. Legend: Detection, Diagnosis, Recovery. Labels shown: Process Exited Abnormally, Rank Unresponsive, Node Failed; Relaunch Process, Restart with N-1 Nodes.]

Lack of coordination results in undetected faults and inefficiency

SLIDE 5

Fault Tolerance: State of the Art

§ Components coordinate to detect and diagnose faults
§ System components each perform their own uncoordinated recovery actions
§ These actions are usually redundant and sometimes contradictory

[Figure: system stack (Scheduler, Resource Manager, User’s Workflow Manager, Parallel Job, Node) sharing a Global Event Database and an ML Model. Legend: Detection, Diagnosis, Recovery. Labels shown: SegFault, Process Failure on Node X, Process Exited Abnormally, Rank Unresponsive; Resubmit Job, Relaunch Process, Restart with N-1 Nodes, Kill Job.]

Lack of coordinated recovery results in suboptimal and redundant work

SLIDE 6

MCEM: Multi-Level Cooperative Exception Model

§ MCEM extends the idea of C++/Java exceptions to an entire HPC system
§ Exceptions are cooperatively handled in a chain
§ Chained exceptions include fault and recovery metadata

[Figure: system stack (Scheduler, Resource Manager, User’s Workflow Manager, Parallel Job, Node) sharing a Global Event Database and an ML Model. Legend: Detection, Diagnosis, Recovery. Labels shown: SegFault, Process Failure on Node X, Process Exited Abnormally, Rank Unresponsive; Relaunch Process, Restart with N Nodes, Extend Job Walltime.]
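
The chained, cooperative handling described on this slide can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `MCEMException` class, the handler functions, and the handler order are all hypothetical, and the action strings are taken from the figure labels. The rule that the parallel job restarts with N nodes once it sees that the process was relaunched is one possible reading of how recovery metadata might be used.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class MCEMException:
    """Hypothetical chained exception carrying fault and recovery metadata."""
    fault: str                     # diagnosed root cause, e.g. "Process Failure"
    location: str                  # where the fault occurred, e.g. "Node X"
    recovery_log: List[Tuple[str, str]] = field(default_factory=list)

    def already_done(self, action: str) -> bool:
        """True if an earlier handler in the chain already took this action."""
        return any(done == action for _, done in self.recovery_log)


Handler = Callable[[MCEMException], Optional[str]]


def resource_manager(exc: MCEMException) -> Optional[str]:
    return "Relaunch Process" if exc.fault == "Process Failure" else None


def parallel_job(exc: MCEMException) -> Optional[str]:
    # Assumption: knowing the process was relaunched lets the job keep all N nodes.
    if exc.already_done("Relaunch Process"):
        return "Restart with N Nodes"
    return "Restart with N-1 Nodes"


def scheduler(exc: MCEMException) -> Optional[str]:
    # Assumption: any restart costs the job time, so the walltime is extended.
    restarted = any(a.startswith("Restart") for _, a in exc.recovery_log)
    return "Extend Job Walltime" if restarted else None


CHAIN: List[Tuple[str, Handler]] = [
    ("Resource Manager", resource_manager),
    ("Parallel Job", parallel_job),
    ("Scheduler", scheduler),
]


def handle(exc: MCEMException, chain: List[Tuple[str, Handler]] = CHAIN) -> MCEMException:
    """Pass one exception along the chain, recording each recovery action."""
    for component, handler in chain:
        action = handler(exc)
        if action is not None:
            exc.recovery_log.append((component, action))
    return exc


exc = handle(MCEMException(fault="Process Failure", location="Node X"))
for component, action in exc.recovery_log:
    print(f"{component}: {action}")
```

Because every handler reads the same `recovery_log`, a later component can adapt to what earlier ones did instead of repeating or contradicting it.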

SLIDE 7

MCEM: Global Exceptions

§ Propagating up works well for exceptions originating from a single, isolated resource (i.e., local exception)
§ The propagation direction is reversed for exceptions originating from a shared resource (i.e., global exception)

[Figure: system stack (Scheduler, Resource Manager, User’s Workflow Manager, Parallel Job, Node, Parallel Filesystem) sharing a Global Event Database and an ML Model. Legend: Detection, Diagnosis, Recovery. Labels shown: Parallel Filesystem Metadata Node Failed, Parallel FS Down, IO Timeouts; Transfer jobs to 2nd PFS, Hold Jobs Requiring PFS.]
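
The two propagation directions can be sketched as a small Python helper. This is an illustration only: the `LEVELS` list and the `propagation_order` function are hypothetical, and the bottom-to-top ordering of levels is an assumption borrowed from the column order of the recovery table later in the deck, since the slides do not state it explicitly.

```python
from typing import List

# Assumed bottom-to-top ordering of the levels (not stated explicitly in the slides).
LEVELS: List[str] = [
    "Resource Manager",
    "Parallel Job",
    "Workflow Manager",
    "Scheduler",
]


def propagation_order(scope: str) -> List[str]:
    """Return the order in which levels handle an exception.

    "local"  -> bottom-up, for a fault in a single, isolated resource.
    "global" -> top-down, for a fault in a shared resource such as the PFS.
    """
    if scope == "local":
        return list(LEVELS)
    if scope == "global":
        return list(reversed(LEVELS))
    raise ValueError(f"unknown exception scope: {scope!r}")


print("local: ", " -> ".join(propagation_order("local")))
print("global:", " -> ".join(propagation_order("global")))
```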

SLIDE 8

MCEM: Fault Model

§ Hard faults
— Segmentation Faults, Node Failures, Network Link Failure, PFS Down, User Exceeded Disk Quota
§ Soft faults
— Network or PFS performance degraded, User Approaching Disk Quota
§ Fault length
— Effects must last long enough to be reliably detected, isolated, and recovered from: O(minutes)

SLIDE 9

MCEM Exception Recovery Examples

Failure Type | Resource Manager | Parallel Job | Workflow Manager | Scheduler
Parallel Launcher Failure | - | - | Retry job (transient); log system error (permanent) | -
Application Failure (e.g., mesh tangling) | - | - | Launch mesh relaxation job | -
Process Failure | Relaunch process | Restart with N ranks | - | Grant job additional time
Node Failure | Mark node down | Restart with N-1 ranks OR request additional node | - | Grant job additional node
User Approaching or Exceeding Disk Quota | - | - | Migrate some/all workflow jobs to secondary filesystem | Hold queued jobs requiring PFS access
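
One way to encode such a recovery table is a nested lookup keyed by failure type and component. The sketch below is hypothetical (the names `RECOVERY_TABLE`, `COMPONENTS`, and `recovery_plan` do not come from the deck) and covers only the Process Failure and Node Failure rows, with abbreviations such as "w/" and "addt’l" spelled out.

```python
from typing import Dict, List, Tuple

# Component order follows the table's column headers.
COMPONENTS: Tuple[str, ...] = (
    "Resource Manager",
    "Parallel Job",
    "Workflow Manager",
    "Scheduler",
)

# Only two rows are encoded here; a component with no action is simply absent.
RECOVERY_TABLE: Dict[str, Dict[str, str]] = {
    "Process Failure": {
        "Resource Manager": "Relaunch process",
        "Parallel Job": "Restart with N ranks",
        "Scheduler": "Grant job additional time",
    },
    "Node Failure": {
        "Resource Manager": "Mark node down",
        "Parallel Job": "Restart with N-1 ranks or request additional node",
        "Scheduler": "Grant job additional node",
    },
}


def recovery_plan(failure_type: str) -> List[Tuple[str, str]]:
    """Return (component, action) pairs in column order; empty if unknown."""
    row = RECOVERY_TABLE.get(failure_type, {})
    return [(c, row[c]) for c in COMPONENTS if c in row]


for component, action in recovery_plan("Node Failure"):
    print(f"{component}: {action}")
```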

SLIDE 10

Quota Exceeded: State of the Practice

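[Figure: system stack (Scheduler, Resource Manager, User’s Workflow Manager, Parallel Job, Parallel Filesystem) responding to a user exceeding their quota. Legend: Detection, Diagnosis, Recovery. Labels shown: User Exceeded Quota, User Above Hard Quota, EQUOT; Migrate to 2nd PFS.]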

SLIDE 11

Quota Exceeded: State of the Art

[Figure: system stack (Scheduler, Resource Manager, User’s Workflow Manager, Parallel Job, Parallel Filesystem) sharing a Global Event Database and an ML Model. Legend: Detection, Diagnosis, Recovery. Labels shown: User Exceeded Quota, User Above Hard Quota, EQUOT; Migrate Some Jobs to 2nd PFS, Migrate to 2nd PFS, Hold User’s Queued Jobs.]

SLIDE 12

Quota Exceeded: MCEM

[Figure: system stack (Scheduler, Resource Manager, User’s Workflow Manager, Parallel Job, Node, Parallel Filesystem) sharing a Global Event Database and an ML Model. Legend: Detection, Diagnosis, Recovery. Labels shown: User Exceeded Quota, User Above Hard Quota, EQUOT; Hold User’s Queued Jobs, Migrate Some Jobs to 2nd PFS.]

SLIDE 13

Evaluation

§ In the state of the art (SOA), parallel applications all transition to the 2nd filesystem, and the workflow manager (WFM) re-transitions some/all of the jobs
§ MCEM allows the WFM to move only the minimal subset of jobs, exactly once

MCEM can reduce IO by up to 90%

SLIDE 14

Implementation: Resource Manager

§ Why to implement within the system RM
— Communication is already implemented and fault-tolerant (hopefully)
— Can be a plugin/module, resulting in less code to write and audit
§ Why not to implement within the system RM
— If the RM daemon dies, so does MCEM
— RM failures then become potentially undetectable and certainly unrecoverable

SLIDE 15

Implementation: Runtime Interface

§ Flux
— flux job raise --severity=1 --type="segmentation fault" $ID '{"rank": "262", "pid": 1182, "node": "quartz454"}'
— flux job eventlog $ID
— flux_event_subscribe (h, "job-exception")
§ PMIx
— PMIx_Notify_event
— PMIx_Register_event_handler
  • Supports registering a handler for multiple events simultaneously
  • “Multi-code” handlers always execute after “single-code” handlers
  • Supports specifying relative handler precedence within a “category”
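
As a small illustration of the Flux command above, the following Python sketch assembles the same argument list programmatically. It only builds and prints the command and does not run it. The helper name `build_raise_argv` and the placeholder job ID `"JOBID"` are hypothetical, and whether a particular Flux version accepts a JSON note in this position is not verified here; the flags simply mirror the slide.

```python
import json
import shlex
from typing import Dict, List, Union


def build_raise_argv(
    job_id: str,
    severity: int,
    exc_type: str,
    context: Dict[str, Union[str, int]],
) -> List[str]:
    """Assemble the `flux job raise` argument list shown on the slide."""
    note = json.dumps(context)  # fault metadata serialized as a JSON note
    return [
        "flux", "job", "raise",
        f"--severity={severity}",
        f"--type={exc_type}",
        job_id,
        note,
    ]


argv = build_raise_argv(
    "JOBID",  # placeholder; the slide uses the shell variable $ID
    1,
    "segmentation fault",
    {"rank": "262", "pid": 1182, "node": "quartz454"},
)

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

Passing the list form to a process launcher avoids shell-quoting problems with the space in the exception type and the quotes inside the JSON note.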

SLIDE 16

Acknowledgements

§ Co-Authors
— David Domyancic
— Paul Minner
— Ignacio Laguna
— Rafael Ferreira da Silva
— Dong H. Ahn
§ Flux Team
— Ned Bass
— Al Chu
— Jim Garlick
— Mark Grondona
— Tapasya Patki
— Tom Scogland
— Becky Springmeyer

SLIDE 17

Disclaimer This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.

SLIDE 18

Backup Slides

SLIDE 19

MCEM’s Exception Propagation Order

[Figure: exception propagation order, with one panel for Local Exceptions and one for Global Exceptions]