Spare Node Substitution for Failure Nodes Kazumi Yoshinaga RIKEN - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Spare Node Substitution for Failure Nodes

Kazumi Yoshinaga RIKEN AICS

slide-2
SLIDE 2

Background

  • In the Exa-flops era, faults could happen more frequently than ever → system MTBF becomes shorter
  • Important issue: recovery from faults
  • Conventional method: system-level Checkpoint-Restart

– Requires massive I/O

  • Many mechanisms to survive failures have been proposed and investigated

– Smaller I/O size
– One of these mechanisms is ULFM (User-Level Failure Mitigation).

  • The user program handles failures
  • The program can survive the failures and continue its execution
  • But there has been no discussion of how a job should survive node failures

slide-3
SLIDE 3

Purpose of this Research

  • What is the best way to survive node failures?

– Assumes a job can survive a node failure by using existing fault mitigation software
– Does not propose a new fault mitigation mechanism
– Proposes a recovery strategy

slide-4
SLIDE 4

Survival from Node Failure

  • Applications with dynamic load balancing

– e.g. Distributed Master-Worker model
– "Avoiding failure nodes" method
– The application continues its execution with only the healthy nodes after a failure

  • How about applications without dynamic load balancing?

– e.g. Stencil Computation

slide-5
SLIDE 5

Avoiding Failure Node(s) for Stencil Computation

  • Stencil computation characteristics

– Communication pattern is fixed
– Load can be balanced

  • When a recovery happens, the above stencil computation characteristics must be preserved
  • However, when avoiding the failed node(s):

– Hard to balance loads
– Impossible to preserve the communication pattern
– Every time a new failure happens, the communication pattern can differ

  • Hard to program !!!

[Figure labels: Failure, x1.5 computation, New comm. pattern]

Using spare nodes to solve these problems

slide-6
SLIDE 6

Using Spare Nodes

  • An application runs with spare nodes
  • If a node failure happens, migrate the task running on the failed node to a spare node

– Loads stay balanced (continues with the same # procs.)
– Logical communication pattern is preserved
– No change in the kernel part of the application
– Some penalties
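
The substitution step above can be pictured as an update to a rank-to-node table. The following is only a minimal sketch under assumed names (`substitute_spare`, the node labels `n0`, `s0`, and so on are hypothetical and do not come from the slides); it shows that the number of ranks, and therefore the load per rank, stays the same after a failure.

```python
def substitute_spare(rank_to_node, spares, failed_node):
    """Move the rank that ran on failed_node to the first free spare node.

    All names are hypothetical. Returns (new rank->node map, remaining spares).
    """
    owners = [rank for rank, node in rank_to_node.items() if node == failed_node]
    if not owners:
        raise KeyError(f"{failed_node} runs no rank")
    if not spares:
        raise RuntimeError("no spare node left")
    replacement, *remaining = spares
    new_map = dict(rank_to_node)
    new_map[owners[0]] = replacement  # same rank, new physical node
    return new_map, remaining


rank_to_node = {0: "n0", 1: "n1", 2: "n2", 3: "n3"}
new_map, spares_left = substitute_spare(rank_to_node, ["s0", "s1"], "n2")
print(new_map)  # rank 2 now lives on s0; the other ranks are untouched
```

Because only the table changes, the application kernel keeps addressing the same logical ranks.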

slide-7
SLIDE 7

Spare Node Penalty-1

  • System Utilization Degradation
  • Spare node allocation
  • System utilization is decreased

[Figure: % Spare Nodes vs. # Nodes (1,000 to 1,000,000) for 3D(3,1), 3D(2,1), 3D(1,1), 2D(2,1), 2D(1,1)]

nD(α,β)
n: dimensions of the network
α: # dimensions of spare nodes
β: spare nodes width
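
To make the nD(α,β) notation concrete, here is a small sketch that counts spare nodes for a given network shape. It assumes that a layer of width β is reserved along each of the first α dimensions. This reading is an assumption, chosen because it matches the node counts quoted later for the evaluation systems (12x12x12 with calc. 11x11x12, and 16x8x8 with calc. 15x7x8). The function name is hypothetical.

```python
from math import prod


def spare_nodes(dims, alpha, beta):
    """Return (spare node count, spare percentage) for an nD(alpha, beta) allocation.

    Assumption: a layer of width beta is reserved along the first alpha dimensions.
    """
    if not 0 <= alpha <= len(dims):
        raise ValueError("alpha must be between 0 and the number of dimensions")
    compute_dims = [d - beta if i < alpha else d for i, d in enumerate(dims)]
    if min(compute_dims) <= 0:
        raise ValueError("spare layers leave no compute nodes")
    total = prod(dims)
    spares = total - prod(compute_dims)
    return spares, 100.0 * spares / total


k_spares, k_pct = spare_nodes((12, 12, 12), alpha=2, beta=1)
bgq_spares, bgq_pct = spare_nodes((16, 8, 8), alpha=2, beta=1)
print(k_spares, bgq_spares)  # 276 184
```

Under the same assumption the spare percentage shrinks as the network grows, since the reserved layers have fixed width.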

slide-8
SLIDE 8

Spare Node Penalty-2

  • Communication Performance Degradation
  • Logical communication pattern can be preserved

– by creating a new MPI communicator that excludes the failed node and includes a spare node.

  • However, the physical communication pattern is not the same, and communication performance (CP) can be degraded.

– Larger hop counts (latency), and
– Possible message collisions
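
One way to build such a communicator is `MPI_Comm_split`, which groups processes by a `color` argument and orders ranks in the new communicator by a `key` argument. The sketch below does not call MPI. It only computes the `(color, key)` pair each surviving process could pass, assuming world ranks `0..n_app-1` start as application ranks and the rest are spares. The function name and this layout are hypothetical, `None` stands in for `MPI_UNDEFINED`, and only a single recovery round is handled.

```python
def comm_split_args(world_size, n_app, failed):
    """Return {world_rank: (color, key)} for every surviving process.

    Assumptions: world ranks 0..n_app-1 initially hold application ranks,
    the remaining world ranks are spares, and None stands for MPI_UNDEFINED.
    """
    failed = set(failed)
    free_spares = [r for r in range(n_app, world_size) if r not in failed]
    # By default a survivor stays out of the application communicator.
    args = {r: (None, 0) for r in range(world_size) if r not in failed}
    for app_rank in range(n_app):
        if app_rank in failed:
            if not free_spares:
                raise RuntimeError("not enough spare processes")
            owner = free_spares.pop(0)  # a spare takes over this logical rank
        else:
            owner = app_rank
        args[owner] = (0, app_rank)  # key = logical rank keeps the rank order
    return args


args = comm_split_args(world_size=6, n_app=4, failed=[2])
print(args[4])  # (0, 2): the first spare becomes logical rank 2
```

Since the key equals the logical rank, neighbors computed from rank numbers stay the same in the new communicator, while the physical location of rank 2 changes.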
slide-9
SLIDE 9

Ex. CP Degradation of Spare Node Substitution

  • Nodes on the topmost row work as spare nodes
  • Up to 5 possible collisions after 1 node failure

– Independent of the # nodes

[Figure: 2D Cartesian network topology (XY routing), 5-point Stencil Computation]

How should faulty nodes be replaced by spare nodes?
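
The effect can be explored with a small model. The sketch below is an illustration under stated assumptions, not the authors' simulator: a "collision" is modeled as several messages sharing one directed link, messages follow X-then-Y routing on a mesh, links around the failed node are assumed to keep forwarding traffic, and all function names are hypothetical.

```python
from collections import Counter


def xy_route(src, dst):
    """Directed links used by dimension-ordered (X first, then Y) routing."""
    links = []
    x, y = src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def max_link_load(placement):
    """placement maps logical (i, j) to physical (x, y).

    Every rank sends one message to each logical neighbor of a 5-point stencil;
    the result is the largest number of messages sharing one directed link.
    """
    load = Counter()
    for (i, j), src in placement.items():
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            neighbor = (i + di, j + dj)
            if neighbor in placement:
                load.update(xy_route(src, placement[neighbor]))
    return max(load.values())


# 6x6 mesh, physical row y = 0 holds spares, ranks occupy rows 1..5.
placement = {(i, j): (i, j + 1) for i in range(6) for j in range(5)}
before = max_link_load(placement)

after_map = dict(placement)
after_map[(2, 3)] = (2, 0)  # rank on failed node (2, 4) restarts on a spare
after = max_link_load(after_map)
print(before, after)
```

Before the failure every message crosses exactly one link of its own. After the simple replacement, the long routes to and from the spare row overlap with other traffic.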

slide-10
SLIDE 10

Sliding Substitution(1)

  • We proposed “Sliding Substitution” methods

– 0D Sliding (simple replace)

  • The failed rank is continued on an alternative node

– 1D Sliding

  • Processes between the failure node and the spare node are shifted

– 2D Sliding

  • All processes between the failure node's row (column) and the spare node's row (column) are shifted

– 3D Sliding, 4D, 5D…

[Figure: node-grid examples of 0D Sliding, 1D Sliding, 2D Sliding]
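
The difference between 0D and 1D sliding can be sketched on a small mesh. This is one interpretation of the descriptions above, assuming spare nodes sit on physical row `y = 0` and that a 1D slide moves every rank between the failed node and the spare one step along the column. The exact shifting rules used by the authors are not given in the text, and the function names are hypothetical.

```python
def slide_0d(placement, failed, spare):
    """Simple replace: only the rank on the failed node moves, to the spare."""
    return {logical: (spare if node == failed else node)
            for logical, node in placement.items()}


def slide_1d(placement, failed, spare_row=0):
    """Shift each rank between the failed node and the spare row one step
    toward the spare, staying within the failed node's column."""
    fx, fy = failed
    step = 1 if spare_row > fy else -1
    lo, hi = min(fy, spare_row), max(fy, spare_row)
    return {logical: ((x, y + step) if x == fx and lo <= y <= hi else (x, y))
            for logical, (x, y) in placement.items()}


def max_neighbor_hops(placement):
    """Largest Manhattan distance between logically adjacent ranks."""
    worst = 0
    for (i, j), (x, y) in placement.items():
        for di, dj in ((1, 0), (0, 1)):
            nb = placement.get((i + di, j + dj))
            if nb is not None:
                worst = max(worst, abs(x - nb[0]) + abs(y - nb[1]))
    return worst


# 6x6 mesh, row y = 0 is spare, ranks occupy rows 1..5; node (2, 4) fails.
placement = {(i, j): (i, j + 1) for i in range(6) for j in range(5)}
hops_0d = max_neighbor_hops(slide_0d(placement, failed=(2, 4), spare=(2, 0)))
hops_1d = max_neighbor_hops(slide_1d(placement, failed=(2, 4)))
print(hops_0d, hops_1d)  # 5 2
```

In this toy layout the 1D slide moves more ranks but keeps every logical neighbor within two hops, whereas the simple replace stretches the failed rank's links across the mesh.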

slide-11
SLIDE 11

Preliminary Evaluation

  • 5P-Stencil on a 2D network
  • Spare Allocation

2D(2,1) > 2D(1,1)

  • Max. Failure

– 0D: up to # Spare
– 1D: 3 (or more)
– 2D: up to 2 (2D Cart. Topo.)

  • Comm. Perf.

2D > 1D > 0D

[Figure: Max. Collisions vs. # Failed Nodes, Torus and Mesh panels; series 0D : 2D(1,1), 0D : 2D(2,1), 1D : 2D(2,1), 2D : 2D(2,1)]
slide-12
SLIDE 12

Sliding Substitution(2)

  • The higher the dimension

– The better the performance
– The smaller the number of failure nodes it can handle

  • 2D or higher dimension Sliding

– Migrates tasks running on healthy nodes
– Free nodes work as new spare nodes

  • Hybrid Sliding

– 3D → 2D → 1D → 0D (on 3D network)

[Figure: 3D Sliding; free nodes work as new spare nodes]

slide-13
SLIDE 13

Evaluation : 7P-Stencil on the K and BG/Q (Hybrid, 3D(2,1), 4MiB)

  • K computer: up to 8 times slower
  • BG/Q: up to 12 times slower

[Figure: Relative latency vs. # Failed Nodes (smaller is better); series Sim. Avg., Sim. Worst, Sim. Best, Exp. Worst; panels: The K Computer, 12x12x12 Nodes (calc. 11x11x12), and BG/Q, 16x8x8 Nodes (calc. 15x7x8)]

slide-14
SLIDE 14

Evaluation: Collectives on the K and BG/Q (Hybrid, 3D(2,1))

[Figure: Rel. latency (Worst Case) vs. # Failed Nodes (smaller is better); panels Barrier(K) and Allreduce(K) with up to 276 failed nodes, Barrier(BG/Q) and Allreduce(BG/Q) with up to 184 failed nodes (based on 16x8x8)]

  • On the K and BG/Q, collective operations are optimized for their network
  • Having spare nodes makes the optimization very difficult
  • BG/Q’s optimization works only with MPI_COMM_WORLD

slide-15
SLIDE 15

Summary

  • We proposed and compared “Sliding Substitution” methods.
  • Communication performance degradation is observed

– 7P-Stencil:

  • Simulation results: up to 40 collisions
  • Experimental results: up to 12 times larger latency

– Collective communications:

  • up to 12 times larger latency (BG/Q, Barrier)
slide-16
SLIDE 16

Future Work

  • Evaluations with real applications
  • Node-Rank re-mapping algorithms, or better substitution methods
  • Discussion on other network topologies

– Experiments using Tsubame 2.5 (Fat-tree) are scheduled