

SLIDE 1

Accurate Prediction of Soft Error Vulnerability of Scientific Applications

Greg Bronevetsky, Post-doctoral Fellow, Lawrence Livermore National Lab

SLIDE 2

Soft error: one-time corruption of system state

  • Examples: memory bit-flips, erroneous computations
  • Caused by
    – Chip variability
    – Charged particles passing through transistors
      • Decay of packaging materials (Lead-208, Boron-10)
      • Fission due to cosmic neutrons
    – Temperature and power fluctuations

SLIDE 3

Soft errors are a critical reliability challenge for supercomputers

  • Real machines:
    – ASCI Q: 26 radiation-induced errors/week
    – Similar-size Cray XD1: 109 errors/week (estimated)
    – BlueGene/L: 3-4 L1 cache bit flips/day
  • Problem grows worse with time
    – Larger machines ⇒ larger error probability
    – SRAMs growing exponentially more vulnerable per chip

SLIDE 4

We must understand the impact of soft errors on applications

  • Soft errors corrupt application state
  • May lead to crashes or corrupt output
  • Need to detect/tolerate soft errors
    – State of the art: checkers/correctors for individual algorithms
    – No general solution
  • Must first understand how errors affect applications
    – Identify the problem
    – Focus efforts

SLIDE 5

Prior work says very little about most applications

  • Prior fault analysis work focuses on injecting errors into individual applications
    – [Lu and Reed, SC04]: Linux + MPICH + Cactus, NAMD, CAM
    – [Messer et al, ICSDN00]: Linux + Apache and Linux + Java (Jess, DB, Javac, Jack)
    – [Some et al, AC02]: Lynx + Mars texture segmentation application
    – …
  • Where’s my application?

SLIDE 6

Extending vulnerability characterization to more applications

  • Goal: general-purpose vulnerability characterization
    – Same accuracy as per-application fault injection
    – Much cheaper
  • Initial steps
    – Fault injection into iterative linear algebra methods
    – Library-based fault vulnerability analysis
    – …

SLIDE 7

Step 1: Analyzing fault vulnerability of iterative methods

  • Target domain: solvers for sparse linear problem Ax=b
  • Goal: understand error vulnerability of class of algorithms
    – Raw error rates
    – Effectiveness of potential solutions
  • Error model: memory bit-flips
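
To make the error model concrete: a minimal sketch (not from the slides) of injecting a single memory bit-flip into a double. Which bit the flip lands on determines how large the resulting data error is.

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit (0-63) of the IEEE-754 representation of x."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y

print(flip_bit(1.0, 0))    # 1.0000000000000002: low mantissa bit, tiny error
print(flip_bit(1.0, 62))   # inf: top exponent bit, catastrophic error
```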

SLIDE 8

Possible run outcomes

  • Success: <10% error
  • Silent Data Corruption (SDC): ≥10% error
  • Hang: method doesn’t reach target tolerance
  • Abort: SegFault or failed SparseLib check
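
A minimal sketch of binning one fault-injected run into these four outcomes; the helper signature is hypothetical, and the 10% threshold is the one above.

```python
import numpy as np

def classify_run(x, x_ref, converged, aborted):
    """Bin a fault-injected solver run into the four outcomes above."""
    if aborted:              # SegFault or failed SparseLib check
        return "Abort"
    if not converged:        # never reached target tolerance
        return "Hang"
    rel_err = np.linalg.norm(x - x_ref) / np.linalg.norm(x_ref)
    return "Success" if rel_err < 0.10 else "SDC"

print(classify_run(np.array([1.0, 2.0]), np.array([1.0, 2.1]), True, False))  # Success
```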

SLIDE 9

Errors cause SDCs, Hangs, and Aborts in ~8-10% of runs each

SLIDE 10

Large-scale applications vulnerable to silent data corruption

  • Scaled to 1-day, 1,000-processor run of application that only calls iterative method
  • 10 FIT/MB DRAM (1,000-5,000 raw FIT/MB, 90%-98% effective error correction)
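
FIT counts failures per 10^9 device-hours, so this scaling is direct arithmetic; a sketch using the slide's 10 FIT/MB figure (the 1 GB of DRAM per processor is an assumption):

```python
fit_per_mb = 10            # post-correction FIT/MB, from the slide
mb_per_proc = 1024         # assumed: 1 GB of DRAM per processor
procs, hours = 1_000, 24   # 1-day, 1,000-processor run

# FIT = failures per 1e9 device-hours
expected_errors = fit_per_mb * mb_per_proc * procs * hours / 1e9
print(expected_errors)     # ~0.25 expected errors for this run
```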

SLIDE 11

Larger-scale applications even more vulnerable to silent data corruption

  • Scaled to 10-day, 100,000-processor run of application that only calls iterative method
  • 10 FIT/MB DRAM (1,000-5,000 raw FIT/MB, 90%-98% effective error correction)

SLIDE 12

Error Detectors

[Chart: outcome rates for Base configuration (no detectors)]

SLIDE 13

Convergence detectors reduce SDC at <20% overhead

[Chart: convergence detector results vs. Base]


SLIDE 15

Native detectors have little effect at little cost

[Chart: native detector results vs. Base]

SLIDE 16

Encoding-based detectors significantly reduce SDC at high cost

[Chart: encoding-based detector results vs. Base]


SLIDE 18

First general analysis of error vulnerability of an algorithm class

  • Vulnerability analysis for class of common subroutines
  • Described raw error vulnerability
  • Analyzed various detection/tolerance techniques
    – No clear winner; rules of thumb

SLIDE 19

Step 2: Vulnerability analysis of library-based applications

  • Many applications mostly composed of calls to library routines
  • If error hits some routine, its output will be corrupted
  • Later routines: corrupted inputs ⇒ corrupted outputs

(Work in progress)

SLIDE 20

Idea: predict application vulnerability from routine profiles

  • Library implementors provide vulnerability profile for each routine:
    – Error pattern in routine’s output after errors
    – Function that maps input error patterns to output error patterns

SLIDE 21

Idea: predict application vulnerability from routine profiles

  • Given application’s dependence graph:
    – Simulate effect of error in each routine
    – Average over all error locations to produce error pattern at outputs
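
A minimal sketch of this simulation, with each routine's vulnerability profile collapsed to a single error-amplification factor instead of a full error-pattern histogram; the graph and factors are hypothetical:

```python
import numpy as np

# Dependence graph in topological order: routine -> its input routines.
graph = {"A": [], "B": ["A"], "C": ["A", "B"]}
amplify = {"A": 1.0, "B": 2.0, "C": 0.5}   # hypothetical per-routine profiles

def output_error(site):
    """Propagate a unit error injected at `site` to the final routine C."""
    err = {r: 0.0 for r in graph}
    err[site] = 1.0
    for r in graph:
        incoming = sum(err[p] for p in graph[r])
        err[r] = max(err[r], amplify[r] * incoming)
    return err["C"]

# Average over all injection sites to estimate the output error pattern.
print(np.mean([output_error(r) for r in graph]))
```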

SLIDE 22

Examined applications that use BLAS and LAPACK

  • 12 routines ≥O(n²), double-precision real numbers
    – Matrix-vector multiplication: DGEMV
    – Matrix-matrix multiplication: DGEMM
    – Rank-1 update: DGER
    – Linear least squares: DGESV, DGELS
    – SVD factorization: DGESVD, DGGSVD, DGESDD
    – Eigenvectors: DGEEV, DGGEV, DGEES, DGGES

SLIDE 23

Examined applications that use BLAS and LAPACK

  • 12 routines ≥O(n²), double-precision real numbers
  • Executed on randomly generated n×n matrices (n = 62, 125, 250, 500)
  • BLAS/LAPACK from Intel’s Math Kernel Library on Opteron (MKL 10) and Itanium 2 (MKL 8)
    – Same results on both
  • Error model: memory bit-flips
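
A sketch of one such injection experiment, with NumPy's matrix multiply standing in for MKL's DGEMM; the measured quantity is the per-element multiplicative output error summarized in the histograms on the next slide:

```python
import struct
import numpy as np

def flip_bit(x, bit):
    """Flip one bit (0-63) of the IEEE-754 representation of x."""
    (b,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", b ^ (1 << bit)))
    return y

rng = np.random.default_rng(0)
n = 62
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
C_ref = A @ B                                # fault-free result

A_bad = A.copy()
i, j = rng.integers(n, size=2)               # random element of A
A_bad[i, j] = flip_bit(A_bad[i, j], int(rng.integers(64)))
C_bad = A_bad @ B                            # result after one memory bit-flip

mult_err = np.abs((C_bad - C_ref) / C_ref)   # per-element multiplicative error
print(mult_err.max())
```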

SLIDE 24

Error patterns: multiplicative error histograms

[Histogram: multiplicative error distribution for DGEMM output]

SLIDE 25

Output error patterns fall into few major categories

[Histograms: DGGES output beta (62×1), DGESV output L (62×62), DGGES output vsr (62×62), DGEMM output C (62×62)]

SLIDE 26

Error patterns may vary with matrix size

[Histograms: DGGSVD outputs beta and V, for n = 62, 125, 250, 500]

SLIDE 27

Input-output error transition functions

  • Input-output error transition functions: trained predictors
    – Linear Least Squares
    – Support Vector Machines (linear, 2nd-degree polynomial, and rbf kernels)
    – Artificial Neural Nets (3, 10, 100 layers; linear, Gaussian, symmetric Gaussian, and sigmoid transfer functions)
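
A sketch of this training setup with scikit-learn stand-ins for the three predictor families; encoding error patterns as fixed-length histogram vectors is an assumption, since the slides do not spell out the feature representation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# X: input-error histograms (one row per injected run); y: one output-histogram bin.
X = rng.random((200, 16))
y = X @ rng.random(16) + 0.01 * rng.standard_normal(200)

models = {
    "linear least squares": LinearRegression(),
    "SVM (rbf kernel)": SVR(kernel="rbf"),
    "neural net": MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, round(model.score(X, y), 3))   # R^2 on the training data
```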

SLIDE 28

Trained on multiple input error patterns

  • DataInj: single-bit errors
  • DataInj-R: output errors of routines with DataInj inputs
  • UniInj: uniform multiplicative errors ∈ [-100, 100]
  • UniInj-R: output errors of routines with UniInj inputs
  • Inj-R: output errors of error-injected routines

SLIDE 29

Input-output error transition functions

  • Input-output error transition functions: trained predictors
    – Linear Least Squares
    – Support Vector Machines
    – Artificial Neural Nets
  • Trained on sample input error patterns:
    – UniInj: uniform multiplicative errors ∈ [-100, 100]
    – UniInj-R: outputs of routines with UniInj inputs
    – DataInj: single-bit errors
    – DataInj-R: outputs of routines with DataInj inputs
    – Inj-R: outputs of error-injected routines

SLIDE 30

Output errors depend on input errors

  • Equivalence classes:
    – DataInj, DataInj-R | Inj-R
    – DataUni, DataUni-R

SLIDE 31

Evaluated accuracy of all predictors

  • On all training sets
  • Error metric:
    – Probability of error ≥ δ
    – δ ∈ {1e-14, 1e-13, …, 2, 10, 100}

[Chart: recorded vs. predicted error probabilities]
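
The metric is straightforward to compute from samples of recorded and predicted errors; a sketch over a subset of the δ grid above:

```python
import numpy as np

def tail_prob(errors, deltas):
    """P(error >= delta) for each threshold in deltas."""
    errors = np.asarray(errors)
    return np.array([(errors >= d).mean() for d in deltas])

deltas = [1e-14, 1e-13, 1e-12, 2, 10, 100]   # subset of the slide's grid
recorded = np.abs(np.random.default_rng(0).standard_normal(1000))
predicted = np.abs(np.random.default_rng(1).standard_normal(1000))

# Predictor accuracy: gap between recorded and predicted tail probabilities.
print(np.abs(tail_prob(recorded, deltas) - tail_prob(predicted, deltas)).max())
```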

SLIDE 32

Evaluated accuracy of all predictors

  • On all training sets

[Charts: recorded vs. predicted error probability and recorded vs. predicted error, across training sets]

SLIDE 33

Linear Least Squares has best accuracy, Neural nets worst

Evaluation set: union of all training sets

SLIDE 34

Linear Least Squares has best accuracy, Neural nets worst

SLIDE 35

Accuracy varies among predictors

[Chart: predictor accuracy for DGEES, output wr]

SLIDE 36

Linear Least Squares has best accuracy, Neural nets worst

[Chart: prediction accuracy vs. error threshold δ; series: e-HeapInj.none.All, uni0-1All, inj0-1All, e-uni0-1All, e-inj0-1All]

SLIDE 37

Linear Least Squares has best accuracy, Neural nets worst

[Chart: predictor accuracy per training set: Inj-R, DataInj, DataInj-R, DataUni, DataUni-R]

SLIDE 38

Evaluated predictors on randomly generated applications

  • Application has constant number of levels
  • Constant number of operations per level
  • Operations use data from prior level(s) as input
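
A sketch of generating and evaluating one such synthetic application; the level/width counts and the per-operation transfer functions (a single gain applied to the worst incoming error) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
levels, width = 4, 3
gains = rng.uniform(0.5, 2.0, size=(levels, width))  # hypothetical routine profiles

def app_output_error(inj_level, inj_op):
    """Output error of the whole app after a unit error at one operation."""
    err = np.zeros((levels, width))
    err[inj_level, inj_op] = 1.0
    for lvl in range(1, levels):
        for op in range(width):
            # Each operation amplifies the worst error among prior-level outputs.
            err[lvl, op] = max(err[lvl, op], gains[lvl, op] * err[lvl - 1].max())
    return err[-1].max()

sites = [(l, o) for l in range(levels) for o in range(width)]
print(np.mean([app_output_error(l, o) for l, o in sites]))  # average vulnerability
```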

SLIDE 39

Neural Nets: Poor accuracy for application vulnerability prediction

[Chart: recorded vs. predicted error; transfer function = sigmoid, 3 hidden layers]

SLIDE 40

Linear Least Squares: Good accuracy, restricted

[Chart: recorded vs. predicted error]

SLIDE 41

SVMs: Good accuracy, general

[Chart: recorded vs. predicted error; kernel = rbf, gamma = 1.0]

SLIDE 42

Work is still in progress

  • Correlating accuracy of input/output predictors to accuracy of application prediction
  • More detailed fault injection
  • Applications with loops
  • Real applications

SLIDE 43

Step 3: Compiler analyses

  • No need to focus on external libraries
  • Can use compiler analysis to:
    – Do fault injection/propagation on per-function basis
    – Propagate error profiles through more data structures (matrix, scalar, tree, etc.)

SLIDE 44

Step 4: Scalable analysis of parallel applications

  • Cannot do fault injection on 1,000-process application
  • Can modularize fault injection
    – Inject into individual processes

SLIDE 45

Step 4: Scalable analysis of parallel applications

  • Cannot do fault injection on 1,000-process application
  • Can modularize fault injection
    – Inject into single-process runs
    – Propagate through small-scale runs
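
One simple way per-process results could then be combined, under the (strong) assumption that errors strike processes independently; the numbers are illustrative:

```python
# p: per-process SDC probability measured by single-process fault injection.
p, n_procs = 1e-4, 1_000

# Probability that at least one of the n processes is corrupted.
p_app = 1 - (1 - p) ** n_procs
print(round(p_app, 4))   # ~0.0952: small per-process rates compound at scale
```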

SLIDE 46

Working toward understanding application vulnerability to errors

  • Soft errors becoming an increasing problem on HPC systems
  • Must understand how applications react to soft errors
  • Traditional approaches inefficient for realistic applications
  • Developing tools to cheaply understand vulnerability of real scientific applications