

SLIDE 1

Accurate Prediction of Soft Error Vulnerability of Scientific Applications

Greg Bronevetsky, Post-doctoral Fellow, Lawrence Livermore National Lab

SLIDE 2

Soft error: one-time corruption of system state

  • Examples: memory bit-flips, erroneous computations
  • Caused by
    – Chip variability
    – Charged particles passing through transistors
      • Decay of packaging materials (Lead-208, Boron-10)
      • Fission due to cosmic neutrons
    – Temperature and power fluctuations

SLIDE 3

Soft errors are a critical reliability challenge for supercomputers

  • Real machines:
    – ASCI Q: 26 radiation-induced errors/week
    – Similar-size Cray XD1: 109 errors/week (estimated)
    – BlueGene/L: 3-4 L1 cache bit flips/day
  • Problem grows worse with time
    – Larger machines ⇒ larger error probability
    – SRAMs growing exponentially more vulnerable per chip

SLIDE 4

We must understand the impact of soft errors on applications

  • Soft errors corrupt application state
  • May lead to crashes or corrupt output
  • Need to detect/tolerate soft errors
    – State of the art: checkers/correctors for individual algorithms
    – No general solution
  • Must first understand how errors affect applications
    – Identify the problem
    – Focus efforts

SLIDE 5

Prior work says very little about most applications

  • Prior fault analysis work focuses on injecting errors into individual applications
    – [Lu and Reed, SC04]: Linux + MPICH + Cactus, NAMD, CAM
    – [Messer et al, ICSDN00]: Linux + Apache and Linux + Java (Jess, DB, Javac, Jack)
    – [Some et al, AC02]: Lynx + Mars texture segmentation application
    – …
  • Where’s my application?

SLIDE 6

Extending vulnerability characterization to more applications

  • Goal: general-purpose vulnerability characterization
    – Same accuracy as per-application fault injection
    – Much cheaper
  • Initial steps
    – Fault injection into iterative linear algebra methods
    – Library-based fault vulnerability analysis
    – …

SLIDE 7

Step 1: Analyzing fault vulnerability of iterative methods

  • Target domain: solvers for sparse linear problem Ax=b
  • Goal: understand error vulnerability of class of algorithms
    – Raw error rates
    – Effectiveness of potential solutions
  • Error model: memory bit-flips
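
To make the error model concrete: a minimal sketch (not from the slides) of injecting a single memory bit-flip into a double. Which bit the flip lands on determines how large the resulting data error is.

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit (0-63) of the IEEE-754 representation of x."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y

print(flip_bit(1.0, 0))    # 1.0000000000000002: low mantissa bit, tiny error
print(flip_bit(1.0, 62))   # inf: top exponent bit, catastrophic error
```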

SLIDE 8

Possible run outcomes

  • Success: <10% error
  • Silent Data Corruption (SDC): ≥10% error
  • Hang: method doesn’t reach target tolerance
  • Abort: SegFault or failed SparseLib check
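
A minimal sketch of binning one fault-injected run into these four outcomes; the helper signature is hypothetical, and the 10% threshold is the one above.

```python
import numpy as np

def classify_run(x, x_ref, converged, aborted):
    """Bin a fault-injected solver run into the four outcomes above."""
    if aborted:              # SegFault or failed SparseLib check
        return "Abort"
    if not converged:        # never reached target tolerance
        return "Hang"
    rel_err = np.linalg.norm(x - x_ref) / np.linalg.norm(x_ref)
    return "Success" if rel_err < 0.10 else "SDC"

print(classify_run(np.array([1.0, 2.0]), np.array([1.0, 2.1]), True, False))  # Success
```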

SLIDE 9

Errors cause SDCs, Hangs, and Aborts in ~8-10% of runs each

SLIDE 10

Large-scale applications vulnerable to silent data corruption

  • Scaled to 1-day, 1,000-processor run of application that only calls iterative method
  • 10 FIT/MB DRAM (1,000-5,000 raw FIT/MB, 90%-98% effective error correction)
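
FIT counts failures per 10^9 device-hours, so this scaling is direct arithmetic; a sketch using the slide's 10 FIT/MB figure (the 1 GB of DRAM per processor is an assumption):

```python
fit_per_mb = 10            # post-correction FIT/MB, from the slide
mb_per_proc = 1024         # assumed: 1 GB of DRAM per processor
procs, hours = 1_000, 24   # 1-day, 1,000-processor run

# FIT = failures per 1e9 device-hours
expected_errors = fit_per_mb * mb_per_proc * procs * hours / 1e9
print(expected_errors)     # ~0.25 expected errors for this run
```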

SLIDE 11

Larger-scale applications even more vulnerable to silent data corruption

  • Scaled to 10-day, 100,000-processor run of application that only calls iterative method
  • 10 FIT/MB DRAM (1,000-5,000 raw FIT/MB, 90%-98% effective error correction)

SLIDE 12

Error Detectors

[Chart: outcome rates for Base configuration (no detectors)]

SLIDE 13

Convergence detectors reduce SDC at <20% overhead

[Chart: convergence detector results vs. Base]


SLIDE 15

Native detectors have little effect at little cost

[Chart: native detector results vs. Base]

SLIDE 16

Encoding-based detectors significantly reduce SDC at high cost

[Chart: encoding-based detector results vs. Base]


SLIDE 18

First general analysis of error vulnerability of an algorithm class

  • Vulnerability analysis for class of common subroutines
  • Described raw error vulnerability
  • Analyzed various detection/tolerance techniques
    – No clear winner; rules of thumb

SLIDE 19

Step 2: Vulnerability analysis of library-based applications

  • Many applications mostly composed of calls to library routines
  • If error hits some routine, its output will be corrupted
  • Later routines: corrupted inputs ⇒ corrupted outputs

(Work in progress)

SLIDE 20

Idea: predict application vulnerability from routine profiles

  • Library implementors provide vulnerability profile for each routine:
    – Error pattern in routine’s output after errors
    – Function that maps input error patterns to output error patterns

SLIDE 21

Idea: predict application vulnerability from routine profiles

  • Given application’s dependence graph:
    – Simulate effect of error in each routine
    – Average over all error locations to produce error pattern at outputs
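
A minimal sketch of this simulation, with each routine's vulnerability profile collapsed to a single error-amplification factor instead of a full error-pattern histogram; the graph and factors are hypothetical:

```python
import numpy as np

# Dependence graph in topological order: routine -> its input routines.
graph = {"A": [], "B": ["A"], "C": ["A", "B"]}
amplify = {"A": 1.0, "B": 2.0, "C": 0.5}   # hypothetical per-routine profiles

def output_error(site):
    """Propagate a unit error injected at `site` to the final routine C."""
    err = {r: 0.0 for r in graph}
    err[site] = 1.0
    for r in graph:
        incoming = sum(err[p] for p in graph[r])
        err[r] = max(err[r], amplify[r] * incoming)
    return err["C"]

# Average over all injection sites to estimate the output error pattern.
print(np.mean([output_error(r) for r in graph]))
```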

SLIDE 22

Examined applications that use BLAS and LAPACK

  • 12 routines ≥O(n²), double-precision real numbers
    – Matrix-vector multiplication: DGEMV
    – Matrix-matrix multiplication: DGEMM
    – Rank-1 update: DGER
    – Linear least squares: DGESV, DGELS
    – SVD factorization: DGESVD, DGGSVD, DGESDD
    – Eigenvectors: DGEEV, DGGEV, DGEES, DGGES

SLIDE 23

Examined applications that use BLAS and LAPACK

  • 12 routines ≥O(n²), double-precision real numbers
  • Executed on randomly generated n×n matrices (n = 62, 125, 250, 500)
  • BLAS/LAPACK from Intel’s Math Kernel Library on Opteron (MKL 10) and Itanium 2 (MKL 8)
    – Same results on both
  • Error model: memory bit-flips
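
A sketch of one such injection experiment, with NumPy's matrix multiply standing in for MKL's DGEMM; the measured quantity is the per-element multiplicative output error summarized in the histograms on the next slide:

```python
import struct
import numpy as np

def flip_bit(x, bit):
    """Flip one bit (0-63) of the IEEE-754 representation of x."""
    (b,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", b ^ (1 << bit)))
    return y

rng = np.random.default_rng(0)
n = 62
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
C_ref = A @ B                                # fault-free result

A_bad = A.copy()
i, j = rng.integers(n, size=2)               # random element of A
A_bad[i, j] = flip_bit(A_bad[i, j], int(rng.integers(64)))
C_bad = A_bad @ B                            # result after one memory bit-flip

mult_err = np.abs((C_bad - C_ref) / C_ref)   # per-element multiplicative error
print(mult_err.max())
```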

SLIDE 24

Error patterns: multiplicative error histograms

[Histogram: multiplicative error distribution for DGEMM output]

SLIDE 25

Output error patterns fall into few major categories

[Histograms: DGGES output beta (62×1), DGESV output L (62×62), DGGES output vsr (62×62), DGEMM output C (62×62)]

SLIDE 26

Error patterns may vary with matrix size

[Histograms: DGGSVD outputs beta and V, for n = 62, 125, 250, 500]

SLIDE 27

Input-output error transition functions

  • Input-output error transition functions: trained predictors
    – Linear Least Squares
    – Support Vector Machines (linear, 2nd-degree polynomial, and rbf kernels)
    – Artificial Neural Nets (3, 10, 100 layers; linear, Gaussian, symmetric Gaussian, and sigmoid transfer functions)
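
A sketch of this training setup with scikit-learn stand-ins for the three predictor families; encoding error patterns as fixed-length histogram vectors is an assumption, since the slides do not spell out the feature representation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# X: input-error histograms (one row per injected run); y: one output-histogram bin.
X = rng.random((200, 16))
y = X @ rng.random(16) + 0.01 * rng.standard_normal(200)

models = {
    "linear least squares": LinearRegression(),
    "SVM (rbf kernel)": SVR(kernel="rbf"),
    "neural net": MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, round(model.score(X, y), 3))   # R^2 on the training data
```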

SLIDE 28

Trained on multiple input error patterns

  • DataInj: single-bit errors
  • DataInj-R: output errors of routines with DataInj inputs
  • UniInj: uniform multiplicative errors ∈ [-100, 100]
  • UniInj-R: output errors of routines with UniInj inputs
  • Inj-R: output errors of error-injected routines

SLIDE 29

Input-output error transition functions

  • Input-output error transition functions: trained predictors
    – Linear Least Squares
    – Support Vector Machines
    – Artificial Neural Nets
  • Trained on sample input error patterns:
    – UniInj: uniform multiplicative errors ∈ [-100, 100]
    – UniInj-R: outputs of routines with UniInj inputs
    – DataInj: single-bit errors
    – DataInj-R: outputs of routines with DataInj inputs
    – Inj-R: outputs of error-injected routines

SLIDE 30

Output errors depend on input errors

  • Equivalence classes:
    – DataInj, DataInj-R | Inj-R
    – DataUni, DataUni-R

SLIDE 31

Evaluated accuracy of all predictors

  • On all training sets
  • Error metric:
    – Probability of error ≥ δ
    – δ ∈ {1e-14, 1e-13, …, 2, 10, 100}

[Chart: recorded vs. predicted error probabilities]
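
The metric is straightforward to compute from samples of recorded and predicted errors; a sketch over a subset of the δ grid above:

```python
import numpy as np

def tail_prob(errors, deltas):
    """P(error >= delta) for each threshold in deltas."""
    errors = np.asarray(errors)
    return np.array([(errors >= d).mean() for d in deltas])

deltas = [1e-14, 1e-13, 1e-12, 2, 10, 100]   # subset of the slide's grid
recorded = np.abs(np.random.default_rng(0).standard_normal(1000))
predicted = np.abs(np.random.default_rng(1).standard_normal(1000))

# Predictor accuracy: gap between recorded and predicted tail probabilities.
print(np.abs(tail_prob(recorded, deltas) - tail_prob(predicted, deltas)).max())
```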

SLIDE 32

Evaluated accuracy of all predictors

  • On all training sets

[Charts: recorded vs. predicted error probability and recorded vs. predicted error, across training sets]

SLIDE 33

Linear Least Squares has best accuracy, Neural nets worst

Evaluation set: union of all training sets

SLIDE 34

Linear Least Squares has best accuracy, Neural nets worst

SLIDE 35

Accuracy varies among predictors

[Chart: predictor accuracy for DGEES, output wr]

SLIDE 36

Linear Least Squares has best accuracy, Neural nets worst

[Chart: prediction accuracy vs. error threshold δ; series: e-HeapInj.none.All, uni0-1All, inj0-1All, e-uni0-1All, e-inj0-1All]

SLIDE 37

Linear Least Squares has best accuracy, Neural nets worst

[Chart: predictor accuracy per training set: Inj-R, DataInj, DataInj-R, DataUni, DataUni-R]

SLIDE 38

Evaluated predictors on randomly generated applications

  • Application has constant number of levels
  • Constant number of operations per level
  • Operations use data from prior level(s) as input
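
A sketch of generating and evaluating one such synthetic application; the level/width counts and the per-operation transfer functions (a single gain applied to the worst incoming error) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
levels, width = 4, 3
gains = rng.uniform(0.5, 2.0, size=(levels, width))  # hypothetical routine profiles

def app_output_error(inj_level, inj_op):
    """Output error of the whole app after a unit error at one operation."""
    err = np.zeros((levels, width))
    err[inj_level, inj_op] = 1.0
    for lvl in range(1, levels):
        for op in range(width):
            # Each operation amplifies the worst error among prior-level outputs.
            err[lvl, op] = max(err[lvl, op], gains[lvl, op] * err[lvl - 1].max())
    return err[-1].max()

sites = [(l, o) for l in range(levels) for o in range(width)]
print(np.mean([app_output_error(l, o) for l, o in sites]))  # average vulnerability
```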

SLIDE 39

Neural Nets: Poor accuracy for application vulnerability prediction

[Chart: recorded vs. predicted error; transfer function = sigmoid, 3 hidden layers]

SLIDE 40

Linear Least Squares: Good accuracy, restricted

[Chart: recorded vs. predicted error]

SLIDE 41

SVMs: Good accuracy, general

[Chart: recorded vs. predicted error; kernel = rbf, gamma = 1.0]

SLIDE 42

Work is still in progress

  • Correlating accuracy of input/output predictors to accuracy of application prediction
  • More detailed fault injection
  • Applications with loops
  • Real applications

SLIDE 43

Step 3: Compiler analyses

  • No need to focus on external libraries
  • Can use compiler analysis to:
    – Do fault injection/propagation on per-function basis
    – Propagate error profiles through more data structures (matrix, scalar, tree, etc.)

SLIDE 44

Step 4: Scalable analysis of parallel applications

  • Cannot do fault injection on 1,000-process application
  • Can modularize fault injection
    – Inject into individual processes

SLIDE 45

Step 4: Scalable analysis of parallel applications

  • Cannot do fault injection on 1,000-process application
  • Can modularize fault injection
    – Inject into single-process runs
    – Propagate through small-scale runs
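
One simple way per-process results could then be combined, under the (strong) assumption that errors strike processes independently; the numbers are illustrative:

```python
# p: per-process SDC probability measured by single-process fault injection.
p, n_procs = 1e-4, 1_000

# Probability that at least one of the n processes is corrupted.
p_app = 1 - (1 - p) ** n_procs
print(round(p_app, 4))   # ~0.0952: small per-process rates compound at scale
```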

SLIDE 46

Working toward understanding application vulnerability to errors

  • Soft errors becoming an increasing problem on HPC systems
  • Must understand how applications react to soft errors
  • Traditional approaches inefficient for realistic applications
  • Developing tools to cheaply understand vulnerability of real scientific applications