Coping with Soft Errors in Asynchronous Burst-Mode Machines




slide-1
SLIDE 1

Coping with Soft Errors in Asynchronous Burst-Mode Machines

Sobeeh Almukhaizim, Computer Engineering Dept., Kuwait University, Kuwait
Feng Shi & Yiorgos Makris, Electrical Engineering Dept., Yale University, USA

ASYNC’08, 4/11/2008

slide-2
SLIDE 2

Sources of Soft Errors

  • “Galactic particles”: high-energy particles that penetrate to Earth’s surface, through buildings and walls
  • “Solar particles”: affect satellites; may also penetrate to Earth
  • High-energy particles collide with silicon atoms
  • The collision generates a voltage pulse at the impact site
  • Under certain conditions, the pulse may produce a soft error

slide-3
SLIDE 3

Frequency of Soft Errors

Soft Error Rate Trends [S. Borkar et al., Intel, DAC’04]

[Figure: relative soft error rate increase (axis ticks 50, 100, 150) versus chip feature size (180, 130, 90, 65, 45, 32, 22, 16), with markers "we are approximately here" and "6 years from now"]

  • Integrated circuits (synchronous & asynchronous) will require methods to tolerate / mitigate soft errors and ensure reliability

slide-4
SLIDE 4

Soft Error Tolerance & Mitigation in ASYNC

  • Previous studies targeted Quasi Delay-Insensitive (QDI) circuits
  • SEU-tolerant QDI circuits (W. Jang & A. Martin, ASYNC, 156-165, 2005):
      • Gate-level fine-grain duplication and double-checking
      • Fine granularity results in high overhead
  • Soft error susceptibility estimation & mitigation in QDI circuits (Y. Monnet, M. Renaudin, and R. Leveugle, Trans. on Computers, 55(9): 1104-1115 (2006)):
      • Susceptibility (or sensitivity) is defined with respect to the number of errors at the inputs of the C-element that are necessary to flip its state
      • Several soft error mitigation (or hardening) methods are presented

[Figure: a gate with signals x, y, z, w and its duplicated, double-checked counterpart (x1, y1, z1, w1 and x2, y2, z2, w2) built with C-elements; annotation "Transient error is blocked by C-elements"]
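The blocking behavior and the "number of input errors needed to flip the state" notion can be illustrated with a small behavioral model. This is only a sketch, not the method of the cited papers: it assumes an ideal Muller C-element (the output follows the inputs when they all agree and holds otherwise), and the names `c_element` and `errors_to_flip` are hypothetical.

```python
def c_element(inputs, state):
    """Ideal Muller C-element (assumed model): switch only when all inputs agree."""
    if all(v == 1 for v in inputs):
        return 1
    if all(v == 0 for v in inputs):
        return 0
    return state  # inputs disagree: hold the stored value


def errors_to_flip(inputs, state):
    """Count the input bits that must be wrong at once to flip the stored state."""
    target = 1 - state
    return sum(1 for v in inputs if v != target)


# Inputs and state are all 0; a transient on one input is blocked.
held = c_element([1, 0], state=0)
# If both inputs are wrong at the same time, the state flips.
flipped = c_element([1, 1], state=0)

needed_quiet = errors_to_flip([0, 0], state=0)  # both inputs must be hit
needed_half = errors_to_flip([1, 0], state=0)   # one input already switched
```

Under this model a C-element is most exposed while it waits with one input already switched, since a single further error then flips its state.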

slide-5
SLIDE 5

Asynchronous Burst-Mode Machines

  • Interaction between the circuit and its environment happens in bursts:
      • Input burst: a set of bit changes in any order and at any time
      • Outputs and state do not change during an input burst
      • Once the input burst is complete, the circuit responds with a hazard-free output burst

[Figure: asynchronous controller with its Inputs and Outputs]

  • Particle strikes may cause logic errors or hazards

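The burst-mode rule described on this slide can be sketched as a tiny event-driven model. This is a simplified illustration under stated assumptions: one transition only, each input toggles at most once per burst, and the class name `BurstModeStep` and the signal names are hypothetical.

```python
class BurstModeStep:
    """One burst-mode transition: wait for the full input burst, then emit the output burst."""

    def __init__(self, input_burst, output_burst):
        self.expected = set(input_burst)      # inputs that must change in this burst
        self.output_burst = set(output_burst)  # outputs that change in response
        self.seen = set()

    def toggle(self, signal):
        """Record one input change; return the outputs that fire (empty while the burst is incomplete)."""
        if signal not in self.expected:
            raise ValueError(f"unexpected input change: {signal}")
        self.seen.add(signal)
        if self.seen == self.expected:
            return set(self.output_burst)
        return set()  # outputs and state hold during an incomplete input burst


step = BurstModeStep(input_burst={"a", "b"}, output_burst={"x"})
first = step.toggle("b")   # arrival order does not matter; nothing fires yet
second = step.toggle("a")  # burst complete: the output burst fires
```

A particle strike corresponds to a change this model would reject or miscount, which is why it can show up as a logic error or a hazard.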
slide-6
SLIDE 6

Coping with Soft Errors in ABMMs

Methods to Cope with Soft Errors in ABMMs
  • Tolerance Techniques
      • TMR-based
      • Duplication-based
  • Mitigation Techniques

slide-7
SLIDE 7

TMR-based Soft Error Tolerance in ABMMs

[Figure: TMR structure with the original circuit, Replica 1 and Replica 2, and C-elements on the Output and State lines]

  • C-element used as majority voter
  • Strikes at state-line C-elements are not tolerated

slide-8
SLIDE 8

Duplication-based Soft Error Tolerance

  • Observation: 2-input C-elements are sufficient to tolerate one failing module (i.e., only one replica is needed)

[Figure: original circuit and one replica, with C-elements on the Output and State lines]

  • Strikes at state-line C-elements are still not tolerated
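The observation that one replica suffices can be checked with a small sample-by-sample model. This is a sketch under assumptions: time is discretized, the 2-input C-element is ideal, only one module is disturbed at a time, and strikes on the C-element itself are not modeled. The function names and traces are hypothetical.

```python
def c_element(a, b, state):
    """Ideal 2-input C-element: follow the inputs when they agree, otherwise hold."""
    return a if a == b else state


def guarded_output(original_trace, replica_trace, initial=0):
    """Pass the two module outputs through a C-element, one sample at a time."""
    out, state = [], initial
    for a, b in zip(original_trace, replica_trace):
        state = c_element(a, b, state)
        out.append(state)
    return out


fault_free = [0, 0, 1, 1, 1, 0]
struck = [0, 1, 1, 1, 0, 0]  # hypothetical transients at samples 1 and 4
result = guarded_output(fault_free, struck)
```

In this run the guarded output equals the fault-free trace, because each transient makes the two modules disagree and the C-element simply holds its value until they agree again.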

slide-9
SLIDE 9

Tolerating Errors on State-Line C-Elements

  • Proposed solution: cross-coupled structure of C-elements

[Figure: original circuit and replica with C-elements on the Output, State1 and State2 lines; annotation "Transient error is blocked"]

  • All strikes at state-line C-elements are now tolerated

slide-10
SLIDE 10

Example

  • 1. Insert original circuit
  • 2. Insert duplicate circuit
  • 3. Insert state-line C-elements
  • 4. Insert output C-elements

slide-11
SLIDE 11

Experimental Results

Duplication-based Soft Error Tolerance

Circuit Name   I/S/O    Original  Duplicate  C-elements  Total  Overhead
hp-ir          3/1/2           8          8          18     34   325.00%
concur-mixer   3/2/3          16         16          33     65   306.25%
tangram-mixer  3/1/2          10         10          18     38   280.00%
rf-control     6/3/5          37         37          51    125   237.84%
while_concur   4/2/3          24         24          33     81   237.50%
barcode        13/4/17       172        172          99    443   157.56%
p2             8/4/16        192        192          96    480   150.00%
p1             13/4/14       238        238          90    566   137.82%

Area overhead seems excessive for small circuits: the cost is inflated by the proportionately large number of C-elements relative to logic gates, and by the rather expensive C-element implementation used.
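The Total and Overhead columns can be recomputed from the other columns. The sketch below assumes Total = Original + Duplicate + C-elements and Overhead = (Total - Original) / Original, which reproduces every row. It also reads I/S/O as inputs/state lines/outputs (an assumption) and notes that the C-elements column fits 12 units per state line plus 3 per output line; that fit is an observation from the table, not a figure stated in the slides.

```python
# (circuit, (inputs, state lines, outputs), original, duplicate, C-elements, reported overhead %)
ROWS = [
    ("hp-ir", (3, 1, 2), 8, 8, 18, 325.00),
    ("concur-mixer", (3, 2, 3), 16, 16, 33, 306.25),
    ("tangram-mixer", (3, 1, 2), 10, 10, 18, 280.00),
    ("rf-control", (6, 3, 5), 37, 37, 51, 237.84),
    ("while_concur", (4, 2, 3), 24, 24, 33, 237.50),
    ("barcode", (13, 4, 17), 172, 172, 99, 157.56),
    ("p2", (8, 4, 16), 192, 192, 96, 150.00),
    ("p1", (13, 4, 14), 238, 238, 90, 137.82),
]


def overhead_percent(original, duplicate, c_elements):
    """Area added on top of the original circuit, as a percentage of the original."""
    total = original + duplicate + c_elements
    return 100.0 * (total - original) / original


def fitted_c_element_area(states, outputs, per_state=12, per_output=3):
    """Fit observed in the table (assumption): 12 units per state line, 3 per output."""
    return per_state * states + per_output * outputs


recomputed = {
    name: round(overhead_percent(orig, dup, c), 2)
    for name, _, orig, dup, c, _ in ROWS
}
```

The fixed per-line C-element cost explains the trend in the table: it dominates for small circuits such as hp-ir and is amortized for larger ones such as p1.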

slide-12
SLIDE 12

Coping with Soft Errors in ABMMs

Methods to Cope with Soft Errors in ABMMs
  • Tolerance Techniques
  • Mitigation Techniques
      • Soft error susceptibility estimation
      • Sensitive gates
      • Sensitive complete logic cones
      • Sensitive partial logic cones

slide-13
SLIDE 13

Soft Error Susceptibility Estimation

  • A hazard-aware asynchronous fault simulator is needed (SPIN-SIM: F. Shi and Y. Makris, ITC, 597-606 (2004))
  • Fault simulate & construct a soft error susceptibility table (sest):

                                Potential SETs
State & Input Burst Pair   f1      f2      ..   fp
SIB1                       11000   00000   ..   00001
SIB2                       01001   11001   ..   11001
..                         ..      ..      ..   ..
SIBm                       11001   00010   ..   00000

\[ \mathit{susc}(G_q) = \frac{\sum_{i=1}^{m} \sum_{j=s+1}^{s+k_q} E(\mathit{sest}[i,j])}{m \cdot k_q}, \quad \text{where } s = \sum_{l=1}^{q-1} k_l \]

\[ \mathit{SER}(\mathit{ABMM}) = \sum_{q=1}^{n} \mathit{susc}(G_q) \]

  • Asymmetric soft error susceptibility of gates in different levels
  • Enables judicious selection and replication in a partial duplicate (K. Mohanram and N. A. Touba, ITC, 893-901 (2003))
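The two formulas on this slide can be evaluated on a toy table. The sketch below makes several assumptions that the slide does not spell out: E(entry) is taken to be 1 when the recorded vector contains any 1 bit and 0 otherwise, the columns of sest are grouped so that gate q owns k_q consecutive SET columns, and SER is read as the sum of the per-gate susceptibilities. The table contents and the name `susceptibilities` are hypothetical.

```python
def E(entry):
    """Assumed indicator: 1 if the vector shows at least one erroneous line, else 0."""
    return 1 if "1" in entry else 0


def susceptibilities(sest, k):
    """Per-gate susceptibility: erroneous (SIB, SET) pairs of a gate divided by m * k_q."""
    m = len(sest)  # number of state & input burst pairs (rows)
    result, s = [], 0
    for k_q in k:  # gate q owns columns s .. s + k_q - 1 (0-based)
        hits = sum(E(sest[i][j]) for i in range(m) for j in range(s, s + k_q))
        result.append(hits / (m * k_q))
        s += k_q
    return result


# Hypothetical table: 3 SIBs (rows), 3 potential SETs (columns).
sest = [
    ["11000", "00000", "00001"],
    ["01001", "11001", "11001"],
    ["11001", "00010", "00000"],
]
k = [2, 1]  # gate 1 has two SET sites, gate 2 has one (assumed)

susc = susceptibilities(sest, k)
ser = sum(susc)
```

Here gate 1 produces an error in 5 of its 6 (SIB, SET) pairs and gate 2 in 2 of its 3, so the two gates are not equally worth protecting.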

slide-14
SLIDE 14

Duplication of Sensitive Gates

  • Using a duplication-based soft error tolerant ABMM:

1. Gates are removed from the first level of the duplicate in increasing order of their soft error susceptibility
2. Fan-outs of removed gates are driven by the corresponding gate in the original ABMM
3. Area & soft error tolerance are updated accordingly

[Figure: example circuit annotated "Gates removed" and "Drives fan-outs of removed gates"; Cost: 87%, Tolerance: 68%]
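The removal order in step 1 can be sketched as a greedy loop. The slide gives only the ordering, so the stopping rule used here (stop once the duplicate fits an area budget) is an assumption, as are the gate names, areas, and susceptibility values and the function name `prune_duplicate`.

```python
def prune_duplicate(gates, area_budget):
    """Remove least-susceptible first-level gates from the duplicate until it fits the budget.

    gates: list of (name, area, susceptibility) tuples for the duplicate's first level.
    Returns (removed names, kept names, remaining area).
    """
    kept = sorted(gates, key=lambda g: g[2])  # increasing susceptibility
    removed = []
    area = sum(g[1] for g in kept)
    while kept and area > area_budget:
        name, gate_area, _ = kept.pop(0)
        removed.append(name)  # its fan-outs are now driven by the original's gate
        area -= gate_area
    return removed, [g[0] for g in kept], area


# Hypothetical first-level gates: (name, area, susceptibility).
gates = [("g1", 3, 0.40), ("g2", 2, 0.05), ("g3", 4, 0.75), ("g4", 2, 0.10)]
removed, kept, area = prune_duplicate(gates, area_budget=8)
```

With these numbers the two least susceptible gates (g2, then g4) are dropped and the most susceptible ones stay replicated.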

slide-15
SLIDE 15

Duplication of Complete Sensitive Logic Cones

  • Output/State logic cones also have an asymmetric susceptibility:
  • Select a subset that meets an area target & whose replication maximizes the number of tolerated pairs of SIBs & SETs
  • Modeled as an ILP, for m SIBs, p SETs, r state/output lines, and 1 ≤ k ≤ 2^r - 1:

\[ \mathit{Tol}(Y_k, i, j) = \begin{cases} 1, & \text{if } Y_k \cdot VT(\mathit{sest}[i,j]) = 0 \\ 0, & \text{if } Y_k \cdot VT(\mathit{sest}[i,j]) > 0 \end{cases} \]

\[ \text{Maximize } \sum_{i=1}^{m} \sum_{j=1}^{p} \mathit{Tol}(Y_k, i, j), \quad \text{subject to: (i) } C_k < \mathit{Cost}, \quad \text{(ii) } X_s \in \{0, 1\}, \text{ for } 1 \le r \le s \]

[Figure: example circuit; Y0 & x have complete protection, Y1 & w are not protected; Cost: 60%, Tolerance: 47%]
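For a small number of state/output lines, the selection problem can be explored by plain enumeration of the 2^r - 1 non-empty subsets, which is used here in place of an ILP solver. The sketch rests on assumptions: each sest entry is a bit vector over the r lines, a (SIB, SET) pair counts as tolerated when every erroneous line belongs to a replicated cone, and the per-cone costs, the table, and the names `tolerated` and `best_subset` are hypothetical.

```python
from itertools import combinations


def tolerated(selected, entry):
    """Assumed rule: tolerated if every line flagged in the entry is in the replicated set."""
    return all(bit == "0" or idx in selected for idx, bit in enumerate(entry))


def best_subset(sest, cone_cost, budget):
    """Enumerate all non-empty subsets of cones; keep the feasible one tolerating most pairs."""
    r = len(cone_cost)
    best = (None, -1, 0)  # (subset, tolerated pairs, cost); subset stays None if none is feasible
    for size in range(1, r + 1):
        for subset in combinations(range(r), size):
            cost = sum(cone_cost[i] for i in subset)
            if cost >= budget:  # constraint (i): cost must stay below the target
                continue
            score = sum(tolerated(subset, e) for row in sest for e in row)
            if score > best[1]:
                best = (subset, score, cost)
    return best


# Hypothetical data: 2 SIBs x 3 SETs over r = 3 state/output lines.
sest = [
    ["100", "000", "001"],
    ["110", "010", "000"],
]
cone_cost = [4, 3, 5]
subset, score, cost = best_subset(sest, cone_cost, budget=8)
```

Enumeration is exponential in r, so it only serves to make the objective and the cost constraint concrete on toy inputs.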

slide-16
SLIDE 16

Duplication of Partial Sensitive Logic Cones

  • Combines the previous two approaches:
  • Explores the asymmetric susceptibility of gates and output/state lines

[Figure: example circuit; Y0 & x have partial protection, Y1 & w are not protected; annotation "Drives fan-out of removed gate"; Cost: 50%, Tolerance: 24%]

slide-17
SLIDE 17

Experimental Results

[Figure: 2-level ABMMs, circuit p1; axes: Area Overhead (%) and Soft Error Protection (%)]

  • Achieved tolerance is commensurate with the area overhead
  • The partial logic cones mitigation method is consistently better

slide-18
SLIDE 18

Experimental Results

[Figure: multi-level ABMMs (new release by Columbia Univ.), circuit p1; axes: Area Overhead (%) and Soft Error Protection (%); marked point: Cost: 70%, Tolerance: 84%]

Multi-level implementation significantly improves the trade-off between area overhead & achieved soft error tolerance

slide-19
SLIDE 19

Summary

  • Soft error tolerance in ABMMs
      • Duplication-based solution that improves upon TMR
      • Cross-coupled C-element structure for state-line protection
  • Soft error mitigation in ABMMs
      • Enables exploration of the trade-off between the achieved soft error tolerance and the incurred area overhead
      • Driven by soft error susceptibility estimation via a hazard-aware asynchronous fault simulator (SPIN-SIM)
      • Yields 3 progressively more powerful partial duplication options