A Systems Approach to Computing in Beyond CMOS Fabrics


SLIDE 1


A Systems Approach to Computing in Beyond CMOS Fabrics

A. D. Patil, N. R. Shanbhag, L. R. Varshney, E. Pop, H.-S. P. Wong, S. Mitra, J. Rabaey, J. Weldon, L. Pileggi, S. Manipatruni, D. Nikonov, and I. A. Young

www.sonic-center.org

SLIDE 2

Machines are Beating Humans at Complex Inference Tasks …

recognition, prediction, classification, decision-making, …

(Figure: DATA → INFORMATION; image-recognition examples. [The Guardian, May 2017])

… the energy cost is HUGE

Extracting information from data


SLIDE 3

Existing Solutions are Reaching their Limits

  • diminishing energy-delay benefits from CMOS scaling
  • variations dominate
  • von Neumann architecture mismatched to inference-workload requirements
  • data movement problem (memory wall)


[Pop-NanoResearch-2010]

(Figure: conventional memory-processor interface: row decoder, WL driver, precharge, sense amplifiers (SA), L:1 column muxes, input buffer (Q), and a K-b bus to a digital processor that produces the decision z. Bottlenecks arise at the device, memory-interface, and architecture levels.)

SLIDE 4

fundamental question

how do we design intelligent machines that operate at the limits of energy?


SLIDE 5

Shannon-inspired: stochastic channel; Brain-inspired: stochastic neural fabric

Key: embrace stochasticity of nanoscale fabrics via Shannon/brain- inspired models of computation

Research Inspiration

reliable systems operating at the limits of energy-efficiency employ stochastic components & statistical models of computation


SLIDE 6

(Figure: an estimator/detector combines N noisy observations y1, y2, …, yN into a corrected output x̂, optimizing an application-derived metric, e.g. the MAP rule ẑ = arg max Q(z | z1, …, zN).)

Principles of Statistical Information Processing; Prototypes in CMOS & Beyond CMOS

Statistical Information Processing

A principled approach to non-von Neumann computing

A Journey from Systems-to-Devices: from information-based metrics, fundamental limits, and design principles, to Shannon/brain-inspired architectures, to stochastic nanofunctions, models, and prototypes.

fundamental limits on energy efficiency

(Figure: deep in-memory architecture: functional-read (FR) row decoders, bit-line processors (BLP) next to the sense amplifiers, a cross bit-line processor (CBLP), and ADC & RDL, producing the decision directly from the memory array.)

SLIDE 7

(Figure: Statistical Information Processing at the center, linking application requirements, Shannon & brain-inspired models of computing, algorithms & architectures, devices & circuits, nanofunctions, and beyond-CMOS devices into prototypes in beyond CMOS.)

SLIDE 8

Systems in Beyond CMOS using Shannon & Brain-inspired Models of Computing

  • hyperdimensional (HD) computing via 3D VRRAM (brain-inspired)
  • Shannon-inspired spintronics


SLIDE 9

Hyperdimensional (HD) Computing

  • almost all vectors are nearly-orthogonal
  • requires random mapping
  • related to Shannon’s random codes

(Figure: input → HD mapper → HD algebra → output, operating on hypervectors: HD representations of ≈10,000 bits rather than conventional few-bit words.)

T = B + C + D  (T is similar to each of B, C, and D)
T = B ∗ C ∗ D  (T is dissimilar to all of B, C, and D)

MAP (Multiply-Add-Permute) operations:

  • orthogonalization via local Multiply (XOR)
  • summarization via local Add (OR)
  • ordered summarization via Permutation


[Rabaey, Olshausen, Mitra, Wong]

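The MAP algebra above can be sketched in software. A minimal sketch, with function names and the 10,000-bit dimension chosen by us; the slide lists a local OR for Add, while a bitwise majority is used here (a common bundling choice) so the similarity properties are easy to verify numerically:

```python
import random

# Sketch of MAP (Multiply-Add-Permute) algebra on binary hypervectors.
D = 10_000
rng = random.Random(0)

def rand_hv():
    """Random hypervector; on the 3D VRRAM fabric these bits would come
    from stochastic RRAM SET events with P_SET ~ 50%."""
    return [rng.randint(0, 1) for _ in range(D)]

def bind(a, b):
    """Local Multiply (XOR): orthogonalizes its inputs."""
    return [x ^ y for x, y in zip(a, b)]

def bundle(a, b, c):
    """Summarization via bitwise majority of three hypervectors."""
    return [1 if x + y + z >= 2 else 0 for x, y, z in zip(a, b, c)]

def permute(a, k=1):
    """Permutation (cyclic shift): ordered summarization."""
    return a[-k:] + a[:-k]

def hamming(a, b):
    """Normalized Hamming distance (0.5 means unrelated)."""
    return sum(x != y for x, y in zip(a, b)) / D

B, C, Dv = rand_hv(), rand_hv(), rand_hv()
d_sum = hamming(bundle(B, C, Dv), B)       # similar: well below 0.5
d_bind = hamming(bind(bind(B, C), Dv), B)  # dissimilar: close to 0.5
print(d_sum, d_bind)
```

For i.i.d. random bits, the bundled vector disagrees with B only when C and D both do, giving an expected distance of 0.25, while the bound vector sits near 0.5.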

SLIDE 10

RRAM Stochasticity for Random Mapping

Low-resistance state: 1; high-resistance state: 0

(Figure: RRAM cell between top and bottom electrodes, with stochastic oxygen-vacancy (VO) filament formation. With an applied voltage VTE < VSET, switching is probabilistic: PSET < 1.)

1001100111……0100111101 (hyper-vector)

PSET: SET probability (switching from ‘0’ to ‘1’)

Binary bits statistically produced/stored in RRAMs

(Plot: measured PSET, from 0% to 100%, as a function of pulse amplitude (0.7–1.1 V) and pulse width (10^2–10^4 ns); experimental data, with the 50% operating point marked.)

[H. Li, …, H.-S. P. Wong, IEDM, 2016]
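The statistical bit generation above can be abstracted as a Bernoulli source. A minimal sketch, with the function name ours and p_set treated as a free parameter (on hardware it would be tuned to ~50% via the measured pulse amplitude/width):

```python
import random

# Bernoulli abstraction of stochastic RRAM SET: each cell switches
# (stores a 1) independently with probability p_set.
def program_hypervector(n_bits, p_set, rng):
    return [1 if rng.random() < p_set else 0 for _ in range(n_bits)]

rng = random.Random(42)
hv = program_hypervector(10_000, p_set=0.5, rng=rng)
balance = sum(hv) / len(hv)
print(balance)  # near 0.5: a balanced random hypervector
```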

SLIDE 11

3D VRRAM as In-Memory Computing Kernels

VRRAM: Vertical Resistive Random Access Memory

(Figure: 3D VRRAM structure: four stacked layers L1–L4 with TiN/Ti top electrodes (TE) and TiN bottom electrodes (BE), a FinFET select transistor in a 1T-4R configuration, and pillar/plane electrodes carrying the word line (WL), bit line (BL), and select line (SL). TEM cross-section: 50 nm pillar, 20 nm TiN, 50 nm TiN/Ti. Fabricated by NDL, Taiwan.)

SLIDE 12

In-Memory MAP Kernels using 3D RRAM

[H. Li,…, H.-S. P. Wong, IEDM, 2016]

(Figure: measured in-memory MAP kernels on 3D RRAM. Multiplication (XOR), addition, and permutation are demonstrated via resistance-based logic evaluation: inputs AB select the pillar address, 200 ns VDD/gnd pulses drive layers L1–L4, and results are read out as resistance per logic-evaluation cycle (measured HRS 400 kΩ–1 MΩ, LRS ~10 kΩ) or as current over successive addition cycles (bit patterns 1111 → 0111 → 0011 → 0001 → 0000; bit 1 up, bit 0 down).)

SLIDE 13
3D VRRAM In-Memory HD Architecture

  • Device-architecture co-design for 3D VRRAM-centric HD computing
  • MAP operations of HD computing mapped onto 3D VRRAM arrays
  • >400× area reduction (28-nm node) compared with digital design [1]

Algorithm → Architecture → Device mapping (language-identification example): input texts → 3-letter sequences → compute trigrams (MAP) → generate (learn) one language/text map per text; training, then inference: measure Hamming distance (HamD) and identify the 'nearest' map (binding via MAP, summarization via addition). One-shot learning. Layer mapping: Letter (3 layers), Trigram (5 layers), XOR (1 layer), LangMap (6 layers), HamD measure (21 layers); 4 kb × 36 layers in total. Random HD vectors are generated by sampling/projection with PSET ≈ 50%.

[1] A. Rahimi et al., ISLPED, p. 64, 2016

Total area vs. HD vector size (simulated):

  HD vector size    28 nm LP digital (µm^2)    3D VRRAM (µm^2)    reduction
  1 kb              9.16E5                     2223               412×
  2 kb              1.78E6                     2691               660×
  10 kb                                        3394

3D VRRAM component breakdown (cell array, SA, MUX, decoder), with routing overheads of +1699.7, +2081.5, and +2158.1 µm^2 for 1 kb, 2 kb, and 10 kb, respectively.

SLIDE 14

Shannon-inspired Spintronics

  • But first, we need system-friendly abstract models of spin devices

[Nikonov-JXCDC-2015]

(Plot: CMOS devices operate in the deterministic regime; spin devices naturally operate in the stochastic regime.)

  • Can we make all-spin logic competitive with CMOS?
  • deterministic nanomagnet switching costs much energy & delay
  • need to design spintronic systems in the stochastic regime

[Patil, Shanbhag, Manipatruni, Nikonov, Young, MMM-Intermag’16, arXiv’17]


SLIDE 15

Systems-friendly Nanomagnetic Model

𝜗: switching error rate; F: switching energy; j: current-overdrive factor; J_crit: critical supply current [Butler-TMAG-2012].

The model ties the switching error rate to the switching energy: 𝜗 falls roughly exponentially in F, while F(j) grows as j^2 · J_crit^2, so relaxing the target error rate directly reduces the switching energy.

40× reduction in switching energy as 𝜗 is relaxed from 10^-15 to 10^-1

energy numbers from [Manipatruni, et al., Physical Review Applied’16]

(Chart: per-stage switching-energy reductions of 9×, 9.4×, 6×, and 7× as 𝜗 is progressively relaxed toward 𝜗 = 0.5.)

SLIDE 16

The 𝝑-Noisy Spin Gate Model

  • includes input dependence
  • enables evaluation of complex networks

𝜄 ∈ {0,1}: Bernoulli random variable

Pr(𝜄 = 1) = 𝜗

(Figure: a 𝜗-noisy non-volatile AND gate, modeled as an error-free gate followed by a virtual gate emulating the 𝜗-noise: while the gate is ON and switching occurs, the stored output updates as C(t+1) = (A(t) AND B(t)) ⊕ 𝜄; while the gate is OFF, C(t+1) = C(t).)
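The 𝜗-noisy gate model can be simulated directly. A minimal Monte-Carlo sketch (function names are ours) checking that the gate's empirical error rate matches 𝜗:

```python
import random

# theta-noisy AND gate: an error-free AND whose output is flipped by a
# Bernoulli(theta) error, emulating stochastic nanomagnet switching.
def noisy_and(a, b, theta, rng):
    out = a & b
    if rng.random() < theta:  # iota = 1 with probability theta
        out ^= 1
    return out

rng = random.Random(0)
theta = 0.1
trials = 100_000
errors = sum(noisy_and(1, 1, theta, rng) != 1 for _ in range(trials))
rate = errors / trials
print(rate)  # empirical error rate tracks theta
```

Feeding such gates into a netlist simulator is what "enables evaluation of complex networks" of stochastic spin devices.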

SLIDE 17

Shannon-inspired Statistical Error Compensation (SEC)

(Figure: the main block produces a sparse error distribution; a low-complexity estimator produces a dense but small-magnitude error distribution.)

Desired property: disparity between the error probability mass functions of the two blocks.

A fusion block computes ẑ, the MAP estimate of z. The error compensator is robust down to small gate error rates 𝜗, and efficient when m > r and l > n (the compensator is far smaller than the main block).

[Zhang, Shanbhag, IEEE Trans. Signal Processing, 2016] [Abdallah, Shanbhag, IEEE J. solid-state circuits, 2013] [Gonugondla, Shim, Shanbhag, ICASSP, 2016]

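The disparity property can be demonstrated numerically. A minimal sketch in the spirit of algorithmic noise tolerance: the fusion rule below (threshold comparison against a reduced-precision estimate) is one common instantiation, not necessarily the exact scheme of the cited papers, and the error statistics and threshold are illustrative:

```python
import random

# SEC-style fusion: the main block makes sparse, large errors; a cheap
# estimator makes dense, small ones; keep whichever output is plausible.
def fuse(y_main, y_est, tau):
    """Trust the main block unless it strays too far from the estimate."""
    return y_main if abs(y_main - y_est) < tau else y_est

rng = random.Random(1)
true_vals = [rng.uniform(-1, 1) for _ in range(10_000)]
fused = []
for v in true_vals:
    # sparse error distribution: rare, large-magnitude deviations
    y_main = v + (rng.choice([-1, 1]) * 0.8 if rng.random() < 0.05 else 0.0)
    # dense error distribution: always-on, small-magnitude noise
    y_est = v + rng.gauss(0, 0.02)
    fused.append(fuse(y_main, y_est, tau=0.3))
mse = sum((f - v) ** 2 for f, v in zip(fused, true_vals)) / len(fused)
print(mse)
```

Because the two error distributions barely overlap, the fused MSE lands far below the uncorrected value of about 0.05 · 0.8².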

SLIDE 18

Enhancing Disparity in Error Probability Mass Functions

Path Delay Balancing (PDB) Path Delay Redistribution (PDR)

Error distribution at the output of a 15-bit RCA: with all path delays equal; after PDB; after PDB & PDR. A "maximally" slow network becomes a "minimally" error-prone network, without an energy increase, and generates a sparse error distribution.

(Plots: error probability vs. error magnitude for the three cases.)

SLIDE 19

Seizure Detection using a Support Vector Machine (SVM)

[Verma-JSSC-2010]

SVM classifier: decide via the sign of x^T y + c, i.e. x^T y + c ≶ 0 yields A = 1 or A = 0.

y: feature vector extracted from EEG signals; x: trained weight vector; c: trained scalar bias; A: decision, A = 1 ⇒ seizure

CHB-MIT EEG dataset

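The decision rule above is a one-liner. A minimal sketch; the weights, bias, and feature vectors below are made-up placeholders, not values trained on the CHB-MIT dataset:

```python
# Linear SVM decision rule from the slide: A = 1 iff x^T y + c > 0.
def svm_decide(x, y, c):
    score = sum(xi * yi for xi, yi in zip(x, y)) + c
    return 1 if score > 0 else 0  # A = 1 => seizure

x = [0.4, -0.2, 0.7]   # trained weight vector (hypothetical)
c = -0.1               # trained scalar bias (hypothetical)
print(svm_decide(x, [1.0, 0.5, 0.2], c))  # 1 (seizure)
print(svm_decide(x, [0.1, 0.9, 0.0], c))  # 0 (no seizure)
```

The dot product x^T y is exactly the kind of kernel that the deep in-memory architecture evaluates inside the array.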

SLIDE 20

Simulation Results

RPE-EST: Reduced precision embedded estimator

  Block               Gate count
  Main block          52.8k
  Error compensator   5.608k

SLIDE 21

Fundamental Limits on Energy, Delay and Accuracy


[Shannon, 1948]

  • use abstract models of nanoscale fabrics to
  • obtain fundamental limits on information-processing capacity, and
  • extract design principles to approach those limits (what's possible, what's impossible, and the grey area in between)

SLIDE 22


Shannon Limits on Energy for ASL Networks

(for ε-reliable ASL networks)

network of 𝜗-noisy gates; energy-vs-error-rate model:

ε = ε₀ · exp(−c · e_g)   (fits spintronics)

Fundamental limit: the total energy of an ε-reliable ASL network is lower-bounded in terms of the network size and the binary entropy h(ε), yielding a design principle for trading error rate against energy.

[Chatterjee and Varshney, Inf. Th. Apps (ITA) Wkshp 2016]
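The energy-vs-error-rate model can be inverted to see how much per-gate energy a target reliability demands. A minimal sketch; eps0 and c are illustrative placeholders, not values fitted to spintronic data:

```python
import math

# Inverting eps = eps0 * exp(-c * e_g) gives the per-gate energy needed
# for a target error rate: e_g = ln(eps0 / eps) / c.
def gate_energy(eps_target, eps0=0.5, c=1.0):
    return math.log(eps0 / eps_target) / c

e_strict = gate_energy(1e-15)   # near-deterministic operation
e_relaxed = gate_energy(1e-1)   # stochastic-regime operation
ratio = e_strict / e_relaxed
print(ratio)  # relaxing reliability buys a large energy reduction
```

Under this model the energy grows only logarithmically in 1/ε, which is why Shannon-inspired compensation (tolerating a large ε per gate) pays off so strongly.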
SLIDE 23

(Figure: the Statistical Information Processing diagram of Slide 7, repeated: application requirements, Shannon & brain-inspired models of computing, algorithms & architectures, devices & circuits, nanofunctions, and beyond-CMOS devices, leading to prototypes in beyond CMOS.)

SLIDE 24

Nanofunctions

  • Move the level of abstraction from a switch to a nanofunction
  • A nanofunction is:
    • more complex than a switch, but
    • simple enough to be fabricated in emerging technologies, with
    • behavior derived from system considerations, and realizations typically mixed-signal
    • e.g., projection kernels (dot product), distance kernels, squashing functions
  • Benefits:
    • expands the nanodevice & system design space
    • incorporates the "infrastructure" needed to make devices useful, e.g., interconnect
    • provides a suitable sandbox for system and device researchers
  • Challenges:
    • identifying a minimal set for the application space

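The example kernels named above compose naturally into inference primitives. A minimal sketch; the function names and the one-neuron composition are ours, standing in for the mixed-signal realizations:

```python
import math

# A minimal nanofunction set: projection (dot-product) kernel, distance
# kernel, and squashing function, composed into one inference step.
def dot_kernel(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def distance_kernel(a, b):
    return sum(abs(ai - bi) for ai, bi in zip(a, b))  # L1 distance

def squash(v):
    return math.tanh(v)  # saturating nonlinearity

# One neuron's inference step composed from nanofunctions:
out = squash(dot_kernel([0.5, -0.25], [1.0, 2.0]))
print(out)  # tanh(0.5 - 0.5) = 0.0
```

Treating each function, rather than each transistor-like switch, as the unit of abstraction is what expands the device-system co-design space.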

SLIDE 25

Devices-to-Nanofunctions, Nanofunctions-to-Systems

(Figure: example pairings of devices, nanofunctions, and systems: Restricted Boltzmann Machines on resistive memories; graphene dot-product kernels; nanofunctions that emit a function output plus a nanoflag output; deep NNs on MTJ-based random number generators; stochastic spin models (spin-torque transfer: supply, ferromagnets, conducting channel, insulating partition, input/output magnets) for a Support Vector Machine; a CNFET machine-learning accelerator core; graphene MUX logic; and oscillatory NNs on RRAM oscillators.)

[Weldon, Pileggi] [Mitra, Wong] [Pop, Grover] [Shanbhag, Young (Intel)] [Wong] [Weldon, Pileggi] [Pileggi]

SLIDE 26

Paving a Path for Device Scaling

(Chart: complexity and energy vs. device error rate.)

[S. Manipatruni, D. Nikonov, I. Young, N. Shanbhag]

Statistical Information Processing

SLIDE 27

Key Take-Aways

  • device-systems co-research
  • 'move up' from a switch to nanofunctions
  • use Shannon/brain-inspired models of computation
  • Statistical Information Processing: a principled approach to non-von Neumann computing for end-of-roadmap CMOS and beyond

(Figure: nanofunctions, beyond-CMOS devices, Shannon & brain-inspired models of computing, and application requirements.)

SLIDE 28

A Bridge from Systems-to-Devices

(Figure: a bridge from systems to devices: Applications → Architectures → Circuits → Devices.)

Thank You!