A Systems Approach to Computing in Beyond CMOS Fabrics


SLIDE 1


A Systems Approach to Computing in Beyond CMOS Fabrics

A. D. Patil, N. R. Shanbhag, L. R. Varshney, E. Pop, H.-S. P. Wong, S. Mitra, J. Rabaey, J. Weldon, L. Pileggi, S. Manipatruni, D. Nikonov, and I. A. Young

www.sonic-center.org

SLIDE 2

Machines are Beating Humans at Complex Inference Tasks …

recognition, prediction, classification, decision-making, …

(Figure: DATA → INFORMATION; image-recognition examples. [The Guardian, May 2017])

… the energy cost is HUGE

Extracting information from data


SLIDE 3

Existing Solutions are Reaching their Limits

  • diminishing energy-delay benefits from CMOS scaling
  • variations dominate
  • von Neumann architecture mismatched to inference-workload requirements
  • data movement problem (memory wall)


[Pop-NanoResearch-2010]

(Figure: conventional memory-processor interface: row decoder, WL driver, precharge, sense amplifiers (SA), L:1 column muxes, input buffer (Q), and a K-b bus to a digital processor that produces the decision z. Bottlenecks arise at the device, memory-interface, and architecture levels.)

SLIDE 4

fundamental question

how do we design intelligent machines that operate at the limits of energy?


SLIDE 5

Shannon-inspired: stochastic channel; Brain-inspired: stochastic neural fabric

Key: embrace stochasticity of nanoscale fabrics via Shannon/brain- inspired models of computation

Research Inspiration

reliable systems operating at the limits of energy-efficiency employ stochastic components & statistical models of computation


SLIDE 6

(Figure: an estimator/detector combines N noisy observations y1, y2, …, yN into a corrected output x̂, optimizing an application-derived metric, e.g. the MAP rule ẑ = arg max Q(z | z1, …, zN).)

Principles of Statistical Information Processing; Prototypes in CMOS & Beyond CMOS

Statistical Information Processing

A principled approach to non-von Neumann computing

A Journey from Systems-to-Devices: from information-based metrics, fundamental limits, and design principles, to Shannon/brain-inspired architectures, to stochastic nanofunctions, models, and prototypes.

fundamental limits on energy efficiency

(Figure: deep in-memory architecture: functional-read (FR) row decoders, bit-line processors (BLP) next to the sense amplifiers, a cross bit-line processor (CBLP), and ADC & RDL, producing the decision directly from the memory array.)

SLIDE 7

(Figure: Statistical Information Processing at the center, linking application requirements, Shannon & brain-inspired models of computing, algorithms & architectures, devices & circuits, nanofunctions, and beyond-CMOS devices into prototypes in beyond CMOS.)

SLIDE 8

Systems in Beyond CMOS using Shannon & Brain-inspired Models of Computing

  • hyperdimensional (HD) computing via 3D VRRAM (brain-inspired)
  • Shannon-inspired spintronics


SLIDE 9

Hyperdimensional (HD) Computing

  • almost all vectors are nearly-orthogonal
  • requires random mapping
  • related to Shannon’s random codes

(Figure: input → HD mapper → HD algebra → output, operating on hypervectors: HD representations of ≈10,000 bits rather than conventional few-bit words.)

T = B + C + D  (T is similar to each of B, C, and D)
T = B ∗ C ∗ D  (T is dissimilar to all of B, C, and D)

MAP (Multiply-Add-Permute) operations:

  • orthogonalization via local Multiply (XOR)
  • summarization via local Add (OR)
  • ordered summarization via Permutation


[Rabaey, Olshausen, Mitra, Wong]

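The MAP algebra above can be sketched in software. A minimal sketch, with function names and the 10,000-bit dimension chosen by us; the slide lists a local OR for Add, while a bitwise majority is used here (a common bundling choice) so the similarity properties are easy to verify numerically:

```python
import random

# Sketch of MAP (Multiply-Add-Permute) algebra on binary hypervectors.
D = 10_000
rng = random.Random(0)

def rand_hv():
    """Random hypervector; on the 3D VRRAM fabric these bits would come
    from stochastic RRAM SET events with P_SET ~ 50%."""
    return [rng.randint(0, 1) for _ in range(D)]

def bind(a, b):
    """Local Multiply (XOR): orthogonalizes its inputs."""
    return [x ^ y for x, y in zip(a, b)]

def bundle(a, b, c):
    """Summarization via bitwise majority of three hypervectors."""
    return [1 if x + y + z >= 2 else 0 for x, y, z in zip(a, b, c)]

def permute(a, k=1):
    """Permutation (cyclic shift): ordered summarization."""
    return a[-k:] + a[:-k]

def hamming(a, b):
    """Normalized Hamming distance (0.5 means unrelated)."""
    return sum(x != y for x, y in zip(a, b)) / D

B, C, Dv = rand_hv(), rand_hv(), rand_hv()
d_sum = hamming(bundle(B, C, Dv), B)       # similar: well below 0.5
d_bind = hamming(bind(bind(B, C), Dv), B)  # dissimilar: close to 0.5
print(d_sum, d_bind)
```

For i.i.d. random bits, the bundled vector disagrees with B only when C and D both do, giving an expected distance of 0.25, while the bound vector sits near 0.5.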

SLIDE 10

RRAM Stochasticity for Random Mapping

Low-resistance state: 1; high-resistance state: 0

(Figure: RRAM cell between top and bottom electrodes, with stochastic oxygen-vacancy (VO) filament formation. With an applied voltage VTE < VSET, switching is probabilistic: PSET < 1.)

1001100111……0100111101 (hyper-vector)

PSET: SET probability (switching from ‘0’ to ‘1’)

Binary bits statistically produced/stored in RRAMs

(Plot: measured PSET, from 0% to 100%, as a function of pulse amplitude (0.7–1.1 V) and pulse width (10^2–10^4 ns); experimental data, with the 50% operating point marked.)

[H. Li, …, H.-S. P. Wong, IEDM, 2016]
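The statistical bit generation above can be abstracted as a Bernoulli source. A minimal sketch, with the function name ours and p_set treated as a free parameter (on hardware it would be tuned to ~50% via the measured pulse amplitude/width):

```python
import random

# Bernoulli abstraction of stochastic RRAM SET: each cell switches
# (stores a 1) independently with probability p_set.
def program_hypervector(n_bits, p_set, rng):
    return [1 if rng.random() < p_set else 0 for _ in range(n_bits)]

rng = random.Random(42)
hv = program_hypervector(10_000, p_set=0.5, rng=rng)
balance = sum(hv) / len(hv)
print(balance)  # near 0.5: a balanced random hypervector
```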

SLIDE 11

3D VRRAM as In-Memory Computing Kernels

VRRAM: Vertical Resistive Random Access Memory

(Figure: 3D VRRAM structure: four stacked layers L1–L4 with TiN/Ti top electrodes (TE) and TiN bottom electrodes (BE), a FinFET select transistor in a 1T-4R configuration, and pillar/plane electrodes carrying the word line (WL), bit line (BL), and select line (SL). TEM cross-section: 50 nm pillar, 20 nm TiN, 50 nm TiN/Ti. Fabricated by NDL, Taiwan.)

SLIDE 12

In-Memory MAP Kernels using 3D RRAM

[H. Li,…, H.-S. P. Wong, IEDM, 2016]

(Figure: measured in-memory MAP kernels on 3D RRAM. Multiplication (XOR), addition, and permutation are demonstrated via resistance-based logic evaluation: inputs AB select the pillar address, 200 ns VDD/gnd pulses drive layers L1–L4, and results are read out as resistance per logic-evaluation cycle (measured HRS 400 kΩ–1 MΩ, LRS ~10 kΩ) or as current over successive addition cycles (bit patterns 1111 → 0111 → 0011 → 0001 → 0000; bit 1 up, bit 0 down).)

SLIDE 13
3D VRRAM In-Memory HD Architecture

  • Device-architecture co-design for 3D VRRAM-centric HD computing
  • MAP operations of HD computing mapped onto 3D VRRAM arrays
  • >400× area reduction (28-nm node) compared with digital design [1]

Algorithm → Architecture → Device mapping (language-identification example): input texts → 3-letter sequences → compute trigrams (MAP) → generate (learn) one language/text map per text; training, then inference: measure Hamming distance (HamD) and identify the 'nearest' map (binding via MAP, summarization via addition). One-shot learning. Layer mapping: Letter (3 layers), Trigram (5 layers), XOR (1 layer), LangMap (6 layers), HamD measure (21 layers); 4 kb × 36 layers in total. Random HD vectors are generated by sampling/projection with PSET ≈ 50%.

[1] A. Rahimi et al., ISLPED, p. 64, 2016

Total area vs. HD vector size (simulated):

  HD vector size    28 nm LP digital (µm^2)    3D VRRAM (µm^2)    reduction
  1 kb              9.16E5                     2223               412×
  2 kb              1.78E6                     2691               660×
  10 kb                                        3394

3D VRRAM component breakdown (cell array, SA, MUX, decoder), with routing overheads of +1699.7, +2081.5, and +2158.1 µm^2 for 1 kb, 2 kb, and 10 kb, respectively.

SLIDE 14

Shannon-inspired Spintronics

  • But first, we need system-friendly abstract models of spin devices

[Nikonov-JXCDC-2015]

(Plot: CMOS devices operate in the deterministic regime; spin devices naturally operate in the stochastic regime.)

  • Can we make all-spin logic competitive with CMOS?
  • deterministic nanomagnet switching costs much energy & delay
  • need to design spintronic systems in the stochastic regime

[Patil, Shanbhag, Manipatruni, Nikonov, Young, MMM-Intermag’16, arXiv’17]


SLIDE 15

Systems-friendly Nanomagnetic Model

𝜗: switching error rate; F: switching energy; j: current-overdrive factor; J_crit: critical supply current [Butler-TMAG-2012].

The model ties the switching error rate to the switching energy: 𝜗 falls roughly exponentially in F, while F(j) grows as j^2 · J_crit^2, so relaxing the target error rate directly reduces the switching energy.

40× reduction in switching energy as 𝜗 is relaxed from 10^-15 to 10^-1

energy numbers from [Manipatruni, et al., Physical Review Applied’16]

(Chart: per-stage switching-energy reductions of 9×, 9.4×, 6×, and 7× as 𝜗 is progressively relaxed toward 𝜗 = 0.5.)

SLIDE 16

The 𝝑-Noisy Spin Gate Model

  • includes input dependence
  • enables evaluation of complex networks

𝜄 ∈ {0,1}: Bernoulli random variable

Pr(𝜄 = 1) = 𝜗

(Figure: a 𝜗-noisy non-volatile AND gate, modeled as an error-free gate followed by a virtual gate emulating the 𝜗-noise: while the gate is ON and switching occurs, the stored output updates as C(t+1) = (A(t) AND B(t)) ⊕ 𝜄; while the gate is OFF, C(t+1) = C(t).)
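The 𝜗-noisy gate model can be simulated directly. A minimal Monte-Carlo sketch (function names are ours) checking that the gate's empirical error rate matches 𝜗:

```python
import random

# theta-noisy AND gate: an error-free AND whose output is flipped by a
# Bernoulli(theta) error, emulating stochastic nanomagnet switching.
def noisy_and(a, b, theta, rng):
    out = a & b
    if rng.random() < theta:  # iota = 1 with probability theta
        out ^= 1
    return out

rng = random.Random(0)
theta = 0.1
trials = 100_000
errors = sum(noisy_and(1, 1, theta, rng) != 1 for _ in range(trials))
rate = errors / trials
print(rate)  # empirical error rate tracks theta
```

Feeding such gates into a netlist simulator is what "enables evaluation of complex networks" of stochastic spin devices.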

SLIDE 17

Shannon-inspired Statistical Error Compensation (SEC)

(Figure: the main block produces a sparse error distribution; a low-complexity estimator produces a dense but small-magnitude error distribution.)

Desired property: disparity between the error probability mass functions of the two blocks.

A fusion block computes ẑ, the MAP estimate of z. The error compensator is robust down to small gate error rates 𝜗, and efficient when m > r and l > n (the compensator is far smaller than the main block).

[Zhang, Shanbhag, IEEE Trans. Signal Processing, 2016] [Abdallah, Shanbhag, IEEE J. solid-state circuits, 2013] [Gonugondla, Shim, Shanbhag, ICASSP, 2016]

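The disparity property can be demonstrated numerically. A minimal sketch in the spirit of algorithmic noise tolerance: the fusion rule below (threshold comparison against a reduced-precision estimate) is one common instantiation, not necessarily the exact scheme of the cited papers, and the error statistics and threshold are illustrative:

```python
import random

# SEC-style fusion: the main block makes sparse, large errors; a cheap
# estimator makes dense, small ones; keep whichever output is plausible.
def fuse(y_main, y_est, tau):
    """Trust the main block unless it strays too far from the estimate."""
    return y_main if abs(y_main - y_est) < tau else y_est

rng = random.Random(1)
true_vals = [rng.uniform(-1, 1) for _ in range(10_000)]
fused = []
for v in true_vals:
    # sparse error distribution: rare, large-magnitude deviations
    y_main = v + (rng.choice([-1, 1]) * 0.8 if rng.random() < 0.05 else 0.0)
    # dense error distribution: always-on, small-magnitude noise
    y_est = v + rng.gauss(0, 0.02)
    fused.append(fuse(y_main, y_est, tau=0.3))
mse = sum((f - v) ** 2 for f, v in zip(fused, true_vals)) / len(fused)
print(mse)
```

Because the two error distributions barely overlap, the fused MSE lands far below the uncorrected value of about 0.05 · 0.8².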

SLIDE 18

Enhancing Disparity in Error Probability Mass Functions

Path Delay Balancing (PDB) Path Delay Redistribution (PDR)

Error distribution at the output of a 15-bit RCA: with all path delays equal; after PDB; after PDB & PDR. A "maximally" slow network becomes a "minimally" error-prone network, without an energy increase, and generates a sparse error distribution.

(Plots: error probability vs. error magnitude for the three cases.)

SLIDE 19

Seizure Detection using a Support Vector Machine (SVM)

[Verma-JSSC-2010]

SVM classifier: decide via the sign of x^T y + c, i.e. x^T y + c ≶ 0 yields A = 1 or A = 0.

y: feature vector extracted from EEG signals; x: trained weight vector; c: trained scalar bias; A: decision, A = 1 ⇒ seizure

CHB-MIT EEG dataset

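The decision rule above is a one-liner. A minimal sketch; the weights, bias, and feature vectors below are made-up placeholders, not values trained on the CHB-MIT dataset:

```python
# Linear SVM decision rule from the slide: A = 1 iff x^T y + c > 0.
def svm_decide(x, y, c):
    score = sum(xi * yi for xi, yi in zip(x, y)) + c
    return 1 if score > 0 else 0  # A = 1 => seizure

x = [0.4, -0.2, 0.7]   # trained weight vector (hypothetical)
c = -0.1               # trained scalar bias (hypothetical)
print(svm_decide(x, [1.0, 0.5, 0.2], c))  # 1 (seizure)
print(svm_decide(x, [0.1, 0.9, 0.0], c))  # 0 (no seizure)
```

The dot product x^T y is exactly the kind of kernel that the deep in-memory architecture evaluates inside the array.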

SLIDE 20

Simulation Results

RPE-EST: Reduced precision embedded estimator

  Block               Gate count
  Main block          52.8k
  Error compensator   5.608k

SLIDE 21

Fundamental Limits on Energy, Delay and Accuracy


[Shannon, 1948]

  • use abstract models of nanoscale fabrics to
  • obtain fundamental limits on information-processing capacity, and
  • extract design principles to approach those limits (what's possible, what's impossible, and the grey area in between)

SLIDE 22


Shannon Limits on Energy for ASL Networks

(for ε-reliable ASL networks)

network of 𝜗-noisy gates; energy-vs-error-rate model:

ε = ε₀ · exp(−c · e_g)   (fits spintronics)

Fundamental limit: the total energy of an ε-reliable ASL network is lower-bounded in terms of the network size and the binary entropy h(ε), yielding a design principle for trading error rate against energy.

[Chatterjee and Varshney, Inf. Th. Apps (ITA) Wkshp 2016]
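The energy-vs-error-rate model can be inverted to see how much per-gate energy a target reliability demands. A minimal sketch; eps0 and c are illustrative placeholders, not values fitted to spintronic data:

```python
import math

# Inverting eps = eps0 * exp(-c * e_g) gives the per-gate energy needed
# for a target error rate: e_g = ln(eps0 / eps) / c.
def gate_energy(eps_target, eps0=0.5, c=1.0):
    return math.log(eps0 / eps_target) / c

e_strict = gate_energy(1e-15)   # near-deterministic operation
e_relaxed = gate_energy(1e-1)   # stochastic-regime operation
ratio = e_strict / e_relaxed
print(ratio)  # relaxing reliability buys a large energy reduction
```

Under this model the energy grows only logarithmically in 1/ε, which is why Shannon-inspired compensation (tolerating a large ε per gate) pays off so strongly.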
SLIDE 23

(Figure: the Statistical Information Processing diagram of Slide 7, repeated: application requirements, Shannon & brain-inspired models of computing, algorithms & architectures, devices & circuits, nanofunctions, and beyond-CMOS devices, leading to prototypes in beyond CMOS.)

SLIDE 24

Nanofunctions

  • Move the level of abstraction from a switch to a nanofunction
  • A nanofunction is:
    • more complex than a switch, but
    • simple enough to be fabricated in emerging technologies, with
    • behavior derived from system considerations, and realizations typically mixed-signal
    • e.g., projection kernels (dot product), distance kernels, squashing functions
  • Benefits:
    • expands the nanodevice & system design space
    • incorporates the "infrastructure" needed to make devices useful, e.g., interconnect
    • provides a suitable sandbox for system and device researchers
  • Challenges:
    • identifying a minimal set for the application space

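The example kernels named above compose naturally into inference primitives. A minimal sketch; the function names and the one-neuron composition are ours, standing in for the mixed-signal realizations:

```python
import math

# A minimal nanofunction set: projection (dot-product) kernel, distance
# kernel, and squashing function, composed into one inference step.
def dot_kernel(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def distance_kernel(a, b):
    return sum(abs(ai - bi) for ai, bi in zip(a, b))  # L1 distance

def squash(v):
    return math.tanh(v)  # saturating nonlinearity

# One neuron's inference step composed from nanofunctions:
out = squash(dot_kernel([0.5, -0.25], [1.0, 2.0]))
print(out)  # tanh(0.5 - 0.5) = 0.0
```

Treating each function, rather than each transistor-like switch, as the unit of abstraction is what expands the device-system co-design space.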

SLIDE 25

Devices-to-Nanofunctions, Nanofunctions-to-Systems

(Figure: example pairings of devices, nanofunctions, and systems: Restricted Boltzmann Machines on resistive memories; graphene dot-product kernels; nanofunctions that emit a function output plus a nanoflag output; deep NNs on MTJ-based random number generators; stochastic spin models (spin-torque transfer: supply, ferromagnets, conducting channel, insulating partition, input/output magnets) for a Support Vector Machine; a CNFET machine-learning accelerator core; graphene MUX logic; and oscillatory NNs on RRAM oscillators.)

[Weldon, Pileggi] [Mitra, Wong] [Pop, Grover] [Shanbhag, Young (Intel)] [Wong] [Weldon, Pileggi] [Pileggi]

SLIDE 26

Paving a Path for Device Scaling

(Chart: complexity and energy vs. device error rate.)

[S. Manipatruni, D. Nikonov, I. Young, N. Shanbhag]

Statistical Information Processing

SLIDE 27

Key Take-Aways

  • device-systems co-research
  • 'move up' from a switch to nanofunctions
  • use Shannon/brain-inspired models of computation
  • Statistical Information Processing: a principled approach to non-von Neumann computing for end-of-roadmap CMOS and beyond

(Figure: nanofunctions, beyond-CMOS devices, Shannon & brain-inspired models of computing, and application requirements.)

SLIDE 28

A Bridge from Systems-to-Devices

(Figure: a bridge from systems to devices: Applications → Architectures → Circuits → Devices.)

Thank You!