RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, STT-MRAM BASED COMPUTING

SLIDE 1

Xiaochen Guo, Engin Ipek, and Tolga Soyata

Rochester Computer Systems Architecture Laboratory

RESISTIVE COMPUTATION: AVOIDING THE POWER WALL WITH LOW-LEAKAGE, STT-MRAM BASED COMPUTING

SLIDE 2

Multicore Scaling Limited by Power

6/21/12


 Traditional MOSFET scaling theory relies on reducing VDD in proportion to device dimensions
 VDD has scaled very slowly since 90nm
 Multicore scaling severely challenged by power

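$P = P_{\text{dynamic}} + P_{\text{static}} = N (C_{\text{eff}} V_{DD}^{2} f + I_{\text{leak}} V_{DD})$

$P_{\text{dynamic}} = N (C_{\text{eff}} V_{DD}^{2} f)$

$I_{\text{leak}} \propto e^{-V_{th}}$

[Figure annotations; the terms they label are not recoverable: 2x, 2x, 1.4x, 1.4x, 1.4x]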
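
The power expression on this slide can be evaluated directly. The sketch below is only an illustration: the function name `chip_power` and every numeric value are assumptions chosen for round numbers, not figures from the talk. It shows that with VDD held constant, doubling the core count N doubles both the dynamic and the static term.

```python
def chip_power(n_cores, c_eff, v_dd, freq, i_leak):
    """Return (dynamic, static, total) power in watts for N identical cores.

    Implements P = N * (C_eff * V_DD^2 * f + I_leak * V_DD).
    """
    p_dynamic = n_cores * c_eff * v_dd ** 2 * freq
    p_static = n_cores * i_leak * v_dd
    return p_dynamic, p_static, p_dynamic + p_static


# Illustrative values only (assumptions, not taken from the slides):
# 1 nF effective capacitance, 1.0 V supply, 2 GHz clock, 0.5 A leakage per core.
base = chip_power(n_cores=8, c_eff=1e-9, v_dd=1.0, freq=2e9, i_leak=0.5)
doubled = chip_power(n_cores=16, c_eff=1e-9, v_dd=1.0, freq=2e9, i_leak=0.5)

# With V_DD unchanged, twice the cores means twice the power.
print(base, doubled)
```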

SLIDE 3

Our Approach: Resistive Computation


 Opportunity: spin-torque transfer magnetoresistive RAM (STT-MRAM)
 Near-zero leakage power
 Low-energy read operation
 Goal: selectively migrate on-chip storage and combinational logic to STT-MRAM to reduce power

 On-chip storage

 Caches, TLBs, RF, queues

 Combinational logic

 Lookup-table (LUT) based computing

SLIDE 4

STT-MRAM


 Desirable properties

 CMOS compatibility
 Read speed as fast as SRAM
 Density comparable to DRAM
 Unlimited write endurance
 Key challenge: expensive writes
 Long switching latency (6.7ns @ 32nm)
 High switching energy (0.3pJ/bit @ 32nm)
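
To put the write cost in perspective, the two figures above can be turned into cycles and energy per word. This is a rough sketch: the clock frequency and word width are assumptions, and the helper names `write_cycles` and `write_energy_pj` are hypothetical. Only the 6.7 ns and 0.3 pJ/bit constants come from the slide.

```python
import math

WRITE_LATENCY_NS = 6.7          # switching latency at 32nm (from the slide)
WRITE_ENERGY_PJ_PER_BIT = 0.3   # switching energy at 32nm (from the slide)


def write_cycles(clock_ghz):
    """Clock cycles one cell write occupies at an assumed clock frequency."""
    return math.ceil(WRITE_LATENCY_NS * clock_ghz)


def write_energy_pj(bits_switched):
    """Energy in pJ to switch the given number of bits."""
    return WRITE_ENERGY_PJ_PER_BIT * bits_switched


# At an assumed 2 GHz clock, ceil(6.7 * 2.0) = 14 cycles per write.
print(write_cycles(2.0))
# Switching all 64 bits of an assumed 64-bit word.
print(write_energy_pj(64))
```

A write that spans many cycles is why the later slides spend so much effort on buffering and on avoiding subbank conflicts.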

[Figure: STT-MRAM cell with an MTJ and an access transistor; the polarity of Vwrite selects Value = 1 or Value = 0, and Vread is applied to read the cell]

SLIDE 5

Switching Time vs. Cell Size

 Faster switching with wider access transistors

+ Faster writes
- Slower reads
- Lower density
- Higher read energy

[Figure: switching time vs. cell size; labeled design points: "RF, L1D$" and "L2$, L1I$, LUTs, TLBs, MC Queues"]

SLIDE 6

RAM Arrays and Lookup Tables

Fundamental Building Blocks

SLIDE 7

STT-MRAM Arrays

 Problem: low write throughput
 Existing solutions incur high overheads to sustain adequate write throughput in STT-MRAM arrays

[Figure panels: Multiporting; Banking]

SLIDE 8

STT-MRAM Arrays

 CMOS subbank buffers
 Latch in addr/data and release H-tree; complete write locally
 Allow forwarding from ongoing writes
 Facilitate local differential writes
 Reads access subbank via exclusive read port
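
The buffer behavior listed above can be sketched as a small cycle-level model. This is a toy illustration under stated assumptions, not the authors' design: the class name `SubbankBuffer`, the single-entry buffer depth, and the three-cycle write latency are all hypothetical. It shows a write being latched, a read forwarded from the in-flight write, and a differential write that only switches bits that differ.

```python
class SubbankBuffer:
    """Toy model of a CMOS buffer in front of one STT-MRAM subbank."""

    def __init__(self, write_cycles):
        self.write_cycles = write_cycles  # cycles a cell write takes (assumed)
        self.cells = {}                   # addr -> stored word (the MRAM array)
        self.pending = None               # (addr, data, cycles_left) or None

    def write(self, addr, data):
        """Latch addr/data so the shared H-tree can be released right away.

        Returns False if a write is already in flight (a buffer conflict).
        """
        if self.pending is not None:
            return False
        if self.cells.get(addr, 0) == data:
            return True  # differential write: no bit changes, nothing to switch
        self.pending = (addr, data, self.write_cycles)
        return True

    def flipped_bits(self):
        """Number of bits the in-flight differential write has to switch."""
        if self.pending is None:
            return 0
        addr, data, _ = self.pending
        return bin(self.cells.get(addr, 0) ^ data).count("1")

    def read(self, addr):
        """Reads use their own port; forward from an ongoing write on a match."""
        if self.pending is not None and self.pending[0] == addr:
            return self.pending[1]
        return self.cells.get(addr, 0)

    def tick(self):
        """Advance one cycle; commit the write locally once it finishes."""
        if self.pending is None:
            return
        addr, data, left = self.pending
        if left <= 1:
            self.cells[addr] = data
            self.pending = None
        else:
            self.pending = (addr, data, left - 1)


buf = SubbankBuffer(write_cycles=3)
buf.write(0x10, 0b1010)
print(buf.read(0x10))      # value forwarded from the ongoing write
print(buf.flipped_bits())  # bits that differ from the old contents
```

A second write to the same subbank while one is in flight returns `False` here, which corresponds to the subbank buffer conflicts that the later slides try to avoid.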

SLIDE 9

STT-MRAM LUTs [Suzuki09, Matsunaga08]

 Store truth tables of logic functions directly in STT-MRAM
 Benefits
 Leakage confined to peripheral circuitry
 Low-power (low-swing) lookups
 Fast lookups using sense amp
 Logic functions with many minterms can utilize LUTs effectively
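
Storing a truth table and indexing it with the input bits can be sketched in a few lines. The example is a software analogy only: the helpers `build_lut` and `lookup` are hypothetical names, and the one-bit full adder is chosen here for illustration rather than taken from the slides.

```python
from itertools import product


def build_lut(func, n_inputs):
    """Tabulate func over every input combination (the stored truth table)."""
    return [func(*bits) for bits in product((0, 1), repeat=n_inputs)]


def lookup(lut, *bits):
    """Index the table with the input bits, most significant bit first."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return lut[index]


def full_adder_sum(a, b, cin):
    return a ^ b ^ cin


def full_adder_carry(a, b, cin):
    return (a & b) | (cin & (a ^ b))


SUM_LUT = build_lut(full_adder_sum, 3)
CARRY_LUT = build_lut(full_adder_carry, 3)

# 1 + 1 + 1 = binary 11: sum bit 1, carry bit 1.
print(lookup(SUM_LUT, 1, 1, 1), lookup(CARRY_LUT, 1, 1, 1))
# Number of minterms (entries equal to 1) in each table.
print(sum(SUM_LUT), sum(CARRY_LUT))
```

The table size depends only on the number of inputs, not on how many minterms the function has, which is one way to read the last bullet above.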

SLIDE 10

Case Study: 3-bit Adder

SLIDE 11

Pipeline Organization

SLIDE 12

Hybrid CMT Pipeline

 Small arrays and simple logic in CMOS
 Large arrays and complex logic in STT-MRAM

SLIDE 13

Front End

 LUT-based carry-select adder to compute PC+4
 LUT-based front-end thread selection logic
 SRAM-based refill queue to avoid I$ conflicts
 Predecode and back-end thread selection with MRAM-related stall conditions
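
A carry-select adder built from lookup tables can be sketched as follows. The block width, the 32-bit PC width, and all names (`carry_select_add`, `next_pc`, `LUT_CIN0`, `LUT_CIN1`) are assumptions made for illustration; the slides do not give the actual organization. Each block looks up its result for carry-in 0 and carry-in 1 independently, and the incoming carry only selects between the two.

```python
BLOCK_BITS = 4   # assumed block width
WORD_BITS = 32   # assumed PC width
MASK = (1 << BLOCK_BITS) - 1


def _block_lut(cin):
    """Table of (sum, carry_out) for every operand pair at a fixed carry-in."""
    table = {}
    for a in range(1 << BLOCK_BITS):
        for b in range(1 << BLOCK_BITS):
            total = a + b + cin
            table[(a, b)] = (total & MASK, total >> BLOCK_BITS)
    return table


LUT_CIN0 = _block_lut(0)
LUT_CIN1 = _block_lut(1)


def carry_select_add(a, b):
    """Add two WORD_BITS-wide values; the result wraps modulo 2**WORD_BITS."""
    result, carry = 0, 0
    for shift in range(0, WORD_BITS, BLOCK_BITS):
        key = ((a >> shift) & MASK, (b >> shift) & MASK)
        # Both candidates are looked up without waiting for the carry ...
        sum0, cout0 = LUT_CIN0[key]
        sum1, cout1 = LUT_CIN1[key]
        # ... and the carry only drives the final select (a mux in hardware).
        block_sum, carry = (sum1, cout1) if carry else (sum0, cout0)
        result |= block_sum << shift
    return result


def next_pc(pc):
    """PC + 4 using the LUT-based carry-select adder."""
    return carry_select_add(pc, 4)


print(hex(next_pc(0x0000FFFC)))
```

In this software loop the blocks are visited in order, but the two lookups per block do not depend on the carry, which is the property a hardware carry-select adder exploits to evaluate all blocks in parallel.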

SLIDE 14

Register File

 Architectural registers of all threads aggregated in a unified STT-MRAM array to amortize subbank buffers
 Registers of a single thread striped across subbanks to reduce subbank buffer conflicts
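
One possible striping scheme is sketched below. The mapping, the function name `locate_register`, and the default sizes (8 subbanks, 32 registers per thread) are assumptions for illustration; the slides do not specify the actual address mapping. Consecutive registers of a thread land in different subbanks, so back-to-back writes from one thread are less likely to hit a busy subbank buffer.

```python
def locate_register(thread, reg, num_subbanks=8, regs_per_thread=32):
    """Map (thread, architectural register) to (subbank, row) in a unified array."""
    if not 0 <= reg < regs_per_thread:
        raise ValueError("register index out of range")
    # Rows each thread occupies inside every subbank (ceiling division).
    rows_per_thread = -(-regs_per_thread // num_subbanks)
    subbank = reg % num_subbanks                      # stripe across subbanks
    row = thread * rows_per_thread + reg // num_subbanks
    return subbank, row


# Registers r0..r7 of thread 0 fall into eight different subbanks.
print([locate_register(0, r)[0] for r in range(8)])
```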

SLIDE 15

Floating-Point Unit

Operation         STT-MRAM FPU    CMOS FPU
Add, Sub, Mult    24 cycles       12 cycles
Div               64 cycles       64 cycles

SLIDE 16

Memory System

 Use store buffers to avoid L1 D$ subbank conflicts
 L1s optimized for fast writes using 30F² cells
 L2 and memory controllers optimized for density using 10F² cells
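
Cell sizes quoted in F² translate to physical area once a feature size F is fixed. The sketch below assumes F = 32 nm, the node quoted for the write latency and energy earlier in the deck, and counts cells only, ignoring peripheral circuitry. The helper names and the example capacity are hypothetical.

```python
def cell_area_um2(area_in_f2, feature_nm):
    """Area of one cell in square micrometres, given its size in units of F^2."""
    feature_um = feature_nm / 1000.0
    return area_in_f2 * feature_um ** 2


def array_area_mm2(capacity_bytes, area_in_f2, feature_nm):
    """Cell-only area of an array in mm^2 (peripheral circuitry not included)."""
    bits = capacity_bytes * 8
    return bits * cell_area_um2(area_in_f2, feature_nm) / 1e6


fast_cell = cell_area_um2(30, 32)    # write-optimized cell
dense_cell = cell_area_um2(10, 32)   # density-optimized cell
print(fast_cell, dense_cell, fast_cell / dense_cell)

# Cell-only area of an assumed 4 MB array built from 10F^2 cells.
print(array_area_mm2(4 * 1024 * 1024, 10, 32))
```

The 3x area ratio between the two cell sizes is the density cost of choosing the faster-writing cell.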

SLIDE 17

Evaluation

SLIDE 18

Performance

SLIDE 19

Power

SLIDE 20

Contributions and Findings

 New technique to reduce leakage and dynamic power in a deep-submicron microprocessor
 Selectively migrate on-chip storage and combinational logic from CMOS to STT-MRAM
 Use subbank buffers to alleviate long write latency
 STT-MRAM is an attractive low-power solution beyond 32nm
 Dramatically lower leakage power
 Modest loss in performance