RF-Interconnect and its Applications to NoC Design

Frank Chang, Jason Cong and Glenn Reinman

E-mails: mfchang@ee.ucla.edu, cong@cs.ucla.edu, Glenn.reinman@cs.ucla.edu

NOCS Tutorial Course, May 10, 2009, San Diego, California

RF-Interconnect

Outline

  • Future Network-on-Chip (NoC) needs and development trends
  • Traditional baseband-interconnect constraints
  • Multiband RF-Interconnect (RF-I) advantages
    – Scalability in latency, energy/bit, data rate (Gbps/link) and overhead (area/Gb)
    – On-chip demonstrations
    – Off-chip demonstrations
    – Remaining technology challenges

  • Potential RF-I system applications

Current Trend in CMP

  • 65nm CMOS 80-tile NoC
  • 10x8 2D mesh network-on-chip running @ 4GHz
  • Bisection bandwidth 256GB/s
  • 1 TFLOPS @ 1V, about 98W
  • Needs a total of 75 Clk cycles from the lower left corner to the upper right corner

ISSCC 2007: An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS (Sriram Vangal et al., Intel)


Future CMP Development

Trends:

  • Heterogeneous/domain-specific system architecture
  • Many-core massive parallel data processing
  • System integration in deep-scaled CMOS technology
  • Low supply voltage with sub-Vth digital operation

Issues:

  • Performance increasingly dependent on inter-core or inter-system communications

Scaling of Traditional Interconnect

  • Scaling reduces delay of logic gates but not of wires
  • Latency is RC limited (~L^2)
  • Using CMOS repeaters reduces latency (~L) but receives no benefit from scaling
  • Even low swing signaling requires extensive equalization
  • Waste of broad bandwidth available from modern CMOS devices (fT>150GHz, fmax>250GHz)
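The quadratic versus linear length dependence above can be illustrated with a first-order model. This is a minimal sketch, assuming the textbook 0.38·R·C approximation for the 50% delay of a distributed RC line and ignoring the repeaters' own gate delay. The per-mm wire values are the 32nm figures quoted later in this tutorial; the function names and the 1mm segment length are hypothetical.

```python
# First-order wire-delay model (a sketch, not a sign-off timing model).
R_PER_MM = 306.0      # ohm/mm, 32nm wire resistance quoted later in this tutorial
C_PER_MM = 315e-15    # F/mm, 32nm wire capacitance quoted later in this tutorial


def unrepeated_delay(length_mm):
    """50% delay of a distributed RC line: ~0.38 * R_total * C_total (grows as L^2)."""
    return 0.38 * (R_PER_MM * length_mm) * (C_PER_MM * length_mm)


def repeated_delay(length_mm, segment_mm=1.0):
    """Wire delay when repeaters split the line into fixed-length segments (grows as L).

    Assumption: repeater gate delay is ignored, so only per-segment wire delay is summed.
    """
    segments = length_mm / segment_mm
    return segments * unrepeated_delay(segment_mm)


for length in (5.0, 10.0, 20.0):
    print(f"{length:4.0f} mm: unrepeated {unrepeated_delay(length) * 1e9:6.2f} ns, "
          f"repeated {repeated_delay(length) * 1e9:6.3f} ns")
```

Doubling the length quadruples the unrepeated delay but only doubles the repeated one, which is the (~L^2) versus (~L) behavior named in the bullets.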


Baseband Interconnect Issues

  • Latency is large across the chip
  • Bandwidth is RC limited (~1Gbps/wire)
  • Communication pattern is fixed (non-reconfigurable)
  • Energy consumption is high and not scalable (~10pJ/bit/cm)
  • At 22nm technology, the total network power using buffers can be as high as 150W*
  • Future microprocessors may encounter communication congestion, and most of the energy will be spent on “talking” instead of computing

*“Research Challenges for On-Chip Interconnection Networks,” IEEE Micro, 2007

Communication Challenges

  • On-Chip Issues
    – # Cores in Chip-Multiprocessor (CMP) growing
      • Increasing bandwidth demand on interconnect
    – Wires scaling poorly compared to transistors
      • Increased latency to communicate between distant points on CMP
  • Off-chip limited by chip-to-chip, board-to-board, board-to-backplane communications
  • Requirements on future interconnect
    – Scalable, reliable
    – Support high traffic volume with low latency
    – Constrained by
      • Power
      • Silicon Area
      • Cost (compatibility with mainstream CMOS technology)

How Can RF Help?

  • fT will exceed 600GHz at 16nm and fmax will even approach 1THz!
  • Millimeter-wave CMOS circuits have been developed for 60GHz and recently for 324GHz bands*
  • Incredible bandwidth will be available in the future, but most people neglect it!
  • EM waves travel at the (effective) speed of light (~7ps/mm in silicon)

*Huang, LaRocca and Chang, “324GHz CMOS Frequency Generator Using Linear Superposition Technique,” 2008 ISSCC, pp. 476-477

UCLA 90nm CMOS VCO at 324GHz (ISSCC 2008*)

[Figure: measured VCO output spectrum, Pout (dBm) versus Frequency (GHz), with the frequency axis running from 323.038 to about 324.0GHz]

CMOS Voltage Controlled Oscillator, measured with a subharmonic mixer driven by an 80GHz synthesizer local oscillator. The mixing frequency is (fVCO - 4*fLO) = fIF, or fVCO - 4*(80GHz) = 3.5GHz, yielding fVCO = 323.5GHz. On-wafer VCO test setup at JPL.

CMOS VCO designed by Frank Chang’s group at UCLA, fabricated in a 90nm process.

*Huang, D., LaRocca, T., Chang, M.-C. F., “324GHz CMOS Frequency Generator Using Linear Superposition Technique,” IEEE International Solid-State Circuits Conference (ISSCC), pp. 476-477, Feb. 2008, San Francisco, CA


4xf0 by Linear Superposition

Communication beyond Baseband

  • Ultra-high carrier frequencies can be generated and modulated by modern CMOS to enable simultaneous multiband communications with higher aggregate data rate
  • On/off-chip transmission lines and off-chip near-field antennas can guide waves (RF modulated data) from transmitter to receiver with recoverable attenuation over short distances (<30cm)


Multiband Communications

  • Carrier frequencies can be generated and modulated using modern CMOS to enable simultaneous multiband communications with higher aggregate data rate
  • Higher carrier frequencies can avoid baseband digital noise and cause less frequency dispersion across the band

RF-Interconnect Concept

[Figure: bi-directional bus carrying multiple frequency bands]

Advantages:

  • Higher combined data rate
  • Simultaneous, bi-directional communications
  • Re-configurable between bands
  • Low in-band coupling for parallel bus
  • Potentially with fewer I/O pins and smaller routing area


Differential Transmission Line

  • Loss of 1.5dB/mm over 100GHz of bandwidth

Differential TL: IBM 90nm process
Width: 3um
Spacing: 3um
Total thickness: two top metals (M8, M7) = 1µm
Metal resistivity: 0.0424Ohm/Sq

[Figure: cross-section of the differential line in metals M8 and M7, labeled with 3um and 0.5um dimensions]

Multiband FDMA-Interconnect

  • In TX, each mixer up-converts an individual baseband stream into a specific frequency band (or channel)
  • N different data streams (N=6 in the exemplary figure) may transmit simultaneously on the shared transmission medium to achieve higher aggregate data rates
  • In RX, individual signals are down-converted by a mixer and recovered after a low-pass filter
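The TX/RX steps above can be sketched numerically. This is an illustration only, under several assumptions that are not from the slides: data is mapped to +/-1 before mixing, each carrier completes a whole number of cycles per bit, and the low-pass filter is modeled as an average over each bit period. The carrier values, the oversampling factor and the function names are hypothetical.

```python
import math

SAMPLES_PER_BIT = 64            # assumed oversampling of each bit period
CYCLES_PER_BIT = [4, 8, 12]     # hypothetical carriers: whole cycles per bit, one per channel


def transmit(streams):
    """Up-convert each +/-1 baseband stream with its own mixer and sum onto one shared line."""
    num_bits = len(streams[0])
    line = [0.0] * (num_bits * SAMPLES_PER_BIT)
    for bits, cycles in zip(streams, CYCLES_PER_BIT):
        for n in range(len(line)):
            symbol = 1.0 if bits[n // SAMPLES_PER_BIT] else -1.0
            line[n] += symbol * math.cos(2 * math.pi * cycles * n / SAMPLES_PER_BIT)
    return line


def receive(line, channel):
    """Down-convert one channel with a mixer, then low-pass by summing over each bit period."""
    cycles = CYCLES_PER_BIT[channel]
    bits = []
    for start in range(0, len(line), SAMPLES_PER_BIT):
        acc = 0.0
        for n in range(start, start + SAMPLES_PER_BIT):
            acc += line[n] * math.cos(2 * math.pi * cycles * n / SAMPLES_PER_BIT)
        bits.append(1 if acc > 0 else 0)
    return bits


streams = [[1, 0, 1, 1, 0, 0, 1, 0],
           [0, 0, 1, 0, 1, 1, 1, 0],
           [1, 1, 0, 0, 1, 0, 0, 1]]
line = transmit(streams)
recovered = [receive(line, ch) for ch in range(len(streams))]
print(recovered == streams)
```

All three streams share the single `line` signal, and each receiver recovers only its own channel because the other carriers average out over a bit period.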

[Figure: signal spectrum, signal power versus frequency at several points in the link]

FDMA-Interconnect System

[Figure: FDMA-Interconnect system block diagram with data streams Data1 through DataN and their power spectrum]

Incredible CMOS Bandwidth

[Figure: frequency (GHz, 100 to 1000) versus channel length (nm, 20 to 70) for ft_DRAM, fmax_DRAM, ft_NFET and fmax_NFET]

Technology (nm)                        90    65    45    32    22    16
Number of Cores                         8    16    40    64   120   189
BW (Bisection) [GB/s]                  91   128   202   256   350   440
Chip Power (total) [W]                100   120   144   173   207   249
fmax [GHz]                            200   270   370   480   590   710
fvco [GHz]                            320   432   592   768   944  1136
Max Aggregate Data Rate [Gb/s/wire]   160   216   296   384   472   568

Maximum aggregate data rate for RF-Interconnect can reach 500Gb/s/wire @16nm Tech Node
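The last three rows of the table appear to follow two simple ratios. The sketch below checks them; note that the relations fvco = 1.6 × fmax and data rate = fvco / 2 are inferred from the numbers themselves and are not stated on the slide.

```python
# Rows copied from the table above, one entry per technology node.
nodes_nm = [90, 65, 45, 32, 22, 16]
fmax_ghz = [200, 270, 370, 480, 590, 710]
fvco_ghz = [320, 432, 592, 768, 944, 1136]
rate_gbps_per_wire = [160, 216, 296, 384, 472, 568]

# Inferred relations (assumptions read off the numbers, not stated on the slide):
#   fvco = 1.6 * fmax      and      max aggregate data rate = fvco / 2
# Integer arithmetic avoids floating-point rounding in the comparison.
fvco_matches = all(10 * fvco == 16 * fmax for fmax, fvco in zip(fmax_ghz, fvco_ghz))
rate_matches = all(2 * rate == fvco for fvco, rate in zip(fvco_ghz, rate_gbps_per_wire))

for node, fvco, rate in zip(nodes_nm, fvco_ghz, rate_gbps_per_wire):
    print(f"{node:2d} nm: fvco = {fvco} GHz, max aggregate rate = {rate} Gb/s/wire")
print(fvco_matches, rate_matches)
```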


Advantages of RF-I over Parallel Bus

  • Latency – speed-of-light data transmission
  • Bandwidth – high aggregate data rate through simultaneous transmissions on multiple bands of RF modulated signals
  • Area – avoids extensive use of repeaters
  • Energy – low overall energy per bit
  • Reconfigurability – efficient bidirectional and tunable communications via shared on/off-chip transmission lines or off-chip antennas

RF-Interconnect Demonstrations

  • Off-chip (on-board) simultaneous dual-band communications through RF-Interconnect (ISSCC 05)
  • Inter-layer 3DIC RF-Interconnect (ISSCC 07)
  • On-chip simultaneous generation of multi-band carriers (RFIC 08)
  • On-chip tri-band simultaneous communications (VLSI 09)


Off-Chip FDMA Links (ISSCC 05*)

  • A 2-carrier RF-I provides simultaneous off-chip communication among 4 CMOS chips in 0.18um technology
  • 1 baseband and 1 RF band at 7.4GHz
  • Selectivity between bands is achieved using bandpass or lowpass filtering
  • The RF carrier was modulated using BPSK
  • Using this scheme, simultaneous data rates of (2+2) Gb/s were achieved: 2Gbps in the baseband and 2Gbps in the RF band

*J. Ko, J. Kim, Z. Xu, Q. Gu, C. Chien, and M.F. Chang, “An RF/Baseband FDMA-Interconnect Transceiver for Reconfigurable Multiple Access Chip-to-Chip Communication,” in 2005 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, February 2005

3D-IC Layer-to-Layer RF-Interconnect (ISSCC 07*)

  • 3DIC RF-I in MIT-Lincoln Lab 0.18μm 3DIC technology
  • Data is modulated with amplitude shift keying (ASK) modulation of a 25GHz carrier
  • Low energy per bit: 0.39pJ/bit
  • High data rate: 11Gb/s

(a) Schematic of 3DIC RF-I (b) Eye diagram with 11Gb/s data rate (c) Die photo of the 3DIC RF-I

*Q. Gu, Z. Xu, J. Ko and M.F. Chang, "Two 10Gbps/pin Low Power Interconnect Methods for 3D IC", 2007 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, vol.50, pp.448-449, Feb. 2007, San Francisco, California, USA


Simultaneous Sub-harmonic Injection Locked mm-Wave Frequency Generation* (RFIC 2008)

  • Sub-harmonic injection-locked VCOs simultaneously lock to a single reference frequency
  • Advantages:
    – Eliminate multiple PLLs
    – Low power consumption
    – Small area

[Figure: master VCO, non-linear harmonic generator and slave VCOs]

*Sai-Wang Tam, Eran Socher, Alden Wong, Yu Wang, Lan Vu, M.F. Chang, "Simultaneous Sub-harmonic Injection- Locked mm-Wave Frequency Generators for Multi-band Communications in CMOS", IEEE RFIC Sym., 2008

Sub-harmonic Injection Locked VCO* (RFIC 2008)

  • LC-based VCO core
  • Differential pair for odd harmonic generation
  • Single-ended for even harmonic generation
  • Injection locking to a high harmonic within the locking range of the VCO

This work*:
Process:                        IBM 90nm CMOS
Free running frequency (GHz):   29.3
Max locking range (GHz):        5.6
Locking harmonics:              2nd, 4th, 6th, 8th; 3rd, 5th, 7th
Power (mW):                     4

*Sai-Wang Tam, Eran Socher, Alden Wong, Yu Wang, Lan Vu, M.F. Chang, "Simultaneous Sub-harmonic Injection- Locked mm-Wave Frequency Generators for Multi-band Communications in CMOS", IEEE RFIC Sym., 2008


Simultaneous Sub-harmonic Injection Locked Multi-Frequency Generation (RFIC 2008)

  • Sub-harmonic injection-locked VCOs simultaneously lock to a single master reference frequency
  • Advantages:
    – Eliminate multiple PLLs
    – Low power consumption
    – Small silicon area
  • Demonstrated sub-harmonic 30GHz and 50GHz injection-locked VCOs in IBM 90nm process

[Figure: master VCO, non-linear harmonic generator and slave VCO]

(a) Output spectrum of the 30GHz and 50GHz VCOs simultaneously locked with the same reference source at 9.7GHz (b) Die photo of the 30GHz sub-harmonic injection VCO

RF-I using ASK Modulation

  • TX: A transformer couples the output of the VCO to a simple ASK modulator, which generates the ASK signal
  • RX: The differential-mutual-mixer (self-mixer) acts as the envelope detector. A simple buffer and Schmitt trigger then recover the signal to rail-to-rail swing.
  • No carrier synchronization is needed!
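The reason no carrier synchronization is needed can be seen in a small numerical sketch of envelope detection. This is an illustration under stated assumptions, not the circuit on the slide: squaring stands in for the self-mixer, a per-bit average stands in for the low-pass behavior, and a two-threshold comparison stands in for the Schmitt trigger. The sample counts, thresholds and function names are hypothetical.

```python
import math

SAMPLES_PER_BIT = 64     # assumed oversampling
CYCLES_PER_BIT = 8       # hypothetical carrier: whole cycles per bit period


def ask_modulate(bits, carrier_phase):
    """On-off keying: the carrier is passed for a 1 and switched off for a 0."""
    wave = []
    for i, bit in enumerate(bits):
        for k in range(SAMPLES_PER_BIT):
            n = i * SAMPLES_PER_BIT + k
            angle = 2 * math.pi * CYCLES_PER_BIT * n / SAMPLES_PER_BIT + carrier_phase
            wave.append(bit * math.cos(angle))
    return wave


def envelope_detect(wave, high=0.3, low=0.2):
    """Self-mix (square) the input, average over each bit, then apply hysteresis."""
    bits, state = [], 0
    for start in range(0, len(wave), SAMPLES_PER_BIT):
        chunk = wave[start:start + SAMPLES_PER_BIT]
        level = sum(x * x for x in chunk) / SAMPLES_PER_BIT   # about 0.5 for a 1, 0 for a 0
        if level > high:
            state = 1
        elif level < low:
            state = 0
        bits.append(state)
    return bits


data = [1, 0, 1, 1, 0, 0, 1, 0]
# The receiver never uses carrier_phase: no local oscillator has to be synchronized.
results = [envelope_detect(ask_modulate(data, phase)) for phase in (0.0, 1.0, 2.5)]
print(all(r == data for r in results))
```

The detector recovers the same bits for every transmit phase because the squared signal depends only on the carrier's amplitude.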

Simulated RF-I using ASK Modulation

[Figure: simulated waveforms: 60GHz VCO output, ASK modulated signal, mixer output and 5Gbit/s data input]

Tri-Band On-Chip RF-Interconnect

  • IBM 90nm process
  • 5mm differential transmission line
  • Total 3 channels: 2 RF + 1 baseband
  • Differential mode for RF: 30GHz and 50GHz
  • Common mode for baseband
  • Total aggregate data rate is 10Gb/s

[Figure: 50GHz, 30GHz and baseband transmitters and receivers sharing the transmission line]


Tri-band FDMA-Interconnect Layout


Tri-band On-Chip RF-I Test Results

[Figure: data output waveforms for the 30GHz, 50GHz and baseband channels; output spectrum of the RF bands at 30GHz and 50GHz]

Process:                           IBM 90nm CMOS digital process
Total 3 channels:                  30GHz, 50GHz, baseband
Data rate in each channel:         RF band: 4Gbps; baseband: 2Gbps
Total data rate:                   10Gbps
Bit error rate across all bands:   <10E-9
Latency:                           6ps/mm
Energy per bit (RF):               0.09* pJ/bit/mm
Energy per bit (BB):               0.125pJ/bit/mm

*VCO power (5mW) can be shared by all (many tens of) parallel RF-I links in the NoC and does not burden an individual link significantly.


Inter-channel Modulation in multi-band ASK Transmitter

  • The switch in f2 is able to directly modulate the signal from f1 => causes severe inter-channel interference
  • An additional transformer prevents signal current from flowing through the switch in the other channel => no inter-channel interference through direct modulation

Base Band Common Mode Interconnect*

  • Base band is transmitted in common mode
  • Using capacitive coupling method:
    – Common mode swing is controlled to be about 100mV
    – Low swing saves power

*Jongsun Kim, Ingrid Verbauwhede, Mau-Chung Frank Chang, “A 5.6mW 1-Gbps pair Pulse Signaling Transceiver for a Fully AC Coupled Bus”, JSSC, Vol. 40, No. 6, June 2005


Multi-Band ASK Receiver

  • Differential mode for RF communications
    – Uses inductive coupling with a band-pass characteristic
    – It is able to filter out the undesired channel
  • Common mode for baseband communication
    – Common mode signal is tapped out at the center of the RX transformer loop

Signal to Interference Ratio (SIR)

  • Determine the max effective communication distance using SIR
  • Major source of interference: Coupling from adjacent TL
  • Side walls between TLs effectively suppress the cross-talk

Multi-band ASK RF-I Scaling

Technology   # of Carriers   Data rate per    Total data rate   Power   Energy per   Area (TX+RX)   Area/Gbit
                             carrier (Gb/s)   per wire (Gb/s)   (mW)    bit (pJ)     (mm2)          (µm2/Gbit)
90nm         3RF + 1BB       5                20                20      1.00         0.022          1100
65nm         4RF + 1BB       6                30                25      0.83         0.024          800
45nm         5RF + 1BB       7                42                30      0.71         0.023          540
32nm         6RF + 1BB       8                56                35      0.63         0.021          380
22nm         7RF + 1BB       9                72                40      0.56         0.019          260
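The derived columns of the scaling table can be reproduced from the others. The sketch below assumes, based only on the totals in the table, that the baseband channel carries the same rate as each RF carrier; the function names are hypothetical.

```python
# (technology, RF carriers + 1 baseband, Gb/s per carrier, total Gb/s, power mW, pJ/bit)
rows = [
    ("90nm", 3 + 1, 5, 20, 20, 1.00),
    ("65nm", 4 + 1, 6, 30, 25, 0.83),
    ("45nm", 5 + 1, 7, 42, 30, 0.71),
    ("32nm", 6 + 1, 8, 56, 35, 0.63),
    ("22nm", 7 + 1, 9, 72, 40, 0.56),
]


def total_rate_gbps(channels, rate_per_carrier):
    """Aggregate data rate per wire, assuming every channel carries the per-carrier rate."""
    return channels * rate_per_carrier


def energy_per_bit_pj(power_mw, rate_gbps):
    """mW divided by Gb/s gives pJ/bit (1e-3 W / 1e9 b/s = 1e-12 J/b)."""
    return power_mw / rate_gbps


for tech, channels, per_carrier, total, power, listed in rows:
    computed = energy_per_bit_pj(power, total_rate_gbps(channels, per_carrier))
    print(f"{tech}: {total_rate_gbps(channels, per_carrier)} Gb/s, "
          f"{computed:.3f} pJ/bit (table lists {listed:.2f})")
```

The energy per bit falls with each node because the data rate per wire grows faster than the transceiver power.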

Comparison between Repeated Bus and Multi-band RF-I @ 32nm

Assumptions:

1. 32nm node; 30x repeater, FO4 = 8ps, Rwire = 306Ω/mm, Cwire = 315fF/mm, wire pitch = 0.2um, bus length = 2cm, f_bus = 1GHz, bus width = 96 bytes
2. Repeater area = 0.022mm2
3. Bus physical width = 160um
4. In that width we can fit 13 transmission lines, each with 7 carriers, each carrier carrying 8Gbps

Interconnect length = 2cm          RF-I    Repeated Bus
# of wires                         13      448
Data rate per carrier (Gbit/s)     8       NA
# of carriers                      7       NA
Data rate per wire (Gbit/s)        56      1
Aggregate data rate (Gbit/s)       728     768
Bus physical width (um)            160     160
Transceiver area (mm2)             0.27    0.022
Power (mW)                         455     6144
Energy per bit (pJ/bit)            0.63    8


RF-I for CMP Inter-Core Communications

  • RF-I built on top of the 2D mesh of the CMP NoC facilitates a “super-highway” network for inter-core communications
  • Enables simultaneous multi-band communications by using multiple frequency carriers up to fmax of the super-scaled CMOS device (100-500GHz)
  • Encodes data by phase or amplitude modulation
  • Uses direct coupling between the transmission line and electronic transceivers
  • Enhances performance with scaling (higher aggregate data rates, lower latency, lower energy/bit and lower area consumption/bit)

RF-I enables ultra-high performance CMP with low latency, low energy per bit, high aggregate data rate and bandwidth/route-reconfigurable inter-core and inter-core-memory communications.

[Figure: RF T-line overlaid on single-chip CMP, tapped with T/R circuits]

RC/RF/Optical Interconnect Comparison

  • Comparison across process technology of…
    – Traditional RC parallel bus
    – RF-Interconnect
    – Optical Interconnect
  • As process technology scales toward 22nm…
    – RF-I has the lowest latency
    – RF-I consumes the least energy
    – RF-I has the highest data rate density
  • RF-I is fully compatible with modern CMOS technology


RF-I Fills the Technology Gap between RC Repeater and Optical Interconnect*

  • On-Chip:
    – RC repeater is non-scalable
    – RF-I has better energy efficiency for d > 1mm
  • Off-Chip**:
    – RF-I has better energy efficiency for d < 30cm
    – Overhead of Optical-I is too high
  • RF-I may be the perfect fit for the mid-range interconnect

[Figure: on-chip and off-chip** comparison plots]

*Sai-Wang Tam, et al., "Ultra-Low Power/Latency and Scalable Multiband RF-Interconnect for Reconfigurable," submitted to Proceedings of the IEEE
**H. Cho, et al., “Power comparison between high-speed electrical and optical interconnects for interchip communication,” J. Lightw. Technol., Sep. 2004.

Quick RF-I Summary

  • Bandwidth – high aggregate data rates through simultaneous transmissions of multiple bands with RF carrier modulated signals (324GHz carrier recently realized in 90nm CMOS, Chang et al., 2008 ISSCC)
  • Energy – low overall energy per bit (0.1pJ/bit/mm in 90nm to 0.05pJ/bit/mm in 22nm CMOS)
  • Low overhead – high data rate/wire, low area/Gigabit and low latency due to speed-of-light data transmission
  • Re-configurability – efficient simultaneous communications with adaptive bandwidths via shared on/off-chip transmission medium
  • Total compatibility and scalability with mainstream digital CMOS technology
  • Multicast support – scalable means to communicate from one transmitter to a number of receivers on chip


RF-I Enabled NoC Communication Architecture

Outline

  • Application Diversity and NoC Motivation
  • Adaptive RF-Interconnect in NoC Design
    – RF-I Overlaid on a Mesh NoC
    – Shortcut Selection
  • Architecture implications
    – Performance improvement
    – Power Savings
    – Efficient multicast support
    – Deadlock
  • Conclusions and Future Work

Communication Diversity

  • Diverse communication patterns in parallel applications
    – Different models of parallelism
      • Data decomposition, pipelined parallelism, master/worker, etc.
    – Different data inputs
      • Can vary communication hotspots and bandwidth demand
    – Cache coherence
      • May favor broadcast/multicast
  • Implies applications have different “ideal” NoCs
    – Topology
    – Bandwidth allocation
    – Latency
  • NoC design alternatives
    – Traditional approach: design for the general case
      • High bandwidth links in a uniform topology
    – Our approach: provide bandwidth where it is needed
      • Reconfigurable RF-I flexibly allocates bandwidth

Architectural Considerations for RF-I

  • Opportunities (both on and off chip)
    – High bandwidth communication
      • Data distribution across many-core topologies
      • Vital in keeping many-core designs active
    – Low latency communication
      • Enables users to apply parallel computing to a broader set of applications through faster synchronization and communication
      • Faster cache coherence protocols
    – Reconfigurability
      • Adapt NoC topology/bandwidth to the needs of the individual application
    – Power efficient communication
  • Challenges
    – Frequency arbitration and Tx/Rx tuning
    – Application-specific modeling

Baseline Architecture

[Figure: floorplan of the baseline architecture. R (square) = router, C (circle) = processor core, $ (diamond) = cache bank, + (plus) = main memory interface]

  • 10x10 mesh of pipelined routers
    – NoC runs at 2GHz
    – XY routing
  • 64 4GHz 3-wide processor cores, each containing
    – 8KB L1 data cache
    – 8KB L1 instruction cache
  • 32 L2 cache banks
    – 256KB each
    – Organized as a shared NUCA cache
  • 4 main memory interfaces
    – Labeled with + in the figure


Quantifying Application Diversity

  • For a 100-router (10x10 mesh) configuration:
  • Measures messages sent from a router on the x-axis to a router on the y-axis
  • Legend for the figures on the coming slides
    – Black: no traffic
    – Dark Blue: [1, mean/4)
    – Light Blue: [mean/4, mean/2)
    – White: [mean/2, 2*mean)
    – Orange: [2*mean, 4*mean]
    – Red: (4*mean, inf)
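The legend amounts to a binning rule on each router pair's message count relative to the mean. A minimal sketch follows, assuming message counts are non-negative integers; the function name and the sample mean are hypothetical.

```python
def traffic_color(messages, mean):
    """Map a router-pair message count to the legend color, relative to the mean count."""
    if messages == 0:
        return "black"            # no traffic
    if messages < mean / 4:
        return "dark blue"        # [1, mean/4)
    if messages < mean / 2:
        return "light blue"       # [mean/4, mean/2)
    if messages < 2 * mean:
        return "white"            # [mean/2, 2*mean)
    if messages <= 4 * mean:
        return "orange"           # [2*mean, 4*mean]
    return "red"                  # (4*mean, inf)


mean = 100  # hypothetical mean message count
for count in (0, 10, 30, 150, 400, 401):
    print(count, traffic_color(count, mean))
```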

Messages Sent between Routers: Barnes

[Figure: router-to-router message map for Barnes, with regions of high communication marked]


Messages Sent between Routers: LU

[Figure: router-to-router message map for LU, with regions of high communication marked]


RF-I Physical Organization

  • Physically
    – RF-I is a bundle of transmission lines
    – Connected to and shared between a set of RF-enabled routers
    – Each RF-enabled router has a Tx/Rx pair
    – Single-cycle transmission across a 400mm2 die
    – 16 carrier frequencies per transmission line
  • @32nm, with NoC running @2GHz
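The single-cycle claim can be sanity-checked with time-of-flight arithmetic. This is a rough sketch under assumptions not stated on the slide: the die is square, the worst-case route is a corner-to-corner Manhattan path, and transceiver delay is ignored. The ~7ps/mm figure is the one quoted earlier in this tutorial.

```python
import math

PS_PER_MM = 7.0          # effective propagation delay quoted earlier in this tutorial
DIE_AREA_MM2 = 400.0     # die area from the slide
NOC_CLOCK_GHZ = 2.0      # NoC clock from the slide

side_mm = math.sqrt(DIE_AREA_MM2)        # assumption: square die
worst_path_mm = 2 * side_mm              # assumption: corner-to-corner Manhattan route
flight_ps = worst_path_mm * PS_PER_MM    # transceiver delay is ignored here
cycle_ps = 1000.0 / NOC_CLOCK_GHZ

fits_in_one_cycle = flight_ps < cycle_ps
print(f"time of flight {flight_ps:.0f} ps vs. clock period {cycle_ps:.0f} ps")
```

Under these assumptions the wave crosses the die in well under one 2GHz clock period, leaving margin for modulation and detection.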


RF-Enabled Routers

  • RF-enable 50 routers
    – Represented by GREEN in the figure

[Figure: router R with routing tables and an added 6th port whose RX/TX connect to the transmission line]


RF-I Logical Organization

  • Logically:
    – RF-I behaves as a set of N express channels
    – Each channel is assigned to a src, dest router pair (s,d)
  • Reconfigured by:
    – Remapping shortcuts to match the needs of different applications

[Figure: two logical configurations, LOGICAL A and LOGICAL B]


Architecture-Specific Shortcuts

  • Design-time shortcuts
    – Referred to as static shortcuts in the remainder of this talk
  • Selection criteria
    – Consider an optimization function for a topology:

$$W_{x,y} = \begin{cases} \text{length of shortest-path}(x,y) & \text{if } x \neq y \\ 0 & \text{if } x = y \end{cases}$$

    – We wish to minimize the total cost of the graph G representing the network-on-chip:

$$\text{Total-Cost}(G) = \sum_{\text{all } (x,y)} W_{x,y}$$
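Total-Cost(G) can be computed with one breadth-first search per router. The sketch below assumes every link, including a shortcut, costs one hop; the helper names and the example corner-to-corner shortcut are hypothetical.

```python
from collections import deque


def mesh_edges(rows, cols):
    """Bidirectional links of a rows x cols mesh; router (r, c) has id r * cols + c."""
    adj = {r * cols + c: set() for r in range(rows) for c in range(cols)}
    for r in range(rows):
        for c in range(cols):
            node = r * cols + c
            if c + 1 < cols:
                adj[node].add(node + 1)
                adj[node + 1].add(node)
            if r + 1 < rows:
                adj[node].add(node + cols)
                adj[node + cols].add(node)
    return adj


def shortest_paths_from(adj, src):
    """Breadth-first search: hop counts W[src][y] for every reachable router y."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def total_cost(adj):
    """Sum of W[x][y] over all ordered router pairs (W[x][x] contributes 0)."""
    return sum(sum(shortest_paths_from(adj, x).values()) for x in adj)


mesh = mesh_edges(10, 10)
baseline = total_cost(mesh)
mesh[0].add(99)            # hypothetical unidirectional shortcut from corner to corner
with_shortcut = total_cost(mesh)
print(baseline, with_shortcut)
```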

Shortcut-Selection Constraints

  • Each router should have at most 6 ports
    – A router can be at most one shortcut source and at most one shortcut destination
  • Total of B (budget) unidirectional shortcuts: B = 16
  • For static shortcuts:
    – RF-enable routers which are shortcut srcs/dests
    – At most 16 RF-enabled routers
  • For adaptive shortcuts, shortcut srcs/dests are limited to
    – RF-enabled routers chosen at design-time


Min Total-Cost(NoC): Heuristic 1

(I) For each pair of non-adjacent routers i,j
    – Make a new candidate graph Gi,j with an edge between them
    – Calculate Total-Cost(Gi,j)
    – Record improvement as |Total-Cost(Gi,j) – Total-Cost(G)|
(II) Select shortcut of edge (x,y) such that Gx,y had max improvement
    – Disallow any use of x as a src or y as a dest afterwards
(III) Repeat (I) and (II) until budget B is exhausted

Complexity: O(B·V^5)

[Figure: baseline mesh G and a candidate graph Gi,j with an edge from i to j; Gx,y is the best candidate]
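Steps (I) to (III) can be sketched as a greedy loop. This is an illustrative implementation rather than the authors' code: it assumes unit-cost links, uses breadth-first search for Total-Cost, breaks ties by iteration order, and runs on a small hypothetical 4x4 mesh so that the exhaustive search stays fast.

```python
from collections import deque
from itertools import permutations


def mesh(rows, cols):
    """Bidirectional links of a rows x cols mesh; router ids run row by row."""
    adj = {n: set() for n in range(rows * cols)}
    for n in adj:
        if (n + 1) % cols:                 # east neighbor exists
            adj[n].add(n + 1)
            adj[n + 1].add(n)
        if n + cols < rows * cols:         # south neighbor exists
            adj[n].add(n + cols)
            adj[n + cols].add(n)
    return adj


def total_cost(adj):
    """Sum of shortest-path hop counts over all ordered router pairs."""
    cost = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        cost += sum(dist.values())
    return cost


def heuristic1(adj, budget):
    """Greedily add the unidirectional shortcut with the largest Total-Cost improvement."""
    adj = {n: set(neigh) for n, neigh in adj.items()}    # work on a copy
    used_src, used_dst, shortcuts = set(), set(), []
    for _ in range(budget):
        base = total_cost(adj)
        best, best_gain = None, 0
        for i, j in permutations(adj, 2):
            if i in used_src or j in used_dst or j in adj[i]:
                continue                          # constraint violated or already linked
            adj[i].add(j)                         # candidate graph G_ij
            gain = base - total_cost(adj)
            adj[i].discard(j)
            if gain > best_gain:
                best, best_gain = (i, j), gain
        if best is None:
            break
        adj[best[0]].add(best[1])
        used_src.add(best[0])
        used_dst.add(best[1])
        shortcuts.append(best)
    return shortcuts, total_cost(adj)


base_cost = total_cost(mesh(4, 4))
shortcuts, cost = heuristic1(mesh(4, 4), budget=2)
print(shortcuts, base_cost, cost)
```

Each of the B rounds evaluates on the order of V^2 candidate edges and recomputes all-pairs shortest paths for each one, which is where the high polynomial cost comes from.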

Min Total-Cost(NoC): Heuristic 2

(I) Calculate Wi,j for all pairs i,j in G
    – Record all Wi,j
(II) Select shortcut of edge (x,y) such that Wx,y = max(Wi,j)
    – Which is the graph diameter
    – Disallow any use of x as a src or y as a dest afterwards
(III) Repeat (I) and (II) until budget is exhausted

These shortcuts tend to perform within 1% of those chosen with Heuristic 1.

Complexity: O(B·V^3)
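Heuristic 2 needs only one all-pairs computation per round. The following is a minimal sketch under the same assumptions as before (unit-cost links, a hypothetical 4x4 mesh); tie-breaking by iteration order is an assumption, since the slides do not say how ties among equally distant pairs are resolved.

```python
from collections import deque


def mesh(rows, cols):
    """Bidirectional links of a rows x cols mesh; router ids run row by row."""
    adj = {n: set() for n in range(rows * cols)}
    for n in adj:
        if (n + 1) % cols:                 # east neighbor exists
            adj[n].add(n + 1)
            adj[n + 1].add(n)
        if n + cols < rows * cols:         # south neighbor exists
            adj[n].add(n + cols)
            adj[n + cols].add(n)
    return adj


def hop_counts(adj, src):
    """Breadth-first search: W[src][y] for every reachable router y."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def heuristic2(adj, budget):
    """Repeatedly place a shortcut across the longest remaining shortest path."""
    adj = {n: set(neigh) for n, neigh in adj.items()}    # work on a copy
    used_src, used_dst, shortcuts = set(), set(), []
    for _ in range(budget):
        best, best_hops = None, 1          # a shortcut only helps pairs more than one hop apart
        for x in adj:
            if x in used_src:
                continue
            for y, hops in hop_counts(adj, x).items():
                if y not in used_dst and hops > best_hops:
                    best, best_hops = (x, y), hops
        if best is None:
            break
        x, y = best
        adj[x].add(y)
        used_src.add(x)
        used_dst.add(y)
        shortcuts.append(best)
    return shortcuts


shortcuts = heuristic2(mesh(4, 4), budget=2)
print(shortcuts)
```

On the plain mesh the first shortcut spans opposite corners, which is the graph diameter mentioned in step (II).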


Adaptive RF-I Shortcuts

  • Assume a profile of communication for an application
    – Fi,j = count of messages sent between router i and router j
  • Change the optimization function:

$$\text{Total-Cost}(G) = \sum_{\text{all } (x,y)} F_{x,y} \cdot W_{x,y}$$

  • To offset the effect of removing src/dest routers (already selected) from consideration
    – Alternate router-to-router shortcuts with region-to-region shortcuts
    – Allows placement of shortcuts at routers near a hotspot
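The frequency-weighted cost can be sketched in the same style. The four-router chain and the message profile below are hypothetical and chosen only so the arithmetic is easy to follow; pairs missing from the profile are treated as having zero traffic.

```python
from collections import deque


def hop_counts(adj, src):
    """Breadth-first search: W[src][y] for every reachable router y."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def weighted_total_cost(adj, freq):
    """Sum of F[x][y] * W[x][y]; pairs absent from freq count as zero traffic."""
    cost = 0
    for x in adj:
        for y, hops in hop_counts(adj, x).items():
            cost += freq.get((x, y), 0) * hops
    return cost


# Hypothetical chain of routers 0-1-2-3 and a profile with one hot pair.
chain = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
profile = {(0, 3): 100, (1, 2): 5}
before = weighted_total_cost(chain, profile)
chain[0].add(3)                      # shortcut placed on the hot pair
after = weighted_total_cost(chain, profile)
print(before, after)
```

Because the cost is weighted by message counts, a shortcut on the hot pair removes most of the cost even though it shortens only one route.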


Results on Performance Improvement

  • Static Shortcuts
    – 20% reduction in latency on average
    – 11% increase in NoC power
  • Adaptive Shortcuts with 50 RF-I routers
    – 32% reduction in latency on average
    – 24% increase in NoC power

Power Savings

  • We can thin the baseline mesh links
    – From 16B…
    – …to 8B
    – …to 4B
  • RF-I makes up the difference in performance while saving overall power!
    – RF-I provides bandwidth where most necessary
    – Baseline RC wires supply the rest

[Figure: mesh link widths of 16 bytes, 8 bytes and 4 bytes; router A requires high bandwidth to communicate with B]


Evaluation Methodology

  • Used detailed interconnection network simulator: Garnet [1]
  • Built probabilistic traces
    – To cover different communication patterns that may be exhibited by future applications
  • Leveraged Orion [7], CosiNoC [6], IPEM [4] for power methodology

RF-I Enables Power Savings

[Figure: relative latency and relative power (axes from 0.0 to 1.4) for the traces uniform, uniDF, biDF, hotbiDF, 1Hotspot, 2Hotspot and 4Hotspot; series: Static, Adaptive, Baseline - 8B, Static - 8B, Adaptive - 8B, Baseline - 4B, Static - 4B, Adaptive - 4B]

  • On average, adaptive shortcuts w/ 50 RF-enabled routers on a 4B mesh provide
    – 62% (82%) power (area) savings over baseline
    – Performance comparable to baseline

RF-I Enables Power Savings (cont)

  • Adaptive RF-I enabled NoC
    – Cost effective in terms of both power and performance


RF-I Enabled Multicast

[Figure: a request scenario (GetS followed by Fill) in a conventional NoC, compared with an RF-I enabled NoC in which the routers have Tx/Rx]

Virtual Circuit Tree Multicast [8]

  • Demonstrated the importance of multicast for current/future NoCs
  • Enhances NoC routers with additional state tables
    – Store routing information for dynamically discovered multicast trees
    – Dynamically spawn packets to follow the communication tree
  • Reduces NoC congestion and multicast latency

RF-I Enabled Multicast

  • RF-I provides a natural means for multicast
  • Multiple receivers listen to the same channel
  • Conceptually, an RF shortcut with multiple destinations

Multicast in our architecture

  • 50 RF-enabled routers
  • 16 adaptive shortcuts
  • 34 routers left to tune to the multicast channel
  • Allocate 1 channel for multicast
  • 15 adaptive shortcuts
  • 35 routers left to tune to the multicast channel


RF-I Enabled Multicast (cont)

  • We accelerate two multicast messages: Fills and Invalidates
  • Both of which are issued by cache banks
    – Limit multicast senders to be cache banks
  • We use a coarse-grain arbitration scheme
    – To decide which component can use the multicast channel
    – A cache bank in a cluster is chosen as the designated multicast sender for a fixed period of time
    – The caches send multicast messages to the designated sender over conventional wires

Example Multicast (MC) Scenario

[Figure: a cache that wants to send MC messages forwards them to the designated MC sender, which TRANSMITs on the multicast channel; the MC recipients RECEIVE]


Multicast Results

  • On average RF-I MC+SC provides:
    – 37% reduction in latency
    – at a cost of a 25% increase in NoC power
  • 20 and 50 indicate:
    – % of distinct source-destination pairs
    – simulate multicast destination reuse
  • Perform a fair comparison with VCT

Unified Analysis


Deadlock: To Avoid or Confront?

  • South-Last Strategy [Ogras and Marculescu, 2006]
    – Routes which can lead to circular buffer dependence are forbidden, which avoids deadlock
    – Restricts shortcut selection
      • Results show that it effectively halves performance
  • Deadlock Detection & Recovery (DDR)
    – Based on Duato and Pinkston’s theory [Duato and Pinkston 2001]
      • If deadlock occurs, route all packets in the network on a spare virtual channel
      • Use deadlock-free XY-routing
      • Packets entering the network after this point may be routed normally

How to detect deadlock…?

  • Rather than detect that deadlock has occurred
    – Detect a sufficient condition for deadlock: circular buffer dependency
  • Each router maintains a list of other routers waiting on it
  • When the buffer at neighbor router d is full, sender s transmits a waiting-list message to that neighbor
    – Bit vector indicating which routers are waiting on s, as well as s’s ID
    – If a router is “waiting on itself,” circular buffer dependency has occurred
      • Raise DEADLOCK condition
    – If d’s buffer empties, s sends a one-time clear-waiting-list message to reset state
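The waiting-list protocol can be sketched with one bit vector per router. This is a simplified model, not the hardware design: the `blocked` list of (sender, full neighbor) pairs is a hypothetical input, messages are exchanged in a fixed loop order until nothing changes, and the clear-waiting-list message is not modeled.

```python
def detect_deadlock(num_routers, blocked):
    """Return the id of a router that finds itself in its own waiting list, else None.

    blocked: list of (s, d) pairs meaning sender s cannot send because d's buffer is full.
    """
    waiting = [0] * num_routers          # bit i of waiting[r] set => router i waits on r
    changed = True
    while changed:
        changed = False
        for s, d in blocked:
            message = waiting[s] | (1 << s)      # routers waiting on s, plus s's own ID
            merged = waiting[d] | message
            if merged != waiting[d]:
                waiting[d] = merged
                changed = True
            if waiting[d] & (1 << d):            # d is "waiting on itself"
                return d
    return None


# Router ids follow the example names R10, R11 and R21.
cycle = [(10, 11), (11, 21), (21, 10)]       # circular buffer dependency
chain = [(10, 11), (11, 21)]                 # blocked, but no cycle
print(detect_deadlock(100, cycle), detect_deadlock(100, chain))
```

Which router raises the condition first depends on message ordering; the sketch simply reports the first one that sees its own bit.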

Deadlock Detection Example

  • N inbound buffer at R21 fills up
  • R11 can’t send to R21
  • R11 tells R21:
    – {R11} waiting on you
  • W inbound buffer at R11 fills up
  • R10 tells R11:
    – {R10} waiting on you
  • R11 tells R21:
    – {R10,R11} waiting on you
  • If there is circular dependence
    – R21 will eventually see that it is waiting on itself: DEADLOCK!

[Figure: routers R10, R11 and R21 with their in/out buffers, and the waiting-list bit vectors (indexed R0, R1, …, R10, R11, R21, …) as they are updated]


Conclusion

RF-Interconnect:

  • Enables adaptive NoC
    – Bandwidth can be flexibly allocated to match the communication demands of applications
  • Offers dramatic power and area savings
    – By simplifying baseline NoC topology
    – Provides performance of a 16B mesh on a 4B mesh
      • 62% power savings, 82% area savings
  • Natural means of multicast

Future Directions

  • Fine-Grain Adaptation
    – On-Demand or Phase-Specific Shortcut/Multicast
  • Message-Based Acceleration
    – Application-Specific Synchronization
    – Cache Coherence
    – NUCA Migration
  • Deadlock Free Routing
    – Application-Specific Turn Removal
  • Physical Implementation

References

[1] N. Agarwal, L-S. Peh, and N. Jha. Garnet: A detailed interconnection network model inside a full-system simulation framework. Technical Report CE-P08-001, Dept. of Electrical Engineering, Princeton University, 2007.

[2] M. Chang, J. Cong, A. Kaplan, M. Naik, G. Reinman, E. Socher and S.W. Tam, “CMP Network-on-Chip Overlaid With Multi-Band RF-Interconnect,” The 14th International Symposium on High-Performance Computer Architecture, Salt Lake City, UT, pp. 191-202, February 2008. (Best Paper Award)

[3] M. F. Chang, J. Cong, A. Kaplan, C. Liu, M. Naik, J. Premkumar, G. Reinman, E. Socher, and R. Tam, “Power Reduction of CMP Communication Networks via RF-Interconnects,” Proceedings of the 41st Annual International Symposium on Microarchitecture (MICRO), Lake Como, Italy, pp. 376-387, November 2008.

[4] J. Cong and D.Z. Pan. Interconnect estimation and planning for deep submicron designs. In Proceedings of DAC-36, 1999.

[5] J.D. Owens, W.J. Dally, R. Ho, D.N. Jayasimha, S.W. Keckler, and L-S. Peh. Research challenges for on-chip interconnection networks. IEEE Micro, 27(5):96–108, 2007.

[6] A. Pinto, L. Carloni, and A. Sangiovanni-Vincentelli. Constraint-driven communication synthesis. In Design Automation Conference, June 2002.

[7] H. Wang, X. Zhu, L-S. Peh, and S. Malik. Orion: A power performance simulator for interconnection networks. In Proceedings of MICRO-35, November 2002.

[8] N. Jerger, L. Peh, and M. Lipasti. Virtual Circuit Tree Multicasting: A Case for Hardware Multicast Support. International Symposium on Computer Architecture, June 2008.

For updated slides of this tutorial, please go to http://cadlab.cs.ucla.edu/~cong