
SLIDE 1

1

RF-Interconnect and its Applications to NoC Design

Frank Chang, Jason Cong and Glenn Reinman

E-mails: mfchang@ee.ucla.edu, cong@cs.ucla.edu, Glenn.reinman@cs.ucla.edu

NOCS Tutorial Course, May 10 2009, San Diego, California

2

RF-Interconnect

SLIDE 2

2

3

Outline

Future Network-on-Chip (NoC) needs and development trends
Traditional baseband-interconnect constraints
Multiband RF-Interconnect (RF-I) advantages

– Scalability in latency, energy/bit, data rate (Gbps/link) and overhead (area/Gb)
– On-chip demonstrations
– Off-chip demonstrations
– Remaining technology challenges

Potential RF-I system applications

4

Current Trend in CMP

65nm CMOS 80 tile NoC
10X8 2D mesh network-on-chip running @ 4GHz
Bisection bandwidth 256GB/s
1 TFLOPS @ 1V about 98W
Needs total 75 Clk cycles from the lower left corner to the upper right corner

ISSCC 2007: An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS (Sriram Vangal et al., Intel)

SLIDE 3

3

5

Future CMP Development

Trends:

Heterogeneous/domain-specific system architecture
Many-core massive parallel data processing
System integration in deep-scaled CMOS technology
Low supply voltage with sub-Vth digital operation

Issues:

Performance increasingly dependent on inter-core or inter-system communications

6

Scaling of Traditional Interconnect

Scaling reduces delay of logic gates but not of wires
Latency is RC limited (~L^2, where L is the wire length)
Using CMOS repeaters reduces latency (~L) but receives no benefit from scaling
Even low swing signaling requires extensive equalization
Waste of broad bandwidth available from modern CMOS devices (ft>150GHz, fmax>250GHz)
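
The quadratic versus linear length dependence above can be illustrated numerically. This is only a sketch under stated assumptions: it uses the common distributed-RC delay approximation 0.38·r·c·L², borrows the per-mm wire values quoted on the 32nm comparison slide of this deck (306 Ω/mm, 315 fF/mm), and assumes a hypothetical 20 ps repeater delay and 1 mm segment length that do not come from the slides.

```python
# Sketch: a bare RC wire has delay ~L^2, a repeated wire has delay ~L.
# R_PER_MM and C_PER_MM are the 32nm values quoted later in this deck;
# the repeater delay and segment length are assumed for illustration only.

R_PER_MM = 306.0      # ohm per mm
C_PER_MM = 315e-15    # farad per mm


def unrepeated_delay_ps(length_mm: float) -> float:
    """Distributed-RC delay of a bare wire, in picoseconds."""
    return 0.38 * R_PER_MM * C_PER_MM * length_mm ** 2 * 1e12


def repeated_delay_ps(length_mm: float, segment_mm: float = 1.0,
                      repeater_ps: float = 20.0) -> float:
    """Delay when a repeater (assumed repeater_ps) drives every segment."""
    segments = length_mm / segment_mm
    return segments * (unrepeated_delay_ps(segment_mm) + repeater_ps)


if __name__ == "__main__":
    for length in (1, 5, 10, 20):
        print(f"{length:>2} mm  bare: {unrepeated_delay_ps(length):9.1f} ps"
              f"  repeated: {repeated_delay_ps(length):7.1f} ps")
```

Doubling the length quadruples the bare-wire delay but only doubles the repeated-wire delay, which is the trend the slide summarizes.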


SLIDE 4

4

7

Baseband Interconnect Issues

Latency is large across the chip
Bandwidth is RC limited (~1Gbps/wire)
Communication pattern is fixed (non-reconfigurable)
Energy consumption is high and not scalable (~10pJ/bit/cm)
At 22nm technology, the total network power using buffers can be as high as 150W*
Future microprocessors may encounter communication congestion, and most of the energy will be spent on “talking” instead of computing

*“Research Challenges for On-Chip Interconnection Networks,” IEEE Micro, 2007

8

Communication Challenges

On-Chip Issues

– # Cores in Chip-Multiprocessor (CMP) growing

Increasing bandwidth demand on interconnect

– Wires scaling poorly compared to transistors

Increased latency to communicate between distant points on

CMP

Off-chip limited by chip-to-chip, board-to-board, board-to-backplane

communications

Requirements on future interconnect

– Scalable, reliable – Support high traffic volume with low latency – Constrained by

Power
Silicon Area
Cost (compatibility with mainstream CMOS technology)

SLIDE 5

5

9

How Can RF Help?

fT will exceed 600GHz at 16nm and fmax will even approach 1THz!
Millimeter-wave CMOS circuits have been developed for 60GHz and recently for 324 GHz bands*
Incredible bandwidth is available in future but most people neglect that!
EM waves travel at the (effective) speed of light (~7ps/mm in Silicon)

*Huang, Larocca and Chang, “324GHz CMOS Frequency Generator using Linear Superposition Technique,” pp. 476- 477, 2008 ISSCC

10

[Figure: measured VCO output spectrum, Pout (dBm) versus Frequency (GHz), spanning about 323.0 to 324.0 GHz]

UCLA 90nm CMOS VCO at 324GHz

(ISSCC 2008*)

CMOS Voltage Controlled Oscillator, measured with a subharmonic mixer driven by an 80 GHz synthesizer local oscillator. The mixing relation is fVCO - 4*fLO = fIF, i.e. fVCO - 4*(80 GHz) = 3.5 GHz, yielding fVCO = 323.5 GHz!

On-Wafer VCO Test Setup at JPL

CMOS VCO designed by Frank Chang’s group at UCLA, fabricated in 90nm process 323.5GHz VCO

*Huang, D., LaRocca T., Chang, M.-C. F., “324GHz CMOS Frequency Generator Using Linear Superposition Technique,” IEEE International Solid-State Circuits Conference (ISSCC), pp. 476-477, Feb 2008, San Francisco, CA
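
The subharmonic-mixer arithmetic on this slide can be written as a one-line helper. This is a minimal sketch; the function and parameter names are hypothetical, and the only inputs are the numbers stated on the slide.

```python
def vco_frequency_ghz(f_lo_ghz: float, harmonic: int, f_if_ghz: float) -> float:
    """Recover the VCO frequency from fVCO - harmonic * fLO = fIF."""
    return harmonic * f_lo_ghz + f_if_ghz


if __name__ == "__main__":
    # 80 GHz LO, 4th harmonic, 3.5 GHz measured IF, as stated on the slide.
    print(vco_frequency_ghz(80.0, 4, 3.5), "GHz")
```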

SLIDE 6

6

11

4xf0 by Linear Superposition


12

Communication beyond Baseband

Ultra-high carrier frequencies can be generated and modulated by modern CMOS to enable simultaneous multiband communications with higher aggregate data rate

On/off-chip transmission lines and off-chip near-field antennas can guide waves (RF modulated data) from transmitter to receiver with recoverable attenuation in short distances (<30cm)

SLIDE 7

7

13

Carrier frequencies can be generated and modulated using modern CMOS to enable simultaneous multiband communications with higher aggregate data rate

Higher carrier frequencies can avoid baseband digital noise and cause less frequency dispersion across the band

Multiband Communications

14

Bi-directional Bus

Advantages:

Higher combined data rate
Simultaneous, bi-directional communications
Re-configurable between bands
Low in-band coupling for parallel bus
Potentially with fewer I/O pins and smaller routing area

RF-Interconnect Concept


SLIDE 8

8

15

Loss of 1.5dB/mm over 100GHz of bandwidth

Differential TL: IBM 90nm Process
Width: 3um
Spacing: 3um
Total Thickness: Two Top Metals (M8, M7, 0.5um each) = 1µm
Metal Resistivity: 0.0424Ohm/Sq

Differential Transmission Line

16

Multiband FDMA-Interconnect

In TX, each mixer up-converts an individual baseband stream into a specific frequency band (or channel)

N different data streams (N=6 in exemplary figure above) may transmit simultaneously on the shared transmission medium to achieve higher aggregate data rates

In RX, individual signals are down-converted by a mixer, and recovered after a low-pass filter
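
The up-convert, share, down-convert, low-pass sequence described above can be sketched in a few lines. This is a toy model under stated assumptions, not the circuit from the slides: carriers are ideal cosines with a whole number of cycles per bit (so bands are orthogonal), data is ±1 NRZ, and the low-pass filter is replaced by averaging over one bit period. All names and constants are hypothetical.

```python
import math

SAMPLES_PER_BIT = 64  # assumed simulation resolution


def transmit(streams, cycles_per_bit):
    """Up-convert each bit stream onto its own carrier and sum on one line."""
    n_bits = len(streams[0])
    line = []
    for n in range(n_bits * SAMPLES_PER_BIT):
        bit_idx, t = divmod(n, SAMPLES_PER_BIT)
        phase = 2 * math.pi * t / SAMPLES_PER_BIT
        sample = 0.0
        for bits, cycles in zip(streams, cycles_per_bit):
            level = 1.0 if bits[bit_idx] else -1.0   # NRZ baseband symbol
            sample += level * math.cos(cycles * phase)  # mixer (up-convert)
        line.append(sample)
    return line


def receive(line, cycles):
    """Down-convert one band and low-pass (average over each bit period)."""
    bits = []
    for start in range(0, len(line), SAMPLES_PER_BIT):
        acc = 0.0
        for t in range(SAMPLES_PER_BIT):
            carrier = math.cos(2 * math.pi * cycles * t / SAMPLES_PER_BIT)
            acc += line[start + t] * carrier
        bits.append(1 if acc > 0 else 0)
    return bits


if __name__ == "__main__":
    streams = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 1]]
    carriers = [4, 8, 12]  # carrier cycles per bit, one band per stream
    shared_line = transmit(streams, carriers)
    for bits, cycles in zip(streams, carriers):
        print(cycles, receive(shared_line, cycles) == bits)
```

Each receiver sees the sum of all bands, yet recovers only its own stream because the other carriers average to zero over a bit period.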

[Figure: signal spectrum (signal power versus frequency)]

SLIDE 9

9

17

[Figure: FDMA-interconnect system; data streams Data1, Data2, ..., DataN share one medium; axis label: power]

FDMA-Interconnect System

18

Incredible CMOS Bandwidth

[Figure: ft and fmax (GHz) versus channel length (nm); curves: ft_DRAM, fmax_DRAM, ft_NFET, fmax_NFET]

Technology (nm) | 90 | 65 | 45 | 32 | 22 | 16
Number of Cores | 8 | 16 | 40 | 64 | 120 | 189
BW (Bisection) [GB/s] | 91 | 128 | 202 | 256 | 350 | 440
Chip Power (total) [W] | 100 | 120 | 144 | 173 | 207 | 249
fmax [GHz] | 200 | 270 | 370 | 480 | 590 | 710
fvco [GHz] | 320 | 432 | 592 | 768 | 944 | 1136
Max Aggregate Data Rate [Gb/s/wire] | 160 | 216 | 296 | 384 | 472 | 568

Maximum aggregate data rate for RF-Interconnect can reach 500Gb/s/wire @16nm Tech Node

SLIDE 10

10

19

Advantages of RF-I over Parallel Bus

Latency – speed-of-light data transmission
Bandwidth – high aggregate data rate through simultaneous transmissions on multiple bands of RF modulated signals
Area – avoid extensive use of repeaters
Energy – low overall energy per bit
Reconfigurability – efficient bidirectional and tunable communications via shared on/off-chip transmission lines or off-chip antennas

20

RF-Interconnect Demonstrations

Off-chip (On-board) Simultaneous Dual-band Communications through RF-Interconnect (ISSCC 05)
Inter-layer 3DIC RF-Interconnect (ISSCC 07)
On-chip Simultaneous generation of multi-band carriers (RFIC 08)
On-Chip Tri-band simultaneous communications (VLSI 09)

SLIDE 11

11

21

Off-Chip FDMA Links (ISSCC 05*)

A 2-carrier RF-I provides simultaneous off-chip communication between 4 CMOS chips in 0.18um technology
1 baseband and 1 RF band at 7.4GHz
Selectivity between bands is achieved using bandpass or lowpass filtering.
The RF carrier was modulated using BPSK.
Using this scheme, simultaneous data rates of (2+2) Gb/s were achieved: 2Gbps in the baseband and 2Gbps in the RF band.

*J. Ko, J. Kim, Z. Xu, Q. Gu, C. Chien, and M.F. Chang, “An RF/Baseband FDMA-Interconnect Transceiver for Reconfigurable Multiple Access Chip-to-Chip Communication,” in 2005 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, February 2005

22

3D-IC Layer-to-Layer RF- Interconnect (ISSCC 07*)

3DIC RF-I in MIT-Lincoln Lab 0.18 μm 3DIC technology
Data is modulated with amplitude shift keying (ASK) modulation of a 25GHz carrier
Low energy-per-bit: 0.39pJ/bit
High data rate: 11Gb/s

(a) Schematic of 3DIC RF-I (b) Eye diagram with 11Gb/s data rate (c) Die photo of the 3DIC RF-I

Q. Gu, Z. Xu, J. Ko and M.F. Chang, "Two 10Gbps/pin Low Power Interconnect Methods for 3D IC", 2007 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, vol.50, pp.448-449, Feb. 2007, San Francisco, California, USA

SLIDE 12

12

23

Simultaneous Sub-harmonic Injection Locked mm-Wave Frequency Generation *(RFIC 2008)

Sub-harmonic injection-locked VCOs simultaneously lock to one single reference frequency

Advantages:

– Eliminate multiple PLLs – Low Power Consumption – Small Area

Master VCO

Non-linear Harmonic Generator

Slave VCOs

*Sai-Wang Tam, Eran Socher, Alden Wong, Yu Wang, Lan Vu, M.F. Chang, "Simultaneous Sub-harmonic Injection- Locked mm-Wave Frequency Generators for Multi-band Communications in CMOS", IEEE RFIC Sym., 2008

24

Sub-harmonic Injection Locked VCO*

(RFIC 2008)

LC-based VCO core
Differential pair for odd harmonic generation
Single-ended for even harmonic generation
Injection locking to high harmonic within locking range of the VCO

This Work*:
Process: IBM 90nm CMOS
Free Running Frequency (GHz): 29.3
Max Locking Range (GHz): 5.6
Locking Harmonics: 2nd, 4th, 6th, 8th; 3rd, 5th, 7th
Power (mW): 4

*Sai-Wang Tam, Eran Socher, Alden Wong, Yu Wang, Lan Vu, M.F. Chang, "Simultaneous Sub-harmonic Injection- Locked mm-Wave Frequency Generators for Multi-band Communications in CMOS", IEEE RFIC Sym., 2008

SLIDE 13

13

25

Simultaneous Sub-harmonic Injection Locked Multi-Frequency Generation (RFIC 2008)

Sub-harmonic injection-locked VCOs simultaneously lock to one single master reference frequency

Advantages:

– Eliminate Multiple PLLs
– Low Power Consumption
– Small Silicon Area

Demonstrated sub-harmonic 30GHz and 50GHz injection-locked VCOs in IBM 90nm Process

Master VCO

Non‐linear Harmonic Generator

Slave VCO

(a) Output Spectrum of the 30GHz and 50GHz VCO simultaneously locked with the same reference source at 9.7GHz (b) Die Photo of the 30GHz Sub-harmonic Injection VCO

26

RF-I using ASK Modulation

TX: The transformer couples the output of the VCO to the ASK modulator, and a simple modulator generates the ASK signal

RX: The differential-mutual-mixer (self-mixer) acts as the envelope detector. Then a simple buffer and Schmitt Trigger recover the signal to rail-to-rail swing.

Don’t Need Carrier Synchronization!
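
Why a self-mixing receiver needs no carrier synchronization can be seen in a toy model. This is a sketch under assumptions, not the circuit on the slide: the self-mixer is modeled as squaring the received signal, the buffer and Schmitt trigger as a fixed threshold on the mean square over one bit, and 12 carrier cycles per bit corresponds to the 60GHz carrier and 5Gbit/s data of the simulated example in this deck. Names and the threshold are hypothetical.

```python
import math

SAMPLES_PER_BIT = 64   # assumed simulation resolution
CYCLES_PER_BIT = 12    # 60 GHz carrier / 5 Gbit/s data


def ask_modulate(bits, carrier_phase=0.0):
    """On/off-key the carrier: bit 1 sends the carrier, bit 0 sends nothing."""
    signal = []
    for bit in bits:
        for t in range(SAMPLES_PER_BIT):
            angle = 2 * math.pi * CYCLES_PER_BIT * t / SAMPLES_PER_BIT
            signal.append(bit * math.cos(angle + carrier_phase))
    return signal


def envelope_detect(signal, threshold=0.25):
    """Self-mix (square), average over a bit, then slice to 0/1."""
    bits = []
    for start in range(0, len(signal), SAMPLES_PER_BIT):
        chunk = signal[start:start + SAMPLES_PER_BIT]
        mean_square = sum(s * s for s in chunk) / SAMPLES_PER_BIT
        bits.append(1 if mean_square > threshold else 0)
    return bits


if __name__ == "__main__":
    data = [1, 0, 0, 1, 1, 0, 1]
    for phase in (0.0, 1.0, 2.5):  # receiver never learns this phase
        print(phase, envelope_detect(ask_modulate(data, phase)) == data)
```

The mean square of the carrier is 0.5 regardless of its phase, so the detector output does not depend on any phase relationship between transmitter and receiver.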

SLIDE 14

14

27

Simulated RF-I using ASK Modulation

[Figure: simulated waveforms: VCO output (60GHz), ASK modulated signal, mixer output, 5Gbit/s data input]

28

Tri-Band On-Chip RF-Interconnect

IBM 90nm process
5mm Differential Transmission Line
Total 3 Channels: 2RF + 1Baseband
Differential mode for RF: 30GHz and 50GHz
Common Mode for Baseband
Total Aggregate Data Rate is 10Gb/s

50GHz TX 30GHz TX Base Band TX 50GHz RX 30GHz RX Base Band RX

SLIDE 15

15

29

Tri-band FDMA-Interconnect Layout

30

Tri-band On-Chip RF-I Test Results

30GHz Channel 50 GHz Channel

30GHz Channel 50GHz Channel Base Band Channel

Process: IBM 90nm CMOS Digital Process
Total 3 Channels: 30GHz, 50GHz, Base Band
Data Rate in each channel: RF Band 4Gbps, Base Band 2Gbps
Total Data Rate: 10Gbps
Bit Error Rate Across all Bands: <10E-9
Latency: 6 ps/mm
Energy Per Bit (RF): 0.09* pJ/bit/mm
Energy Per Bit (BB): 0.125 pJ/bit/mm

[Figures: data output waveform; output spectrum of the RF bands, 30GHz and 50GHz]

*VCO power (5mW) can be shared by all (many tens) parallel RF-I links in NoC and does not burden individual link significantly.

SLIDE 16

16

31

Inter-channel Modulation in multi-band ASK Transmitter

Switch in f2 is able to directly modulate the signal from f1 ⇒ causes severe inter-channel interference

Additional transformer avoids signal current flowing through the switch in the other channel ⇒ no inter-channel interference through direct modulation

32

Base Band Common Mode Interconnect*

Base Band is transmitted in Common Mode
Using Capacitive Coupling method:
Common Mode Swing is controlled to be about 100mV
Low Swing and save power

*“A 5.6mW 1-Gbps/pair Pulse Signaling Transceiver for a Fully AC Coupled Bus”, Jongsun Kim, Ingrid Verbauwhede, Mau-Chung Frank Chang, JSSC, Vol. 40, No. 6, June 2005

SLIDE 17

17

33

Differential mode for RF communications

– Using inductive coupling with band-pass characteristic
– It is able to filter out the undesired channel

Common mode for baseband communication

– Common mode signal is tapering out at the center of the RX transformer loop

Multi-Band ASK Receiver

34

Signal to Interference Ratio (SIR)

Determine the max effective communication distance using SIR
Major source of interference: Coupling from adjacent TL
Side walls between TLs effectively suppress the cross-talk

SLIDE 18

18

35

Multi-band ASK RF-I Scaling

Technology | # of Carriers | Data rate per carrier (Gb/s) | Total data rate per wire (Gb/s) | Power (mW) | Energy per bit (pJ) | Area (TX+RX) (mm2) | Area/Gbit (µm2/Gbit)
90nm | 3RF + 1BB | 5 | 20 | 20 | 1.00 | 0.022 | 1100
65nm | 4RF + 1BB | 6 | 30 | 25 | 0.83 | 0.024 | 800
45nm | 5RF + 1BB | 7 | 42 | 30 | 0.71 | 0.023 | 540
32nm | 6RF + 1BB | 8 | 56 | 35 | 0.63 | 0.021 | 380
22nm | 7RF + 1BB | 9 | 72 | 40 | 0.56 | 0.019 | 260
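
The derived columns of the scaling table follow from the others: total data rate is (RF carriers + 1 baseband) × rate per carrier, energy per bit is power divided by total rate (mW per Gb/s equals pJ/bit), and area per Gbit is area divided by total rate. A small sketch that recomputes them, with hypothetical names; the area results land close to, but not exactly on, the rounded values in the table.

```python
# (node, RF carriers, Gb/s per carrier, power in mW, TX+RX area in mm^2),
# copied from the scaling table; every row also has one baseband channel.
ROWS = [
    ("90nm", 3, 5, 20, 0.022),
    ("65nm", 4, 6, 25, 0.024),
    ("45nm", 5, 7, 30, 0.023),
    ("32nm", 6, 8, 35, 0.021),
    ("22nm", 7, 9, 40, 0.019),
]


def derive(row):
    """Return (node, total Gb/s, pJ/bit, um^2 per Gb/s) for one table row."""
    node, rf_carriers, rate, power_mw, area_mm2 = row
    total_rate = (rf_carriers + 1) * rate          # RF bands plus baseband
    energy_pj = power_mw / total_rate              # mW / (Gb/s) = pJ/bit
    area_um2_per_gbit = area_mm2 * 1e6 / total_rate
    return node, total_rate, energy_pj, area_um2_per_gbit


if __name__ == "__main__":
    for row in ROWS:
        node, total, energy, area = derive(row)
        print(f"{node}: {total} Gb/s, {energy:.2f} pJ/bit, {area:.0f} um^2/Gbit")
```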

36

Comparison between Repeated Bus and Multi-band RF-I @ 32nm

Assumptions:

1. 32nm node; 30x repeater, FO4=8ps, Rwire = 306Ω/mm, Cwire = 315fF/mm, wire pitch=0.2um, Bus length = 2cm, f_bus = 1GHz, Bus Width 96Byte
2. Repeaters Area = 0.022mm2
3. Bus physical width = 160um
4. In that width we can fit 13 transmission lines, each with 7 carriers, each carrier carrying 8Gbps

Interconnect length = 2cm | RF-I | Repeated Bus
# of wires | 13 | 448
Data rate per carrier (Gbit/s) | 8 | NA
# of carriers | 7 | NA
Data rate per wire (Gbit/s) | 56 | 1
Aggregate Data Rate (Gbit/s) | 728 | 768
Bus Physical Width (um) | 160 | 160
Transceiver Area (mm2) | 0.27 | 0.022
Power (mW) | 455 | 6144
Energy per bit (pJ/bit) | 0.63 | 8
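
The RF-I column of this comparison can be cross-checked from the stated assumptions: 13 transmission lines × 7 carriers × 8 Gbit/s gives the aggregate rate, and power divided by aggregate rate gives energy per bit. A brief sketch with hypothetical function names, using only numbers from the slide:

```python
def aggregate_rate_gbps(lines: int, carriers: int, rate_per_carrier: float) -> float:
    """Aggregate data rate of a bundle of multi-band transmission lines."""
    return lines * carriers * rate_per_carrier


def energy_per_bit_pj(power_mw: float, rate_gbps: float) -> float:
    """mW divided by Gb/s is pJ/bit."""
    return power_mw / rate_gbps


if __name__ == "__main__":
    rf_rate = aggregate_rate_gbps(13, 7, 8)        # RF-I bundle
    rf_energy = energy_per_bit_pj(455, rf_rate)
    bus_energy = energy_per_bit_pj(6144, 768)      # repeated bus figures
    print(rf_rate, rf_energy, bus_energy, bus_energy / rf_energy)
```

With the slide's numbers, the repeated bus spends 12.8 times more energy per bit than the RF-I bundle at a similar aggregate rate.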

SLIDE 19

19

37

RF-I built on top of 2D-Mesh of CMP-NoC facilitates “super-highway” network for inter-core communications

Enables simultaneous multi-band communications by using multiple frequency carriers up to fmax of the super-scaled CMOS device (100-500GHz)

Encodes data by phase or amplitude modulation

Uses direct coupling between the transmission line and electronic transceivers

Enhances performance with scaling (higher aggregate data rates, lower latency, lower energy/bit and lower area consumption/bit)

RF-I enables ultra-high performance CMP with low latency, low energy per bit, high aggregate data rate and bandwidth/route-reconfigurable inter-core and inter-core-memory communications.

[Figure: RF transmission line overlaid on single-chip CMP, tapped with T/R circuits]

RF-I for CMP Inter-Core Communications

38

Comparison across process technology of…

– Traditional RC parallel bus
– RF-Interconnect
– Optical Interconnect

As process technology scales toward 22nm…

– RF-I has the lowest latency
– RF-I consumes least energy
– RF-I has highest data rate density

RF-I is fully compatible with modern CMOS

technology

RC/RF/Optical Interconnect Comparison

SLIDE 20

20

39

RF-I Fills the Technology Gap between RC Repeater and Optical Interconnect*

On-Chip:

– RC Repeater is non-scalable
– RF-I has better energy efficiency for d > 1mm

Off-Chip**:

– RF-I has better energy efficiency for d < 30cm
– Overhead of Optical-I is too high

RF-I may be the perfect fit for the mid-range interconnect

*Sai-Wang Tam, et al., "Ultra-Low Power/Latency and Scalable Multiband RF-Interconnect for Reconfigurable," submitted to Proceedings of the IEEE
**H. Cho, et al., “Power comparison between high-speed electrical and optical interconnects for interchip communication,” J. Lightw. Technol., Sep. 2004.

On-chip Off-chip**

40

Quick RF-I Summary

Bandwidth – high aggregate data rates through simultaneous transmissions of multiple bands with RF carrier modulated signals (324GHz carrier recently realized in 90nm CMOS, Chang et al., 2008 ISSCC)

Energy – low overall energy per bit (0.1pJ/bit/mm in 90nm to 0.05pJ/bit/mm in 22nm CMOS)

Low Overhead – high data rate/wire, low area/Gigabit and low latency due to speed-of-light data transmission

Re-configurability – efficient simultaneous communications with adaptive bandwidths via shared on/off-chip transmission medium

Total compatibility and scalability with mainstream digital CMOS technology

Multicast support – scalable means to communicate from one transmitter to a number of receivers on chip

SLIDE 21

21

41

RF-I Enabled NoC Communication Architecture

42

Outline

Application Diversity and NoC Motivation
Adaptive RF-Interconnect in NoC Design

– RF-I Overlaid on a Mesh NoC – Shortcut Selection

Architecture implications

– Performance improvement – Power Savings – Efficient multicast support – Deadlock

Conclusions and Future Work

SLIDE 22

22


44

Communication Diversity

Diverse communication patterns in parallel applications

– Different models of parallelism

Data decomposition, pipelined parallelism, master/worker, etc

– Different data inputs

Can vary communication hotspots and bandwidth demand

– Cache coherence

May favor broadcast/multicast
Implies applications have different “ideal” NoCs

– Topology – Bandwidth allocation – Latency

NoC design alternatives

– Traditional approach – Design for the general case

High bandwidth links in a uniform topology

– Our approach – Provide bandwidth where it is needed

Reconfigurable RF-I flexibly allocates bandwidth

SLIDE 23

23

45

Architectural Considerations for RF-I

Opportunities (both on and off chip)

– High bandwidth communication

Data distribution across many-core topologies
Vital in keeping many-core designs active

– Low latency communication

Enables users to apply parallel computing to a broader applications

through faster synchronization and communication

Faster cache coherence protocols

– Reconfigurability

Adapt NoC topology/bandwidth to the needs of the individual

application

– Power efficient communication

Challenges

– Frequency arbitration and Tx/Rx tuning – Application-specific modeling

46

Baseline Architecture

[Figure: 10x10 mesh floorplan. R (square) = router, C (circle) = processor core, $ (diamond) = cache bank, + (plus) = main memory interface]

10x10 mesh of pipelined routers

– NoC runs at 2GHz
– XY routing

64 4GHz 3-wide processor cores containing

– 8KB L1 Data Cache
– 8KB L1 Instruction Cache

32 L2 Cache Banks

– 256KB each
– Organized as shared NUCA cache

4 Main Memory Interfaces

– Labeled with + in the figure

SLIDE 24

24

47

Quantifying Application Diversity

For a 100 (10x10 mesh) router configuration:
Measures messages sent from a router on x-axis to router on y-axis

Legend for the figure on the coming slide

– Black: no traffic
– Dark Blue: [1, mean/4)
– Light Blue: [mean/4, mean/2)
– White: [mean/2, 2*mean)
– Orange: [2*mean, 4*mean]
– Red: (4*mean, inf)
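
The legend amounts to a binning rule on the message count relative to the mean. A minimal sketch of that rule follows; the function name is hypothetical, and the interval boundaries are taken from the legend as written (orange includes 4*mean, red starts strictly above it).

```python
def traffic_color(count: int, mean: float) -> str:
    """Map a router-to-router message count to the legend color."""
    if count == 0:
        return "black"          # no traffic
    if count < mean / 4:
        return "dark blue"      # [1, mean/4)
    if count < mean / 2:
        return "light blue"     # [mean/4, mean/2)
    if count < 2 * mean:
        return "white"          # [mean/2, 2*mean)
    if count <= 4 * mean:
        return "orange"         # [2*mean, 4*mean]
    return "red"                # (4*mean, inf)


if __name__ == "__main__":
    for count in (0, 10, 25, 50, 200, 401):
        print(count, traffic_color(count, mean=100))
```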

48

Messages Sent between Routers

[Figure: messages sent between routers for Barnes; regions of high communication are marked]

SLIDE 25

25

49

Messages Sent between Routers

[Figure: messages sent between routers for LU; regions of high communication are marked]

50

Outline

Application Diversity and NoC Motivation
Adaptive RF-Interconnect in NoC Design

– RF-I Overlaid on a Mesh NoC – Shortcut Selection

Architecture implications

– Performance improvement – Power Savings – Efficient multicast support – Deadlock

Conclusions and Future Work

SLIDE 26

26

51

RF-I Physical Organization

Physically

– RF-I is a bundle of transmission lines
– Connected to and shared between set of RF-enabled routers
– RF-enabled router consists of a Tx/Rx pair
– Single cycle transmission across 400mm2 die
– 16 carrier frequencies per transmission line @32 nm, with NoC running @2GHz

52

RF-Enabled Routers

RF-Enable 50 Routers

Represented by GREEN

Routing Tables

R

Add 6th Port

RX TX

Transmission Line…

SLIDE 27

27

53

RF-I Logical Organization

Logically:
RF-I behaves as set of N express channels
Each channel assigned to src, dest router pair (s,d)

Reconfigured by:
remapping shortcuts to match needs of different applications

LOGICAL A LOGICAL B

54

Outline

Application Diversity and NoC Motivation
Adaptive RF-Interconnect in NoC Design

– RF-I Overlaid on a Mesh NoC – Shortcut Selection

Architecture implications

– Performance improvement – Power Savings – Efficient multicast support – Deadlock

Conclusions and Future Work

SLIDE 28

28

55

Architecture-Specific Shortcuts

Design time shortcuts
Referred to as static shortcuts in the remainder of this talk

Selection Criteria
Consider an optimization function for a topology:

\[ W_{x,y} = \begin{cases} \text{length of shortest-path}(x,y) & \text{if } x \neq y \\ 0 & \text{if } x = y \end{cases} \]

We wish to minimize the total cost of the graph G representing the network-on-chip:

\[ \text{Total-Cost}(G) = \sum_{\text{all }(x,y)} W_{x,y} \]
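
As an illustration of the cost function, Total-Cost can be computed for a small mesh with one breadth-first search per source router. This is a sketch under assumptions: links have unit length, the graph is directed (so unidirectional shortcuts can be added later), and the helper names are hypothetical.

```python
from collections import deque


def mesh(width, height):
    """Directed adjacency sets for a width x height mesh of routers."""
    adj = {(x, y): set() for x in range(width) for y in range(height)}
    for (x, y) in adj:
        for nbr in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nbr in adj:
                adj[(x, y)].add(nbr)
    return adj


def total_cost(adj):
    """Sum of shortest-path lengths W over all ordered router pairs."""
    cost = 0
    for src in adj:
        dist = {src: 0}                 # W(src, src) = 0
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        cost += sum(dist.values())
    return cost


if __name__ == "__main__":
    grid = mesh(3, 3)
    print("3x3 mesh:", total_cost(grid))
    grid[(0, 0)].add((2, 2))            # one corner-to-corner shortcut
    print("with shortcut:", total_cost(grid))
```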

56

Shortcut-Selection Constraints

Each router should have at most 6 ports

– A router can be at most one shortcut source and at most one shortcut destination

Total of B (budget) unidirectional shortcuts: B = 16

For static shortcuts:

– RF-enable routers which are shortcut srcs/dests
– At most 16 RF-enabled routers

For adaptive shortcuts, shortcut srcs/dests are limited to

– RF-enabled routers chosen at design-time

SLIDE 29

29

57

Min Total-Cost(NoC): Heuristic 1

I) For each pair of non-adjacent routers i,j
– Make a new candidate graph Gi,j with an edge between them
– Calculate Total-Cost(Gi,j)
– Record improvement as |Total-Cost(Gi,j) – Total-Cost(G)|
II) Select shortcut of edge (x,y) such that Gx,y had max improvement
– Disallow any use of x as a src or y as a dest afterwards
III) Repeat (I) and (II) until budget B exhausted

[Figure: baseline mesh G, a candidate graph Gi,j with an added edge from i to j, and the best candidate Gx,y]

Complexity: O(BV^5)
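
Steps (I) to (III) can be sketched as a greedy loop. This is an illustrative reading of the slide, not the authors' implementation: it assumes unit-length links, directed shortcuts, a small mesh so exhaustive search stays cheap, and hypothetical helper names.

```python
from collections import deque
from itertools import permutations


def mesh(width, height):
    """Directed adjacency sets for a width x height mesh of routers."""
    adj = {(x, y): set() for x in range(width) for y in range(height)}
    for (x, y) in adj:
        for nbr in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nbr in adj:
                adj[(x, y)].add(nbr)
    return adj


def total_cost(adj):
    """Sum of shortest-path lengths over all ordered router pairs (BFS)."""
    cost = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        cost += sum(dist.values())
    return cost


def select_shortcuts_h1(adj, budget):
    """Greedily add the shortcut with the largest total-cost improvement."""
    adj = {node: set(nbrs) for node, nbrs in adj.items()}  # work on a copy
    used_src, used_dst, chosen = set(), set(), []
    for _ in range(budget):
        base = total_cost(adj)
        best = None
        for src, dst in permutations(adj, 2):
            # Skip adjacent pairs and routers already used as src / dest.
            if src in used_src or dst in used_dst or dst in adj[src]:
                continue
            adj[src].add(dst)                      # candidate graph G_{i,j}
            gain = base - total_cost(adj)
            adj[src].remove(dst)
            if best is None or gain > best[0]:
                best = (gain, src, dst)
        if best is None or best[0] <= 0:
            break
        _, src, dst = best
        adj[src].add(dst)
        used_src.add(src)
        used_dst.add(dst)
        chosen.append((src, dst))
    return chosen, total_cost(adj)


if __name__ == "__main__":
    shortcuts, cost = select_shortcuts_h1(mesh(4, 4), budget=2)
    print(shortcuts, cost)
```

Every candidate edge triggers a full recomputation of Total-Cost, which is why this heuristic is the more expensive of the two.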

58

Min Total-Cost(NoC): Heuristic 2

(I) Calculate Wi,j for all pairs i,j in G

– Record all Wi,j

(II) Select shortcut of edge (x,y) s.t. Wx,y = max(Wi,j)

– Which is the graph diameter
– Disallow any use of x as a src or y as a dest afterwards

(III) Repeat (I) and (II) until budget exhausted

[Figure: baseline mesh G]

These shortcuts tend to perform within 1% as well as those chosen with heuristic 1

Complexity: O(BV^3)
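
Heuristic 2 replaces the exhaustive candidate evaluation with a single all-pairs computation per round and always bridges the currently most distant eligible pair. The sketch below is one possible reading under assumptions: unit-length links, directed shortcuts, ties broken arbitrarily, and hypothetical helper names.

```python
from collections import deque


def mesh(width, height):
    """Directed adjacency sets for a width x height mesh of routers."""
    adj = {(x, y): set() for x in range(width) for y in range(height)}
    for (x, y) in adj:
        for nbr in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nbr in adj:
                adj[(x, y)].add(nbr)
    return adj


def all_pairs(adj):
    """Shortest-path length W for every ordered pair, via BFS per source."""
    lengths = {}
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        for dst, hops in dist.items():
            lengths[(src, dst)] = hops
    return lengths


def select_shortcuts_h2(adj, budget):
    """Repeatedly add a shortcut across the longest remaining shortest path."""
    adj = {node: set(nbrs) for node, nbrs in adj.items()}  # work on a copy
    used_src, used_dst, chosen = set(), set(), []
    for _ in range(budget):
        lengths = all_pairs(adj)
        candidates = [(hops, src, dst) for (src, dst), hops in lengths.items()
                      if hops > 1 and src not in used_src and dst not in used_dst]
        if not candidates:
            break
        _, src, dst = max(candidates)
        adj[src].add(dst)
        used_src.add(src)
        used_dst.add(dst)
        chosen.append((src, dst))
    return chosen, sum(all_pairs(adj).values())


if __name__ == "__main__":
    shortcuts, cost = select_shortcuts_h2(mesh(4, 4), budget=2)
    print(shortcuts, cost)
```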

SLIDE 30

30

59

Adaptive RF-I Shortcuts

Assume a profile of communication for an application

– Fi,j = count of messages sent between router i and router j

Change optimization function:

\[ \text{Total-Cost}(G) = \sum_{\text{all }(x,y)} F_{x,y} \cdot W_{x,y} \]

To offset effect of removing src/dest routers (already selected) from consideration

– Alternate router-to-router shortcuts with region-to-region shortcuts
– Allows placement of shortcuts at routers near a hotspot
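
Weighting each path length by the profiled message count makes heavily used pairs dominate the cost. A small sketch of the weighted cost follows, with a made-up two-entry traffic profile on a 3x3 mesh; the profile values, unit link lengths and helper names are all assumptions for illustration.

```python
from collections import deque


def mesh(width, height):
    """Directed adjacency sets for a width x height mesh of routers."""
    adj = {(x, y): set() for x in range(width) for y in range(height)}
    for (x, y) in adj:
        for nbr in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nbr in adj:
                adj[(x, y)].add(nbr)
    return adj


def distances_from(adj, src):
    """Shortest-path length from src to every reachable router (BFS)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def weighted_cost(adj, freq):
    """Sum of F[x,y] * W[x,y] over the profiled (src, dst) pairs."""
    return sum(count * distances_from(adj, src)[dst]
               for (src, dst), count in freq.items())


if __name__ == "__main__":
    grid = mesh(3, 3)
    # Hypothetical profile: a hot corner-to-corner flow plus light traffic.
    profile = {((0, 0), (2, 2)): 10, ((0, 0), (0, 1)): 1}
    print("before:", weighted_cost(grid, profile))
    grid[(0, 0)].add((2, 2))            # shortcut placed on the hot flow
    print("after: ", weighted_cost(grid, profile))
```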

60

Outline

Application Diversity and NoC Motivation
Adaptive RF-Interconnect in NoC Design

– RF-I Overlaid on a Mesh NoC – Shortcut Selection

Architecture implications

– Performance improvement – Power Savings – Efficient multicast support – Deadlock

Conclusions and Future Work

SLIDE 31

31

61

Results on Performance Improvement

Static Shortcuts

– 20% reduction in latency on average – 11% increase in NoC power

Adaptive Shortcuts with 50 RF-I routers

– 32% reduction in latency on average – 24% increase in NoC power

62

Power Savings

We can thin the

baseline mesh links

– From 16B… – …to 8B – …to 4B

RF-I makes up the

difference in performance while saving overall power!

– RF-I provides bandwidth where most necessary – Baseline RC wires supply the rest

16 bytes 8 bytes 4 bytes

Requires high bw to communicate w/ B

A B

SLIDE 32

32

63

Evaluation Methodology

Used detailed interconnection network simulator - Garnet[1]

Built probabilistic traces

– To cover different communication patterns that may be exhibited by future applications

Leveraged Orion[7], CosiNoC[6], IPEM[4] for power methodology

64

RF-I Enables Power Savings

Relative latency Relative power

[Figure: relative latency and relative power (axes 0.0 to 1.4) for traces uniform, uniDF, biDF, hotbiDF, 1Hotspot, 2Hotspot, 4Hotspot; series: Static, Adaptive, Baseline - 8B, Static - 8B, Adaptive - 8B, Baseline - 4B, Static - 4B, Adaptive - 4B]

On average adaptive shortcuts w/ 50 RF-enabled routers on a 4B mesh
62% power savings and 82% area savings over baseline
Performance comparable to baseline

SLIDE 33

33

65

RF-I Enables a Power Savings

Adaptive RF-I enabled NoC
Cost Effective in terms of both power and performance

66

Outline

Application Diversity and NoC Motivation
Adaptive RF-Interconnect in NoC Design

– RF-I Overlaid on a Mesh NoC – Shortcut Selection

Architecture implications

– Performance improvement – Power Savings – Efficient multicast support – Deadlock

Conclusions and Future Work

SLIDE 34

34

67

RF-I Enabled Multicast

[Figure: request scenario (Get S, Fill) in a conventional NoC compared with an RF-I enabled NoC, where one Tx reaches multiple Rx]

68

Virtual Circuit Tree Multicast [8]

Demonstrated the importance of multicast for

current/future NoCs

Enhance NoC routers with additional state

tables

– Store routing information for dynamically discovered multicast trees – Dynamically spawn packets to follow communication tree

Reduce NoC congestion and multicast latency

SLIDE 35

35

69

RF-I Enabled Multicast

RF-I provides natural means for multicast
Multiple receivers listen to same channel
Conceptually, RF shortcut with multiple

destinations

70

Multicast in our architecture

50 RF-enabled

routers

16 adaptive

shortcuts

34 routers left to

tune to the multicast channel

Allocate 1 channel

for multicast

15 adaptive

shortcuts

35 routers left to

tune to the multicast channel

SLIDE 36

36

71

RF-I Enabled Multicast (cont)

We accelerate two multicast messages: Fills and Invalidates

Both of which are issued by cache banks

– Limit multicast senders to be cache banks

We use coarse-grain arbitration scheme

– To decide which component can use the multicast channel
– A cache bank in a cluster is chosen as the designated multicast sender for a fixed period of time
– The other caches send their multicast messages to the designated sender over conventional wires

72

Designated MC Sender Wants to send MC msgs

TRANSMIT RECEIVE

MC recipients

Example Multicast(MC) Scenario

SLIDE 37

37

73

Multicast Results

On average RF-I MC+ SC provides:
37% reduction in latency
at a cost of 25% increase in NoC Power
20 and 50 indicates:
% of distinct source-destination pairs
simulate multicast destination reuse
Perform a fair comparison with VCT

74

Unified Analysis

SLIDE 38

38

75

Outline

Application Diversity and NoC Motivation
Adaptive RF-Interconnect in NoC Design

– RF-I Overlaid on a Mesh NoC – Shortcut Selection

Architecture implications

– Performance improvement – Power Savings – Efficient multicast support – Deadlock

Conclusions and Future Work

76

Deadlock: To Avoid or Confront?

South-Last Strategy [Ogras and Marculescu, 2006]

– Routes which can lead to circular buffer dependence are forbidden, which avoids deadlock
– Restricts shortcut selection

Results show that it effectively halves performance
Deadlock Detection & Recovery (DDR)

– Based on Duato and Pinkston’s theory [Duato and Pinkston 2001]

If deadlock occurs, route all packets in the network on a spare virtual channel
Use deadlock-free XY-routing
Packets entering network after this point may be routed normally

SLIDE 39

39

77

How to detect deadlock…?

Rather than detect that deadlock has occurred

– Detect a sufficient condition for deadlock: circular buffer dependency

Each router maintains a list of other routers waiting on it
When buffer at neighbor router d is full, sender s transmits waiting-list message to neighbor

– Bit vector indicating which routers are waiting on s, as well as s’s ID
– If a router is “waiting on itself,” circular buffer dependency has occurred: raise DEADLOCK condition
– If d’s buffer empties, s sends one-time clear-waiting-list message to reset state
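
The waiting-list propagation can be sketched with a small model. This is a simplified illustration, not the hardware protocol: each router is assumed to be blocked on at most one neighbor, a Python set stands in for the bit vector, the clear-waiting-list message is omitted, and all class and function names are hypothetical.

```python
class Router:
    def __init__(self, name):
        self.name = name
        self.waiting_on_me = set()   # stand-in for the waiting-list bit vector
        self.blocked_on = None       # neighbor whose inbound buffer is full


def report_blocked(sender, neighbor):
    """Sender cannot forward to neighbor: send and propagate waiting lists.

    Returns True when some router learns that it is waiting on itself,
    i.e. a circular buffer dependency exists.
    """
    sender.blocked_on = neighbor
    router = sender
    while router.blocked_on is not None:
        target = router.blocked_on
        message = router.waiting_on_me | {router.name}
        if target.name in message:
            return True                      # target is waiting on itself
        if message <= target.waiting_on_me:
            return False                     # nothing new to propagate
        target.waiting_on_me |= message
        router = target                      # target forwards if it is blocked
    return False


if __name__ == "__main__":
    r10, r11, r21 = Router("R10"), Router("R11"), Router("R21")
    print(report_blocked(r11, r21), sorted(r21.waiting_on_me))
    print(report_blocked(r10, r11), sorted(r21.waiting_on_me))
    # Closing the cycle: R21 now blocks on R10.
    print(report_blocked(r21, r10))
```

The first two calls mirror the example on the next slide: R21 first learns that {R11} is waiting on it, then {R10, R11}. The third call closes the loop and raises the deadlock condition.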

78

Deadlock Detection Example

N inbound buffer at R21 fills up
R11 can’t send to R21
R11 tells R21:

– {R11} waiting on you

W inbound buffer at R11 fills up
R10 tells R11:

– {R10} waiting on you

R11 tells R21:

– {R10,R11} waiting on you

If there is circular dependence

– R21 will eventually see that it is waiting on itself DEADLOCK!

[Figure: routers R10, R11 and R21 with their in/out buffers, and the waiting-list bit vector (entries R0, R1, ..., R10, R11, R21) after each step]

SLIDE 40

40

79

Outline

Application Diversity and NoC Motivation
Adaptive RF-Interconnect in NoC Design

– RF-I Overlaid on a Mesh NoC – Shortcut Selection

Architecture implications

– Performance improvement – Power Savings – Efficient multicast support – Deadlock

Conclusions and Future Work

80

Conclusion

RF-Interconnect:

Enables adaptive NoC

– Bandwidth can be flexibly allocated – To match the communication demands of applications

Offers dramatic power and area savings

– By simplifying baseline NoC topology – Provides performance of 16B mesh on a 4B mesh

62% power savings, 82% area savings
Natural means of multicast

SLIDE 41

41

81

Future Directions

Fine-Grain Adaptation

– On-Demand or Phase-Specific Shortcut/Multicast

Message-Based Acceleration

– Application-Specific Synchronization – Cache Coherence – NUCA Migration

Deadlock Free Routing

– Application-Specific Turn Removal

Physical Implementation

82

References

[1] N. Agarwal, L-S Peh, and N. Jha. Garnet: A detailed interconnection network model inside a full-system simulation framework. Technical Report CE-P08-001, Dept. of Electrical Engineering, Princeton University, 2007.
[2] M. Chang, J. Cong, A. Kaplan, M. Naik, G. Reinman, E. Socher and S.W. Tam, “CMP Network-on-Chip Overlaid With Multi-Band RF-Interconnect,” The 14th International Symposium on High-Performance Computer Architecture, Salt Lake City, UT, pp. 191-202, February 2008. (Best Paper Award)
[3] M. F. Chang, J. Cong, A. Kaplan, C. Liu, M. Naik, J. Premkumar, G. Reinman, E. Socher, and R. Tam, “Power Reduction of CMP Communication Networks via RF-Interconnects,” Proceedings of the 41st Annual International Symposium on Microarchitecture (MICRO), Lake Como, Italy, pages 376-387, November 2008.
[4] J. Cong and D.Z. Pan. Interconnect estimation and planning for deep submicron designs. In Proceedings of DAC-36, 1999.
[5] J.D. Owens, W.J. Dally, R. Ho, D.N. Jayasimha, S.W. Keckler, and L-S. Peh. Research challenges for on-chip interconnection networks. IEEE Micro, 27(5):96–108, 2007.
[6] A. Pinto, L. Carloni, and A. Sangiovanni-Vincentelli. Constraint-driven communication synthesis. In Design Automation Conference, June 2002.
[7] H. Wang, X. Zhu, L-S. Peh, and S. Malik. Orion: A power performance simulator for interconnection networks. In Proceedings of MICRO-35, November 2002.
[8] N. Jerger, L. Peh, and M. Lipasti. Virtual Circuit Tree Multicasting: A Case for Hardware Multicast Support. International Symposium on Computer Architecture, June 2008.

For updated slides of this tutorial, please go to http://cadlab.cs.ucla.edu/~cong
