SLIDE 1

Interconnect-Centric Computing

William J. Dally Computer Systems Laboratory Stanford University HPCA Keynote February 12, 2007

SLIDE 2

Outline

  • Interconnection Networks (INs) are THE central component of modern computer systems
  • Topology driven to high-radix by packaging technology
  • Global adaptive routing balances load - and enables efficient topologies
  • Case study: the Cray Black Widow
  • On-Chip Interconnection Networks (OCINs) face unique challenges
  • The road ahead…
SLIDE 4

INs: Connect Processors in Clusters

IBM Blue Gene

SLIDE 5

and on chip

MIT RAW

SLIDE 6

Connect Processors to Memories in Systems

Cray Black Widow

SLIDE 7

and on chip

Texas TRIPS

SLIDE 8

provide the fabric for network Switches and Routers

Avici TSR

SLIDE 9

and connect I/O Devices

Brocade Switch

SLIDE 10

Group History: Routing Chips & Interconnection Networks

  • Mars Router, Torus Routing Chip, Network Design Frame, Reliable Router
  • Basis for Intel, Cray/SGI, Mercury, Avici network chips

[Images: MARS Router (1984), Torus Routing Chip (1985), Network Design Frame (1988), Reliable Router (1994)]

SLIDE 11

Group History: Parallel Computer Systems

  • J-Machine (MDP) led to Cray T3D/T3E
  • M-Machine (MAP)

– Fast messaging, scalable processing nodes, scalable memory architecture

  • Imagine – basis for SPI

[Images: MDP Chip, J-Machine, Cray T3D, MAP Chip, Imagine Chip]

SLIDE 12

Interconnection Networks are THE Central Component of Modern Computer Systems

  • Processors are a commodity
    – Performance no longer scaling (ILP mined out)
    – Future growth is through CMPs - connected by INs
  • Memory is a commodity
    – Memory system performance determined by interconnect
  • I/O systems are largely interconnect
  • Embedded systems built using SoCs
    – Standard components
    – Connected by on-chip INs (OCINs)

SLIDE 13

Outline

  • Interconnection Networks (INs) are THE central component of modern computer systems
  • Topology driven to high-radix by packaging technology
  • Global adaptive routing balances load - and enables efficient topologies
  • Case study: the Cray Black Widow
  • On-Chip Interconnection Networks (OCINs) face unique challenges
  • The road ahead…
SLIDE 14

Technology Trends…

[Plot: bandwidth per router node (Gb/s, log scale 0.1-10,000) vs. year, 1985-2010. Points: Torus Routing Chip, Intel iPSC/2, J-Machine, CM-5, Intel Paragon XP, Cray T3D, MIT Alewife, IBM Vulcan, Cray T3E, SGI Origin 2000, AlphaServer GS320, IBM SP Switch2, Quadrics QsNet, Cray X1, Velio 3003, IBM HPS, SGI Altix 3000, Cray XT3, YARC/BlackWidow]

SLIDE 15

High-Radix Router


SLIDE 16

High-Radix Router

[Figure: low-radix router (small number of fat ports) vs. high-radix router (large number of skinny ports)]

SLIDE 17

Low-Radix vs. High-Radix Router

[Figure: a 16-input, 16-output network built from low-radix routers vs. one built from high-radix routers]

            Low-Radix     High-Radix
Latency:    4 hops        2 hops
Cost:       96 channels   32 channels

SLIDE 18

Latency

Latency = H·t_r + L/b = 2·t_r·log_k(N) + 2·k·L/B

where k = radix, B = total router bandwidth, N = # of nodes, L = message size, H = hop count, t_r = router delay per hop, b = per-port bandwidth
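As a quick illustration of how this formula trades header latency against serialization latency, here is a small Python sketch. The parameter values are placeholders chosen only to show the shape of the trade-off; they are not the numbers behind the 2003/2010 curves on the next slide.

```python
import math

def latency_ns(k, N, B, L, t_r):
    """Slide formula: T = H*t_r + L/b = 2*t_r*log_k(N) + 2*k*L/B."""
    header = 2 * t_r * math.log(N, k)   # hop count H = 2*log_k(N), t_r per router hop
    serialization = 2 * k * L / B       # reading b = B/(2k) per port, so L/b = 2*k*L/B
    return header + serialization

# Placeholder parameters: N = 1024 nodes, t_r = 20 ns per hop,
# B = 1000 Gb/s total router bandwidth, L = 1000-bit message (so L/B is in ns).
for k in (8, 16, 32, 64, 128, 256):
    print(k, round(latency_ns(k, N=1024, B=1000, L=1000, t_r=20), 1))
```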

SLIDE 19

Latency vs. Radix

[Plot: latency (nsec) vs. radix for 2003 and 2010 technology. Optimal radix ~40 for 2003, ~128 for 2010. As radix increases, serialization latency increases while header latency decreases.]

SLIDE 20

Determining Optimal Radix

Latency = Header Latency + Serialization Latency = H·t_r + L/b = 2·t_r·log_k(N) + 2·k·L/B

The optimal radix k satisfies:

k·log2(k) = (B·t_r·log(N)) / L = Aspect Ratio

where k = radix, B = total router bandwidth, N = # of nodes, L = message size
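A small numeric sketch of solving this implicit equation for k by bisection. The log base used for N and the technology parameters are assumptions for illustration, not the 1991-2010 design points plotted on the next slide.

```python
import math

def optimal_radix(B, t_r, N, L):
    """Solve k * log2(k) = A for k by bisection, where A = (B * t_r * log2(N)) / L.
    The base-2 log for N is an assumption; the slide just writes 'log N'."""
    A = B * t_r * math.log2(N) / L       # the aspect ratio
    lo, hi = 2.0, 4096.0
    for _ in range(100):                 # k*log2(k) is increasing in k, so bisection works
        mid = (lo + hi) / 2
        if mid * math.log2(mid) < A:
            lo = mid
        else:
            hi = mid
    return A, (lo + hi) / 2

# With the same placeholder parameters as before (B=1000 Gb/s, t_r=20 ns, N=1024, L=1000 bits),
# this gives an aspect ratio of 200 and an optimal radix in the neighborhood of 40.
```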

SLIDE 21

Higher Aspect Ratio, Higher Optimal Radix

[Plot: optimal radix (k) vs. aspect ratio, log-log scale, with technology points for 1991, 1996, 2003, and 2010]

SLIDE 22

High-Radix Topology

  • Use high radix, k, to get low hop count
    – H = log_k(N) (quick check below)
  • Provide good performance on both benign and adversarial traffic patterns
    – Rules out butterfly networks - no path diversity
    – Clos networks work well
      • H = 2·log_k(N) - with short circuit
    – Cayley graphs have nice properties but are hard to route
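A quick check of the two hop-count expressions, using the radix-64, 1024-endpoint Clos of the next slide as the example (rounding up to whole hops is added here for illustration):

```python
import math

k, N = 64, 1024                               # radix-64 routers, 1024 endpoints (next slide's example)
h_butterfly = math.ceil(math.log(N, k))       # H = log_k(N): 2 hops
h_clos_worst = 2 * math.ceil(math.log(N, k))  # H = 2*log_k(N): 4 in the worst case, fewer with the short circuit
print(h_butterfly, h_clos_worst)              # -> 2 4
```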

SLIDE 23

Example radix-64 Clos Network

[Figure: radix-64 folded Clos for 1024 endpoints - rank 1 switches Y0-Y31 each connect 32 endpoints (BW0-BW1023), rank 2 switches Y32-Y63 interconnect the rank 1 switches]

SLIDE 24

Flattened Butterfly Topology

SLIDE 25

Packaging the Flattened Butterfly

SLIDE 26

Packaging the Flattened Butterfly (2)

SLIDE 27

Cost

SLIDE 28

Outline

  • Interconnection Networks (INs) are THE central component of modern computer systems
  • Topology driven to high-radix by packaging technology
  • Global adaptive routing balances load - and enables efficient topologies
  • Case study: the Cray Black Widow
  • On-Chip Interconnection Networks (OCINs) face unique challenges
  • The road ahead…
SLIDE 29

Routing in High-Radix Networks

  • Adaptive routing avoids transient load imbalance
  • Global adaptive routing balances load for adversarial traffic
    – Cost/perf of a butterfly on benign traffic and at low loads
    – Cost/perf of a Clos on adversarial traffic

SLIDE 30

A Clos can statically load balance traffic using oblivious routing

[Figure: the same radix-64 Clos - rank 1 switches Y0-Y31 with endpoints BW0-BW1023, rank 2 switches Y32-Y63; traffic is spread evenly over the rank 2 switches. A routing sketch follows below.]
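One common way to realize this kind of oblivious load balancing is to pick the intermediate rank-2 switch uniformly at random, independent of the destination, so every flow is spread evenly over the middle stage. A minimal sketch under that reading; the switch counts are read off the figure and should be treated as illustrative, not as the exact BlackWidow configuration.

```python
import random

def oblivious_route(src_node, dst_node, nodes_per_rank1=32, num_rank1=32):
    """Spread traffic by picking the rank-2 switch uniformly at random (destination-independent).
    Switch naming follows the figure (rank 1: Y0-Y31, rank 2: Y32-Y63)."""
    src_sw = src_node // nodes_per_rank1           # rank-1 switch of the source (e.g. BW0-BW31 -> Y0)
    dst_sw = dst_node // nodes_per_rank1           # rank-1 switch of the destination
    if src_sw == dst_sw:
        return [f"Y{src_sw}"]                      # same rank-1 switch: no need to go up to rank 2
    mid = num_rank1 + random.randrange(32)         # random rank-2 switch Y32..Y63, chosen obliviously
    return [f"Y{src_sw}", f"Y{mid}", f"Y{dst_sw}"]

# e.g. oblivious_route(5, 1000) might return ['Y0', 'Y47', 'Y31']
```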

SLIDE 31

Transient Imbalance

SLIDE 32

With Adaptive Routing

SLIDE 33

Latency for UR traffic

SLIDE 34

Flattened Butterfly Topology

[Figure: flattened butterfly, nodes 0-7]

SLIDE 35

Flattened Butterfly Topology

[Figure: flattened butterfly, nodes 0-7]

What if node 0 sends all of its traffic to node 1?

SLIDE 36

Flattened Butterfly Topology

[Figure: flattened butterfly, nodes 0-7]

What if node 0 sends all of its traffic to node 1? How much traffic should we route over alternate paths?

SLIDE 37

Simpler Case: a ring of 8 nodes; send traffic λ from node 2 to node 5

  • Model: assume the queues form a network of independent M/D/1 queues

[Figure: ring of nodes 0-7; the minimal path from 2 to 5 carries traffic x1, the non-minimal path carries x2]

λ = x1 + x2

Min path delay = Dm(x1), non-min path delay = Dnm(x2)

  • Routing remains minimal as long as Dm′(λ) ≤ Dnm′(0)
  • Afterwards, route a fraction x2 non-minimally such that Dm′(x1) = Dnm′(x2)
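A numeric sketch of this balance condition, assuming unit-rate M/D/1 queues, a 3-hop minimal path and a 5-hop non-minimal path for the 2-to-5 traffic, and finite-difference derivatives; these modeling choices are assumptions made here for illustration.

```python
def md1_delay(rho, mu=1.0):
    """Mean sojourn time of an M/D/1 queue at utilization rho (service rate mu)."""
    return 1.0 / mu + rho / (2.0 * mu * (1.0 - rho))

def path_delay(x, hops):
    """Delay of a path whose hops are independent M/D/1 queues, each carrying load x."""
    return hops * md1_delay(x)

def nonminimal_fraction(total, h_min=3, h_nonmin=5, eps=1e-6):
    """Find x2 such that D_m'(x1) = D_nm'(x2) with x1 + x2 = total, per the slide."""
    deriv = lambda f, x: (f(x + eps) - f(x)) / eps            # forward-difference derivative
    d_min = lambda x: path_delay(x, h_min)
    d_nonmin = lambda x: path_delay(x, h_nonmin)
    if deriv(d_min, total) <= deriv(d_nonmin, 0.0):           # D_m'(total) <= D_nm'(0): stay minimal
        return 0.0
    lo, hi = 0.0, total
    for _ in range(60):                                       # bisection on x2
        x2 = (lo + hi) / 2
        if deriv(d_min, total - x2) > deriv(d_nonmin, x2):
            lo = x2                                           # minimal path still hurts more at the margin
        else:
            hi = x2
    return (lo + hi) / 2

# e.g. nonminimal_fraction(0.2) ~ 0 (stays minimal); nonminimal_fraction(0.5) ~ 0.15
```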

SLIDE 38

Traffic divides to balance delay; load is balanced at saturation

[Plot: accepted throughput vs. offered load (fraction of capacity), showing the model's overall, minimal, and non-minimal traffic]

SLIDE 39

Channel-Queue Routing

  • Estimate delay per hop by the local queue length Q_i
  • Overall latency estimated by L_i ~ Q_i·H_i
  • Route each packet on the route with the lowest estimated L_i (see the sketch below)
  • Works extremely well in practice
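A minimal sketch of that decision rule, a UGAL-style comparison between one minimal and one non-minimal candidate route; the queue readings in the example call are made up for illustration.

```python
def choose_route(q_min, h_min, q_nonmin, h_nonmin):
    """Pick the route with the lowest estimated latency L_i ~ Q_i * H_i."""
    est_min = q_min * h_min              # local queue length x hop count, minimal route
    est_nonmin = q_nonmin * h_nonmin     # same estimate for the non-minimal candidate
    return "minimal" if est_min <= est_nonmin else "non-minimal"

# e.g. choose_route(q_min=12, h_min=2, q_nonmin=3, h_nonmin=4) -> "non-minimal"
```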
SLIDE 40

Performance on UR Traffic

SLIDE 41

Performance on WC Traffic

SLIDE 42

Allocator Design Matters

SLIDE 43

Outline

  • Interconnection Networks (INs) are THE central component of modern computer systems
  • Topology driven to high-radix by packaging technology
  • Global adaptive routing balances load - and enables efficient topologies
  • Case study: the Cray Black Widow
  • On-Chip Interconnection Networks (OCINs) face unique challenges
  • The road ahead…
SLIDE 44

Putting it all together: The Cray BlackWidow Network

In collaboration with Steve Scott and Dennis Abts (Cray Inc.)

SLIDE 45

Cray Black Widow

  • Shared-memory vector parallel computer
  • Up to 32K nodes
  • Vector processor per node
  • Shared memory across nodes
SLIDE 46

Black Widow Topology

  • Up to 32K nodes in a 3-level folded Clos
  • Each node has four 18.75 Gb/s channels, one to each of 4 network slices

SLIDE 47

YARC: Yet Another Router Chip

  • 64 ports
  • Each port is 18.75 Gb/s (3 x 6.25 Gb/s links) - quick arithmetic check below
  • Table-driven routing
  • Fault tolerance
    – CRC with link-level retry
    – Graceful degradation of links: 3 bits -> 2 bits -> 1 bit -> OTS
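As a quick arithmetic check of the port numbers above (the aggregate figure is derived here, not stated on the slide):

```python
lanes_per_port, gbps_per_lane, ports = 3, 6.25, 64
port_gbps = lanes_per_port * gbps_per_lane    # 18.75 Gb/s per port, as stated
total_tbps = ports * port_gbps / 1000         # 1.2 Tb/s aggregate per chip (derived)
print(port_gbps, total_tbps)                  # 18.75  1.2
```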
SLIDE 48

YARC Microarchitecture

  • Regular 8x8 array of tiles
    – Easy to lay out chip
  • No global arbitration
    – All decisions local
  • Simple routing
  • Hierarchical organization (path sketch below)
    – Input buffers
    – Row buffers
    – Column buffers
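A minimal sketch of the path a packet might take through this row/column hierarchy, assuming output port p sits at tile (p // 8, p % 8); the port-to-tile mapping and the buffer names are assumptions for illustration, not taken from the YARC design.

```python
def yarc_path(in_port, out_port, side=8):
    """Buffers a packet passes through in an 8x8 tiled radix-64 switch:
    input buffer -> row buffer in the output's column -> column buffer -> output."""
    r_in, c_in = divmod(in_port, side)      # tile holding the input port
    r_out, c_out = divmod(out_port, side)   # tile holding the output port
    return [
        f"input_buffer[{r_in}][{c_in}]",
        f"row_buffer[{r_in}][{c_out}]",     # travel along row r_in to the output's column
        f"column_buffer[{r_out}][{c_out}]", # travel down column c_out to the output's row
        f"output_port[{out_port}]",
    ]

# e.g. yarc_path(in_port=3, out_port=60) -> buffers in tiles (0,3), (0,4), (7,4), then output 60
```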

SLIDE 49

A Closer Look at a Tile

  • No global arbitration
  • Non-blocking with an 8x internal speedup in the subswitch
  • Simple routing
    – Small 8-entry routing table per tile
    – High routing throughput for small packets

SLIDE 50

YARC Implementation

  • Implemented in a 90nm CMOS standard-cell ASIC technology
  • 192 SerDes on the chip (64 ports x 3 bits per port)
  • 6.25 Gbaud data rate
  • Estimated power: 80 W (idle), 87 W (peak)
  • 17mm x 17mm die
SLIDE 52

Outline

  • Interconnection Networks (INs) are THE central component of modern computer systems
  • Topology driven to high-radix by packaging technology
  • Global adaptive routing balances load - and enables efficient topologies
  • Case study: the Cray Black Widow
  • On-Chip Interconnection Networks (OCINs) face unique challenges
  • The road ahead…
SLIDE 53

Much of the future is on-chip (CMP, SoC, Operand)

[Figure: technology roadmap timeline, 2006-2015]

SLIDE 54

On-Chip Networks are Fundamentally Different

  • Different cost model
    – Wires plentiful, no pin constraints
    – Buffers expensive (consume die area)
    – Slow signal propagation
  • Different usage patterns
    – Particularly for SoCs
      • Significant isochronous traffic
      • Hard real-time constraints
  • Different design problems
    – Floorplans
    – Energy-efficient transmission circuits

SLIDE 55

NSF Workshop Identified 3 Critical Issues

  • Power
    – OCINs will have 10x the required power with current approaches
    – Circuit and architecture innovations can close this gap
  • Latency
    – OCIN latency currently not competitive with buses and dedicated wiring
    – Novel flow-control strategies required
  • Tool Integration
    – OCINs need to be integrated with standard tool flows to enable widespread use

SLIDE 56

The Road Ahead

  • INs become an even more dominant system component
    – Number of processors goes up, cost of processors decreases
    – Communication dominates performance and cost
    – From hand-held media UI devices to huge data centers
  • Technology drives topology in new directions
    – On-chip, short-reach electrical (10m), optical
    – Expect radix to continue to increase
    – Hybrid topologies to match each packaging level
  • Latency will approach that of dedicated wiring
    – Better flow control and router architecture
    – Optimized circuits
  • Adaptivity will optimize performance
    – Balance load, route around defects, tolerate variation, tune power to load

SLIDE 57

Summary

  • Interconnection Networks (INs) are THE central component of modern computing systems
  • High-radix topologies have evolved to exploit packaging/signaling technology
    – Including hybrid optical/electrical
    – Flattened Butterfly
  • Global adaptive routing balances load and enables advanced topologies
    – Eliminate transient load imbalance
    – Use local queues to estimate global congestion
  • Cray Black Widow - an example high-radix network
  • On-Chip INs
    – Very different constraints
    – Three “Gaps” identified - power, latency, tools
  • The road ahead
    – Lots of room for improvement; INs are in their infancy

SLIDE 58

Some very good books

SLIDE 59

Backup

SLIDE 60

Virtual Channel Router Architecture

[Figure: canonical virtual-channel router - k input ports (Input 1 .. Input k), each with virtual-channel buffers VC 1 .. VC v, a routing computation unit, a VC allocator, a switch allocator, and a crossbar switch driving Output 1 .. Output k]
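The blocks in this figure correspond to the classic per-flit router pipeline: routing computation (RC), virtual-channel allocation (VA), switch allocation (SA), then switch traversal (ST). A minimal state sketch of that organization, with k ports and v VCs per input as in the figure; the data-structure layout is an illustration, not a specific router's implementation.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional

class Stage(Enum):
    RC = auto()   # routing computation: pick the output port
    VA = auto()   # virtual-channel allocation: acquire a VC on that output
    SA = auto()   # switch allocation: win a crossbar time slot
    ST = auto()   # switch traversal: the flit crosses the crossbar

@dataclass
class InputVC:
    """State kept per (input port, virtual channel) - the 'VC 1 .. VC v' boxes in the figure."""
    flits: deque = field(default_factory=deque)
    stage: Stage = Stage.RC
    out_port: Optional[int] = None   # filled in by RC
    out_vc: Optional[int] = None     # filled in by VA

def make_router_inputs(num_ports: int, num_vcs: int):
    """One router's input state: k ports (Input 1 .. Input k), each with v VC buffers."""
    return [[InputVC() for _ in range(num_vcs)] for _ in range(num_ports)]
```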

SLIDE 61

Baseline Performance Evaluation

[Plot: latency (cycles) vs. offered load for the low-radix network]

SLIDE 62

Baseline Performance Evaluation

[Plot: latency (cycles) vs. offered load - low-radix network vs. baseline (high-radix) network]

Low radix better