SLIDE 1

Interconnect-Centric Computing

William J. Dally Computer Systems Laboratory Stanford University HPCA Keynote February 12, 2007

SLIDE 2

Outline

  • Interconnection Networks (INs) are THE central component of modern computer systems
  • Topology driven to high-radix by packaging technology
  • Global adaptive routing balances load - and enables efficient topologies
  • Case study: the Cray Black Widow
  • On-Chip Interconnection Networks (OCINs) face unique challenges
  • The road ahead…
SLIDE 4

INs: Connect Processors in Clusters

IBM Blue Gene

SLIDE 5

and on chip

MIT RAW

SLIDE 6

Connect Processors to Memories in Systems

Cray Black Widow

SLIDE 7

and on chip

Texas TRIPS

SLIDE 8

provide the fabric for network Switches and Routers

Avici TSR

SLIDE 9

and connect I/O Devices

Brocade Switch

SLIDE 10

Group History: Routing Chips & Interconnection Networks

  • Mars Router, Torus Routing Chip, Network Design Frame, Reliable Router
  • Basis for Intel, Cray/SGI, Mercury, Avici network chips

[Images: MARS Router (1984), Torus Routing Chip (1985), Network Design Frame (1988), Reliable Router (1994)]

SLIDE 11

Group History: Parallel Computer Systems

  • J-Machine (MDP) led to Cray T3D/T3E
  • M-Machine (MAP)

– Fast messaging, scalable processing nodes, scalable memory architecture

  • Imagine – basis for SPI

[Images: MDP Chip, J-Machine, Cray T3D, MAP Chip, Imagine Chip]

SLIDE 12

Interconnection Networks are THE Central Component of Modern Computer Systems

  • Processors are a commodity
    – Performance no longer scaling (ILP mined out)
    – Future growth is through CMPs - connected by INs
  • Memory is a commodity
    – Memory system performance determined by interconnect
  • I/O systems are largely interconnect
  • Embedded systems built using SoCs
    – Standard components
    – Connected by on-chip INs (OCINs)

SLIDE 13

Outline

  • Interconnection Networks (INs) are THE central component of modern computer systems
  • Topology driven to high-radix by packaging technology
  • Global adaptive routing balances load - and enables efficient topologies
  • Case study: the Cray Black Widow
  • On-Chip Interconnection Networks (OCINs) face unique challenges
  • The road ahead…
SLIDE 14

Technology Trends…

[Plot: bandwidth per router node (Gb/s, log scale 0.1-10,000) vs. year, 1985-2010. Points: Torus Routing Chip, Intel iPSC/2, J-Machine, CM-5, Intel Paragon XP, Cray T3D, MIT Alewife, IBM Vulcan, Cray T3E, SGI Origin 2000, AlphaServer GS320, IBM SP Switch2, Quadrics QsNet, Cray X1, Velio 3003, IBM HPS, SGI Altix 3000, Cray XT3, YARC/BlackWidow]

SLIDE 15

High-Radix Router


SLIDE 16

High-Radix Router

[Figure: low-radix router (small number of fat ports) vs. high-radix router (large number of skinny ports)]

SLIDE 17

Low-Radix vs. High-Radix Router

[Figure: a 16-input, 16-output network built from low-radix routers vs. one built from high-radix routers]

            Low-Radix     High-Radix
Latency:    4 hops        2 hops
Cost:       96 channels   32 channels

SLIDE 18

Latency

Latency = H·t_r + L/b = 2·t_r·log_k(N) + 2·k·L/B

where k = radix, B = total router bandwidth, N = # of nodes, L = message size, H = hop count, t_r = router delay per hop, b = per-port bandwidth
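As a quick illustration of how this formula trades header latency against serialization latency, here is a small Python sketch. The parameter values are placeholders chosen only to show the shape of the trade-off; they are not the numbers behind the 2003/2010 curves on the next slide.

```python
import math

def latency_ns(k, N, B, L, t_r):
    """Slide formula: T = H*t_r + L/b = 2*t_r*log_k(N) + 2*k*L/B."""
    header = 2 * t_r * math.log(N, k)   # hop count H = 2*log_k(N), t_r per router hop
    serialization = 2 * k * L / B       # reading b = B/(2k) per port, so L/b = 2*k*L/B
    return header + serialization

# Placeholder parameters: N = 1024 nodes, t_r = 20 ns per hop,
# B = 1000 Gb/s total router bandwidth, L = 1000-bit message (so L/B is in ns).
for k in (8, 16, 32, 64, 128, 256):
    print(k, round(latency_ns(k, N=1024, B=1000, L=1000, t_r=20), 1))
```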

SLIDE 19

Latency vs. Radix

[Plot: latency (nsec) vs. radix for 2003 and 2010 technology. Optimal radix ~40 for 2003, ~128 for 2010. As radix increases, serialization latency increases while header latency decreases.]

SLIDE 20

Determining Optimal Radix

Latency = Header Latency + Serialization Latency = H·t_r + L/b = 2·t_r·log_k(N) + 2·k·L/B

The optimal radix k satisfies:

k·log2(k) = (B·t_r·log(N)) / L = Aspect Ratio

where k = radix, B = total router bandwidth, N = # of nodes, L = message size
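A small numeric sketch of solving this implicit equation for k by bisection. The log base used for N and the technology parameters are assumptions for illustration, not the 1991-2010 design points plotted on the next slide.

```python
import math

def optimal_radix(B, t_r, N, L):
    """Solve k * log2(k) = A for k by bisection, where A = (B * t_r * log2(N)) / L.
    The base-2 log for N is an assumption; the slide just writes 'log N'."""
    A = B * t_r * math.log2(N) / L       # the aspect ratio
    lo, hi = 2.0, 4096.0
    for _ in range(100):                 # k*log2(k) is increasing in k, so bisection works
        mid = (lo + hi) / 2
        if mid * math.log2(mid) < A:
            lo = mid
        else:
            hi = mid
    return A, (lo + hi) / 2

# With the same placeholder parameters as before (B=1000 Gb/s, t_r=20 ns, N=1024, L=1000 bits),
# this gives an aspect ratio of 200 and an optimal radix in the neighborhood of 40.
```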

SLIDE 21

Higher Aspect Ratio, Higher Optimal Radix

[Plot: optimal radix (k) vs. aspect ratio, log-log scale, with technology points for 1991, 1996, 2003, and 2010]

SLIDE 22

High-Radix Topology

  • Use high radix, k, to get low hop count
    – H = log_k(N) (quick check below)
  • Provide good performance on both benign and adversarial traffic patterns
    – Rules out butterfly networks - no path diversity
    – Clos networks work well
      • H = 2·log_k(N) - with short circuit
    – Cayley graphs have nice properties but are hard to route
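A quick check of the two hop-count expressions, using the radix-64, 1024-endpoint Clos of the next slide as the example (rounding up to whole hops is added here for illustration):

```python
import math

k, N = 64, 1024                               # radix-64 routers, 1024 endpoints (next slide's example)
h_butterfly = math.ceil(math.log(N, k))       # H = log_k(N): 2 hops
h_clos_worst = 2 * math.ceil(math.log(N, k))  # H = 2*log_k(N): 4 in the worst case, fewer with the short circuit
print(h_butterfly, h_clos_worst)              # -> 2 4
```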

SLIDE 23

Example radix-64 Clos Network

[Figure: radix-64 folded Clos for 1024 endpoints - rank 1 switches Y0-Y31 each connect 32 endpoints (BW0-BW1023), rank 2 switches Y32-Y63 interconnect the rank 1 switches]

SLIDE 24

Flattened Butterfly Topology

SLIDE 25

Packaging the Flattened Butterfly

SLIDE 26

Packaging the Flattened Butterfly (2)

SLIDE 27

Cost

SLIDE 28

Outline

  • Interconnection Networks (INs) are THE central component of modern computer systems
  • Topology driven to high-radix by packaging technology
  • Global adaptive routing balances load - and enables efficient topologies
  • Case study: the Cray Black Widow
  • On-Chip Interconnection Networks (OCINs) face unique challenges
  • The road ahead…
SLIDE 29

Routing in High-Radix Networks

  • Adaptive routing avoids transient load imbalance
  • Global adaptive routing balances load for adversarial traffic
    – Cost/perf of a butterfly on benign traffic and at low loads
    – Cost/perf of a Clos on adversarial traffic

SLIDE 30

A Clos can statically load balance traffic using oblivious routing

[Figure: the same radix-64 Clos - rank 1 switches Y0-Y31 with endpoints BW0-BW1023, rank 2 switches Y32-Y63; traffic is spread evenly over the rank 2 switches. A routing sketch follows below.]
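One common way to realize this kind of oblivious load balancing is to pick the intermediate rank-2 switch uniformly at random, independent of the destination, so every flow is spread evenly over the middle stage. A minimal sketch under that reading; the switch counts are read off the figure and should be treated as illustrative, not as the exact BlackWidow configuration.

```python
import random

def oblivious_route(src_node, dst_node, nodes_per_rank1=32, num_rank1=32):
    """Spread traffic by picking the rank-2 switch uniformly at random (destination-independent).
    Switch naming follows the figure (rank 1: Y0-Y31, rank 2: Y32-Y63)."""
    src_sw = src_node // nodes_per_rank1           # rank-1 switch of the source (e.g. BW0-BW31 -> Y0)
    dst_sw = dst_node // nodes_per_rank1           # rank-1 switch of the destination
    if src_sw == dst_sw:
        return [f"Y{src_sw}"]                      # same rank-1 switch: no need to go up to rank 2
    mid = num_rank1 + random.randrange(32)         # random rank-2 switch Y32..Y63, chosen obliviously
    return [f"Y{src_sw}", f"Y{mid}", f"Y{dst_sw}"]

# e.g. oblivious_route(5, 1000) might return ['Y0', 'Y47', 'Y31']
```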

SLIDE 31

Transient Imbalance

SLIDE 32

With Adaptive Routing

SLIDE 33

Latency for UR traffic

SLIDE 34

Flattened Butterfly Topology

[Figure: flattened butterfly, nodes 0-7]

SLIDE 35

Flattened Butterfly Topology

[Figure: flattened butterfly, nodes 0-7]

What if node 0 sends all of its traffic to node 1?

SLIDE 36

Flattened Butterfly Topology

[Figure: flattened butterfly, nodes 0-7]

What if node 0 sends all of its traffic to node 1? How much traffic should we route over alternate paths?

SLIDE 37

Simpler Case: a ring of 8 nodes; send traffic λ from node 2 to node 5

  • Model: assume the queues form a network of independent M/D/1 queues

[Figure: ring of nodes 0-7; the minimal path from 2 to 5 carries traffic x1, the non-minimal path carries x2]

λ = x1 + x2

Min path delay = Dm(x1), non-min path delay = Dnm(x2)

  • Routing remains minimal as long as Dm′(λ) ≤ Dnm′(0)
  • Afterwards, route a fraction x2 non-minimally such that Dm′(x1) = Dnm′(x2)
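A numeric sketch of this balance condition, assuming unit-rate M/D/1 queues, a 3-hop minimal path and a 5-hop non-minimal path for the 2-to-5 traffic, and finite-difference derivatives; these modeling choices are assumptions made here for illustration.

```python
def md1_delay(rho, mu=1.0):
    """Mean sojourn time of an M/D/1 queue at utilization rho (service rate mu)."""
    return 1.0 / mu + rho / (2.0 * mu * (1.0 - rho))

def path_delay(x, hops):
    """Delay of a path whose hops are independent M/D/1 queues, each carrying load x."""
    return hops * md1_delay(x)

def nonminimal_fraction(total, h_min=3, h_nonmin=5, eps=1e-6):
    """Find x2 such that D_m'(x1) = D_nm'(x2) with x1 + x2 = total, per the slide."""
    deriv = lambda f, x: (f(x + eps) - f(x)) / eps            # forward-difference derivative
    d_min = lambda x: path_delay(x, h_min)
    d_nonmin = lambda x: path_delay(x, h_nonmin)
    if deriv(d_min, total) <= deriv(d_nonmin, 0.0):           # D_m'(total) <= D_nm'(0): stay minimal
        return 0.0
    lo, hi = 0.0, total
    for _ in range(60):                                       # bisection on x2
        x2 = (lo + hi) / 2
        if deriv(d_min, total - x2) > deriv(d_nonmin, x2):
            lo = x2                                           # minimal path still hurts more at the margin
        else:
            hi = x2
    return (lo + hi) / 2

# e.g. nonminimal_fraction(0.2) ~ 0 (stays minimal); nonminimal_fraction(0.5) ~ 0.15
```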

SLIDE 38

Traffic divides to balance delay; load is balanced at saturation

[Plot: accepted throughput vs. offered load (fraction of capacity), showing the model's overall, minimal, and non-minimal traffic]

SLIDE 39

Channel-Queue Routing

  • Estimate delay per hop by the local queue length Q_i
  • Overall latency estimated by L_i ~ Q_i·H_i
  • Route each packet on the route with the lowest estimated L_i (see the sketch below)
  • Works extremely well in practice
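A minimal sketch of that decision rule, a UGAL-style comparison between one minimal and one non-minimal candidate route; the queue readings in the example call are made up for illustration.

```python
def choose_route(q_min, h_min, q_nonmin, h_nonmin):
    """Pick the route with the lowest estimated latency L_i ~ Q_i * H_i."""
    est_min = q_min * h_min              # local queue length x hop count, minimal route
    est_nonmin = q_nonmin * h_nonmin     # same estimate for the non-minimal candidate
    return "minimal" if est_min <= est_nonmin else "non-minimal"

# e.g. choose_route(q_min=12, h_min=2, q_nonmin=3, h_nonmin=4) -> "non-minimal"
```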
SLIDE 40

Performance on UR Traffic

SLIDE 41

Performance on WC Traffic

SLIDE 42

Allocator Design Matters

SLIDE 43

Outline

  • Interconnection Networks (INs) are THE central component of modern computer systems
  • Topology driven to high-radix by packaging technology
  • Global adaptive routing balances load - and enables efficient topologies
  • Case study: the Cray Black Widow
  • On-Chip Interconnection Networks (OCINs) face unique challenges
  • The road ahead…
SLIDE 44

Putting it all together: The Cray BlackWidow Network

In collaboration with Steve Scott and Dennis Abts (Cray Inc.)

SLIDE 45

Cray Black Widow

  • Shared-memory vector parallel computer
  • Up to 32K nodes
  • Vector processor per node
  • Shared memory across nodes
SLIDE 46

Black Widow Topology

  • Up to 32K nodes in a 3-level folded Clos
  • Each node has four 18.75 Gb/s channels, one to each of 4 network slices

SLIDE 47

YARC: Yet Another Router Chip

  • 64 ports
  • Each port is 18.75 Gb/s (3 x 6.25 Gb/s links) - quick arithmetic check below
  • Table-driven routing
  • Fault tolerance
    – CRC with link-level retry
    – Graceful degradation of links: 3 bits -> 2 bits -> 1 bit -> OTS
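As a quick arithmetic check of the port numbers above (the aggregate figure is derived here, not stated on the slide):

```python
lanes_per_port, gbps_per_lane, ports = 3, 6.25, 64
port_gbps = lanes_per_port * gbps_per_lane    # 18.75 Gb/s per port, as stated
total_tbps = ports * port_gbps / 1000         # 1.2 Tb/s aggregate per chip (derived)
print(port_gbps, total_tbps)                  # 18.75  1.2
```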
SLIDE 48

YARC Microarchitecture

  • Regular 8x8 array of tiles
    – Easy to lay out chip
  • No global arbitration
    – All decisions local
  • Simple routing
  • Hierarchical organization (path sketch below)
    – Input buffers
    – Row buffers
    – Column buffers
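A minimal sketch of the path a packet might take through this row/column hierarchy, assuming output port p sits at tile (p // 8, p % 8); the port-to-tile mapping and the buffer names are assumptions for illustration, not taken from the YARC design.

```python
def yarc_path(in_port, out_port, side=8):
    """Buffers a packet passes through in an 8x8 tiled radix-64 switch:
    input buffer -> row buffer in the output's column -> column buffer -> output."""
    r_in, c_in = divmod(in_port, side)      # tile holding the input port
    r_out, c_out = divmod(out_port, side)   # tile holding the output port
    return [
        f"input_buffer[{r_in}][{c_in}]",
        f"row_buffer[{r_in}][{c_out}]",     # travel along row r_in to the output's column
        f"column_buffer[{r_out}][{c_out}]", # travel down column c_out to the output's row
        f"output_port[{out_port}]",
    ]

# e.g. yarc_path(in_port=3, out_port=60) -> buffers in tiles (0,3), (0,4), (7,4), then output 60
```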

SLIDE 49

A Closer Look at a Tile

  • No global arbitration
  • Non-blocking with an 8x internal speedup in the subswitch
  • Simple routing
    – Small 8-entry routing table per tile
    – High routing throughput for small packets

SLIDE 50

YARC Implementation

  • Implemented in a 90nm CMOS standard-cell ASIC technology
  • 192 SerDes on the chip (64 ports x 3 bits per port)
  • 6.25 Gbaud data rate
  • Estimated power: 80 W (idle), 87 W (peak)
  • 17mm x 17mm die
SLIDE 52

Outline

  • Interconnection Networks (INs) are THE central component of modern computer systems
  • Topology driven to high-radix by packaging technology
  • Global adaptive routing balances load - and enables efficient topologies
  • Case study: the Cray Black Widow
  • On-Chip Interconnection Networks (OCINs) face unique challenges
  • The road ahead…
SLIDE 53

Much of the future is on-chip (CMP, SoC, Operand)

[Figure: technology roadmap timeline, 2006-2015]

SLIDE 54

On-Chip Networks are Fundamentally Different

  • Different cost model
    – Wires plentiful, no pin constraints
    – Buffers expensive (consume die area)
    – Slow signal propagation
  • Different usage patterns
    – Particularly for SoCs
      • Significant isochronous traffic
      • Hard real-time constraints
  • Different design problems
    – Floorplans
    – Energy-efficient transmission circuits

SLIDE 55

NSF Workshop Identified 3 Critical Issues

  • Power
    – OCINs will have 10x the required power with current approaches
    – Circuit and architecture innovations can close this gap
  • Latency
    – OCIN latency currently not competitive with buses and dedicated wiring
    – Novel flow-control strategies required
  • Tool Integration
    – OCINs need to be integrated with standard tool flows to enable widespread use

SLIDE 56

The Road Ahead

  • INs become an even more dominant system component
    – Number of processors goes up, cost of processors decreases
    – Communication dominates performance and cost
    – From hand-held media UI devices to huge data centers
  • Technology drives topology in new directions
    – On-chip, short-reach electrical (10m), optical
    – Expect radix to continue to increase
    – Hybrid topologies to match each packaging level
  • Latency will approach that of dedicated wiring
    – Better flow control and router architecture
    – Optimized circuits
  • Adaptivity will optimize performance
    – Balance load, route around defects, tolerate variation, tune power to load

SLIDE 57

Summary

  • Interconnection Networks (INs) are THE central component of modern computing systems
  • High-radix topologies have evolved to exploit packaging/signaling technology
    – Including hybrid optical/electrical
    – Flattened Butterfly
  • Global adaptive routing balances load and enables advanced topologies
    – Eliminate transient load imbalance
    – Use local queues to estimate global congestion
  • Cray Black Widow - an example high-radix network
  • On-Chip INs
    – Very different constraints
    – Three “Gaps” identified - power, latency, tools
  • The road ahead
    – Lots of room for improvement; INs are in their infancy

SLIDE 58

Some very good books

SLIDE 59

Backup

SLIDE 60

Virtual Channel Router Architecture

[Figure: canonical virtual-channel router - k input ports (Input 1 .. Input k), each with virtual-channel buffers VC 1 .. VC v, a routing computation unit, a VC allocator, a switch allocator, and a crossbar switch driving Output 1 .. Output k]
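The blocks in this figure correspond to the classic per-flit router pipeline: routing computation (RC), virtual-channel allocation (VA), switch allocation (SA), then switch traversal (ST). A minimal state sketch of that organization, with k ports and v VCs per input as in the figure; the data-structure layout is an illustration, not a specific router's implementation.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional

class Stage(Enum):
    RC = auto()   # routing computation: pick the output port
    VA = auto()   # virtual-channel allocation: acquire a VC on that output
    SA = auto()   # switch allocation: win a crossbar time slot
    ST = auto()   # switch traversal: the flit crosses the crossbar

@dataclass
class InputVC:
    """State kept per (input port, virtual channel) - the 'VC 1 .. VC v' boxes in the figure."""
    flits: deque = field(default_factory=deque)
    stage: Stage = Stage.RC
    out_port: Optional[int] = None   # filled in by RC
    out_vc: Optional[int] = None     # filled in by VA

def make_router_inputs(num_ports: int, num_vcs: int):
    """One router's input state: k ports (Input 1 .. Input k), each with v VC buffers."""
    return [[InputVC() for _ in range(num_vcs)] for _ in range(num_ports)]
```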

SLIDE 61

Baseline Performance Evaluation

[Plot: latency (cycles) vs. offered load for the low-radix network]

SLIDE 62

Baseline Performance Evaluation

[Plot: latency (cycles) vs. offered load - low-radix network vs. baseline (high-radix) network]

Low radix better