Hermes-A: An Asynchronous NoC Router with Distributed Routing

SLIDE 1

Hermes-A: An Asynchronous NoC Router with Distributed Routing

Julian Pontes, Matheus Moreira, Fernando Moraes, Ney Calazans

SLIDE 2

Outline

  • Introduction
  • Related Work
  • Architecture
    – Input Port
      • Path Calculation
    – Output Port
      • Output Control
  • Results
  • Future Work
  • Conclusions

SLIDE 3

Introduction

  • Dual Rail Encoding
  • Four Phase Protocol
  • DIMS Logic

[Figure: pipeline of registers (Reg) and logic blocks with completion detectors (CD)]
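The dual-rail encoding and four-phase protocol named on this slide can be sketched in a few lines of Python. This is a behavioural illustration under textbook conventions, not the Hermes-A circuit: each bit is carried on a (true_rail, false_rail) pair, (0, 0) is the spacer, and one transfer is data, ack up, spacer, ack down. The names `encode_bit`, `decode_bit`, `completion_detect` and `four_phase_trace` are hypothetical.

```python
from typing import List, Optional, Tuple

Rail = Tuple[int, int]      # (true_rail, false_rail)
SPACER: Rail = (0, 0)       # "no data" codeword between two values


def encode_bit(bit: int) -> Rail:
    """Dual-rail encode one bit: 1 -> (1, 0), 0 -> (0, 1)."""
    return (1, 0) if bit else (0, 1)


def decode_bit(rail: Rail) -> Optional[int]:
    """Return the bit, or None for the spacer; (1, 1) is illegal."""
    if rail == (1, 1):
        raise ValueError("illegal dual-rail codeword (1, 1)")
    if rail == SPACER:
        return None
    return 1 if rail == (1, 0) else 0


def completion_detect(word: List[Rail]) -> Optional[bool]:
    """True when every bit is valid, False when every bit is spacer,
    None while the word is still in transition."""
    valid = [r != SPACER for r in word]
    if all(valid):
        return True
    if not any(valid):
        return False
    return None


def four_phase_trace(bits: List[int]) -> List[str]:
    """List the four phases of a return-to-zero handshake for one word."""
    word = [encode_bit(b) for b in bits]
    spacer = [SPACER] * len(bits)
    trace = [f"1 sender drives data   {word}"]
    assert completion_detect(word) is True      # receiver sees valid data
    trace.append("2 receiver raises ack")
    trace.append(f"3 sender drives spacer {spacer}")
    assert completion_detect(spacer) is False   # receiver sees the spacer
    trace.append("4 receiver lowers ack")
    return trace


if __name__ == "__main__":
    for step in four_phase_trace([1, 0, 1]):
        print(step)
```

The completion detector is what replaces the clock: the receiver acknowledges only once every bit pair has left the spacer state, so the transfer is insensitive to wire delays.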

SLIDE 4

Introduction

  • Asynchronous Circuits
    – Less Simultaneous Switching ☺
      • Less EMI
      • Less IR Drop (Slight Power Plan)
      • Less Peak Power (No Decap Cells)
      • Less Crosstalk Problems in Data Links ??? (DI Codes - Four Phase) (Partial Shielding in data links)
    – Average Case Delay ☺
    – Reduced Dynamic Power (5 times, 65 nm comparison) ☺

SLIDE 5

Introduction

  • Asynchronous Circuits
    – Area and Leakage Overhead (~5-3 times more, 65 nm)
    – Lack of CAD Tools and Standards
      • Synthesis Tools
        – Traditional Tools (~45 thousand loop breakers in a 3x3 NoC)
        – Asynchronous Synthesis Tools (Balsa, Teak)
          » Lack of traditional optimizations (Pin Swapping, Reordering, Retiming, …)
      • STA
        – Liberty File Support (is_async_reg)
        – New Set of Constraints (Cycle Time Definition)

SLIDE 6

Introduction

  • Networks on Chip
    – Offer large communication parallelism
    – Can provide alternate paths
  • Asynchronous Network on Chip
    – Enable the Design of Complex GALS Systems on Chip

SLIDE 7

Objective

  • Design an asynchronous router architecture capable of supporting the design of GALS Systems
    – High Throughput
    – Low Power
    – Permit the implementation of fine-grained power control
      » MVS
      » Power Shut-Off

SLIDE 8

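Related Work

Characteristics of each NoC:

  • As. QNoC
    – Topology: 2D Mesh (Irreg/Reg), 8 VCs
    – Routing / Flow Control: Source / wormhole / credit-based with preemption
    – Network Interface: N.A.
    – Asynchronous Style: 4-phase bundled-data
    – Links and Encoding: 10-bit flits
    – Implementation: 180 nm, 200 Mflits/s, ASIC
  • RasP
    – Topology: Framework / point-to-point (Irreg/Reg)
    – Routing / Flow Control: Source / bit serial
    – Network Interface: Ad hoc
    – Asynchronous Style: QDI
    – Links and Encoding: Point-to-point pipelined serial links
    – Implementation: 180 nm, 700 Mb/s
  • ASPIN
    – Topology: 2D Mesh (Reg)
    – Routing / Flow Control: Distributed XY / wormhole / EOP
    – Network Interface: A2S, S2A FIFOs
    – Asynchronous Style: Bundled-data / QDI
    – Links and Encoding: Dual-rail, 4-phase, 34-bit flits
    – Implementation: 90 nm, 714 Mflits/s
  • ANoC
    – Topology: 2D Mesh, 2 VCs
    – Routing / Flow Control: Source / Adaptive
    – Network Interface: -
    – Asynchronous Style: QDI
    – Links and Encoding: One of Four
    – Implementation: 130 nm, 5 Gb/s (router)
  • Hermes-A
    – Topology: 2D Mesh
    – Routing / Flow Control: Distributed XY / wormhole / BOP-EOP
    – Network Interface: Dual-Rail SCAFFI, Clock Stretching
    – Asynchronous Style: QDI
    – Links and Encoding: Dual-Rail
    – Implementation: 180 nm, 727 Mbits/s (454 Mflits/s, 3.6 Gb/s per router), ASIC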

SLIDE 15

Router Architecture

  • Distributed Routing
  • Independent Ports
  • Dual Rail Encoding
  • Weak Conditioned Half Buffer
  • DIMS Logic
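As an illustration of what DIMS logic on dual-rail signals means, the sketch below models a two-input AND gate in Python: one Muller C-element per input minterm, with the minterm outputs ORed onto the output rails. It is a behavioural model under standard textbook definitions, not the Hermes-A netlist; the class names `CElement` and `DimsAnd` and the (true_rail, false_rail) ordering are assumptions made for the example.

```python
from typing import Tuple

Rail = Tuple[int, int]  # (true_rail, false_rail); (0, 0) is the spacer


class CElement:
    """Muller C-element: the output follows the inputs when they agree,
    otherwise it holds its previous value."""

    def __init__(self) -> None:
        self.out = 0

    def update(self, a: int, b: int) -> int:
        if a == b:
            self.out = a
        return self.out


class DimsAnd:
    """Dual-rail AND gate, DIMS style: one C-element per input minterm."""

    def __init__(self) -> None:
        # Minterm order: (a=0,b=0), (a=0,b=1), (a=1,b=0), (a=1,b=1)
        self.minterms = [CElement() for _ in range(4)]

    def update(self, a: Rail, b: Rail) -> Rail:
        a_t, a_f = a
        b_t, b_f = b
        m00 = self.minterms[0].update(a_f, b_f)
        m01 = self.minterms[1].update(a_f, b_t)
        m10 = self.minterms[2].update(a_t, b_f)
        m11 = self.minterms[3].update(a_t, b_t)
        # AND truth table: only minterm 11 drives the true rail,
        # the other three minterms drive the false rail.
        return (m11, m00 | m01 | m10)


if __name__ == "__main__":
    gate = DimsAnd()
    print(gate.update((1, 0), (0, 0)))  # only a valid: output stays spacer
    print(gate.update((1, 0), (0, 1)))  # a=1, b=0 both valid: output is 0
    print(gate.update((0, 0), (0, 1)))  # a back to spacer: output still held
    print(gate.update((0, 0), (0, 0)))  # both spacer: output returns to spacer
```

The output becomes valid only after both inputs are valid and returns to the spacer only after both inputs have returned to it, which is the behaviour a completion detector downstream relies on.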

SLIDE 16

Input Port

  • Packet
    – First Flit contains the address
    – BOP and EOP delimiters
    – Three main paths
      • First Flit (1), Last Flit (3) and other Flits (2)
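The packet framing above can be sketched as follows. The flag encoding is an assumption made for the example (BOP = 1 on the first flit only, EOP = 1 on the last flit only), as are the names `Flit`, `input_path` and `make_packet`; the combination BOP = EOP = 1 is deliberately left out of this data-flit model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Flit:
    payload: int  # flit payload (masked to 8 bits in this sketch)
    bop: int      # begin-of-packet delimiter
    eop: int      # end-of-packet delimiter


def input_path(flit: Flit) -> int:
    """Return the path number: 1 = first flit, 2 = other flits, 3 = last flit."""
    if flit.bop and not flit.eop:
        return 1
    if flit.eop and not flit.bop:
        return 3
    if not flit.bop and not flit.eop:
        return 2
    raise ValueError("BOP = EOP = 1 is not a data flit in this sketch")


def make_packet(address: int, body: List[int]) -> List[Flit]:
    """Frame a packet: address flit first, payload flits, EOP on the last one."""
    if not body:
        raise ValueError("sketch assumes at least one payload flit")
    flits = [Flit(address & 0xFF, bop=1, eop=0)]
    flits += [Flit(b & 0xFF, bop=0, eop=0) for b in body[:-1]]
    flits.append(Flit(body[-1] & 0xFF, bop=0, eop=1))
    return flits


if __name__ == "__main__":
    packet = make_packet(0x21, [10, 20, 30])
    print([input_path(f) for f in packet])
```

Only the first flit needs the routing logic; the remaining flits follow the path it opened, which is the usual wormhole arrangement.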

SLIDE 17

Input Port

[Figure: input port block diagram (surviving labels: 4, 10, 14, K)]

SLIDE 18

Path Calculation

  • All logic employs Delay Insensitive Minterm Synthesis (DIMS)
  • First Flit contains the XY address
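A minimal Python sketch of the XY decision each router makes from the address in the first flit: move along X until the column matches, then along Y, then deliver locally. The port names, the coordinate orientation (Y grows towards NORTH) and the 4-bit X / 4-bit Y address layout are assumptions for the example; the slides do not give the bit layout, and the real path calculation is DIMS logic rather than software.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (x, y) position in the 2D mesh


def split_address(flit: int) -> Coord:
    """Assumed layout: X in the high nibble, Y in the low nibble."""
    return ((flit >> 4) & 0xF, flit & 0xF)


def xy_route(current: Coord, dest: Coord) -> str:
    """Choose the output port at one router: X first, then Y."""
    cx, cy = current
    dx, dy = dest
    if dx > cx:
        return "EAST"
    if dx < cx:
        return "WEST"
    if dy > cy:
        return "NORTH"
    if dy < cy:
        return "SOUTH"
    return "LOCAL"


def trace_path(src: Coord, dest: Coord) -> List[str]:
    """Follow the per-router decisions from source to destination."""
    step = {"EAST": (1, 0), "WEST": (-1, 0), "NORTH": (0, 1), "SOUTH": (0, -1)}
    hops: List[str] = []
    pos = src
    while True:
        port = xy_route(pos, dest)
        hops.append(port)
        if port == "LOCAL":
            return hops
        pos = (pos[0] + step[port][0], pos[1] + step[port][1])


if __name__ == "__main__":
    print(trace_path((0, 0), split_address(0x21)))
```

Because every router repeats the same local decision, no routing table or source route is needed, which is what "distributed" refers to.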

SLIDE 31

Input Port

S-Control

SLIDE 32

S-Element - Enclosure

  • Starts with a handshake at the input port
  • Performs two handshakes
    – First to send the last flit
    – Second to close the communication session at the output port (EOP = BOP = 1)
  • Speed Independent Design
    – Circuit generated with Petrify

SLIDE 33

S-Control

[Figure: input S-Control connected to the S-Control of output ports A and B; labels: LAST FLIT, BOP = EOP = 1, Ack A, Ack B, Input Ack]

SLIDE 34

Output Port

  • Arbitration
  • Kill Section

SLIDE 35

Output Port

SLIDE 36

Synchronization Scaffi Interface – ICCD 2007

[Figure: asynchronous router with In/Out ports for North, South, East, West and Local; the Local ports connect through the network interface (Bop, Eop, Ack, DR Data) to the IP]

  • Clock Stretching Technique
  • Single to Dual at IP Output Port
  • Dual to Single at IP Input Port

SLIDE 37

Results

  • 8-bit flits Router ASIC Implementation
  • XFab 180nm technology
  • Typical Operating Conditions: 25°C, 1.8 Volts - Typical Transistor Models
  • Power results obtained with all router input and output ports operating at the highest rate

SLIDE 38

Results

  • Throughput per link: 727 Mbits/s
  • Area (cell / total): 0.21 mm² / 0.33 mm²
  • Total Power: 11.14 mW
  • Total Router Throughput = 5 * 727 Mbits/s = 3.6 Gbits/s
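The throughput figures can be cross-checked with a few lines of arithmetic. The inputs below are the numbers reported in this deck (727 Mbits/s per link, a factor of 5, 8-bit flits); the variable names are only illustrative.

```python
LINK_THROUGHPUT_MBITS = 727  # per link, from this slide
PORTS = 5                    # multiplier used on this slide
FLIT_BITS = 8                # 8-bit flits, from the previous Results slide

# Aggregate bit rate of one router, in Gbits/s
router_gbits = PORTS * LINK_THROUGHPUT_MBITS / 1000

# Flit rates obtained by dividing the bit rate by the flit width
link_mflits = LINK_THROUGHPUT_MBITS / FLIT_BITS
router_mflits = PORTS * link_mflits

if __name__ == "__main__":
    print(f"router throughput: {router_gbits:.3f} Gbits/s")
    print(f"per-link flit rate: {link_mflits:.3f} Mflits/s")
    print(f"router flit rate:   {router_mflits:.3f} Mflits/s")
```

The product is 3.635 Gbits/s, which the slide rounds to 3.6 Gbits/s, and the router flit rate of about 454 Mflits/s agrees with the Hermes-A entry in the Related Work comparison.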

SLIDE 39

Future Work

  • Adaptive Algorithms
  • Fine-Grained Low Power Control over Ports
  • Comparing Asynchronous Implementations
    – C-elements: Muller x Symmetric x Conventional
    – Weak Conditioned Half Buffer x PreCharge Half Buffer x PreCharge Full Buffer
    – Completion Detection
    – Dynamic Logic at Path Calculation
    – DI Codes

SLIDE 40

Conclusions

  • Hermes-A → fast asynchronous router
  • Described with a Fully Configurable RTL Code (NoC size, Flit Size, Buffer Size, …)
  • Fully implemented with conventional CAD tools (Cadence framework)
  • Independent Ports ideal for traffic and power adaptation

SLIDE 41

Questions???
