

slide-1
SLIDE 1

Physical Implementation of the DSPIN Network-on-Chip in the FAUST Architecture

Ivan Miro-Panades (1,2,3), Fabien Clermidy (3), Pascal Vivet (3), Alain Greiner (1)

(1) The University of Pierre et Marie Curie, Paris, France
(2) STMicroelectronics, Crolles, France
(3) CEA-Leti, MINATEC, Grenoble, France

Ivan MIRO PANADES – NOCS 2008

slide-2
SLIDE 2

Outline

  • Motivation
  • FAUST architecture
  • Migration of DSPIN into FAUST
  • DSPIN implementation
  • Networks comparison


slide-3
SLIDE 3

Motivation

Physically implement the DSPIN NoC into the FAUST application platform

  • DSPIN is a NoC developed jointly by Lip6 and ST
  • FAUST is a stream-oriented application platform for 4G telecom applications, based on ANOC, developed by CEA-Leti

Compare the performance of ANOC and DSPIN

  • On a real application and traffic


slide-4
SLIDE 4

FAUST architecture

[Figure: FAUST architecture block diagram. Labels: RX units, TX units and AHB units; OFDM MOD., ALAM. MOD., CDMA MOD., MAPP., BIT INTER., TURBO CODER, CONV. CODER, ROTOR, EQUAL., CHAN. EST., CONV. DEC., FRAME SYNC., OFDM DEM., CDMA DEM., DE-MAPP., DE-INTER., ETHERNET, RAC, DART, NoC Perf., EXP (SPort, APort), ARM946 with RAM and EXT. RAM CTRL on AHB, Clk & Reset CTRL, JTAG, Async/Sync IF, Async node. Pad interfaces: RAM IF (58 pads), ETHERNET IF (17 pads), NOC1 IF (84 pads), NOC2 IF (83 pads).]

  • 23 computation units
  • Asynchronous NoC (ANOC)
  • 20 ANOC routers
  • GALS design
  • 24 independent clocks
  • Ethernet port
  • Internal/External RAM
  • CPU ARM946ES, cache 4KB-I / 4KB-D
  • Hardware OFDM modulation/demodulation


slide-5
SLIDE 5

ANOC architecture

Asynchronous NoC

  • Asynchronous send/accept handshake protocol
  • QDI 4-phase/4-rail asynchronous logic
  • QoS with two Virtual Channels (Best Effort, Guaranteed Service)

NoC Routers

  • 5-port router
  • Source routing
  • Wormhole packet switching
  • 32-bit payload

FIFO-based GALS interfaces

Mapped onto libraries

  • CMOS ST 130nm
  • TIMA TAL130 library

[Figure: an IP connected through its NIC and GALS interface to an ANOC router built from crossbars, using the asynchronous handshake protocol.]

slide-6
SLIDE 6

FAUST floor-plan with ANOC

  • 4.5 M gates
  • 276 chip pins
  • Chip area 80 mm²
  • ST CMOS 130nm
  • 166 MHz (worst-case)

NoC implementation:

  • Uses ANOC router hard macro
  • Performs buffering and routing of ANOC links
  • No spaghetti routing at top level!

[Figure: FAUST floor-plan with ANOC. Placed units: DART, RAC, ARM946, RAM1, RAM2, Ext. RAM Ctrl., OFDM mod., Rotor, Frame sync., OFDM demod., CDMA Dem., Channel Est., Ethernet, Deinter., Demapp., Conv. Dec., Equal., Mapp., CDMA Mod., NP1, NP2, Ala., Bit. Inter., Turbo Dec., Conv. Codec., Exp., CLK, Reset. Detail: ANOC router (0.211 mm²) with West, East, North, South and Unit ports.]

slide-7
SLIDE 7

Outline

  • Motivation
  • FAUST architecture
  • Migration of DSPIN into FAUST
  • DSPIN implementation
  • Networks comparison


slide-8
SLIDE 8

DSPIN architecture

[Figure: 3x3 mesh of clusters, from (Y-1,X-1) to (Y+1,X+1) around Cluster (Y,X), each cluster with its own clock (CK0 to CK8). Router inputs are labelled "in" and each router has a local port.]

  • Packet-based
  • Distributed router architecture
  • Suited to the GALS approach
  • Mesochronous links between routers
  • Synthesizable with standard cells
  • Neither asynchronous nor custom cells
  • Metastability resolved by a "bi-synchronous FIFO"

More details in: "A Low Cost Network-on-Chip with Guaranteed Service Well Suited to the GALS Approach", NanoNet'06

slide-9
SLIDE 9

NoC architectures comparison

| | ANOC | DSPIN |
|---|---|---|
| Router arity | 5-port router | 5-port router |
| Topology | Irregular 2D mesh | Regular 2D mesh |
| Routing technique | Source routing (9 hops) | Address-based (8 bits), X-First algorithm |
| Switching technique | Wormhole | Wormhole |
| Flit size (payload) | 34 bits (32 bits) | Generic: 34 bits (32 bits) |
| Flow control bits on the flit | Begin of packet (BOP), End of packet (EOP) | Begin of packet (BOP), End of packet (EOP) |
| Virtual channels | Best effort, Guaranteed service | Best effort, Guaranteed service |
| Programming model | Message passing | Shared memory (2 routers per cluster), Message passing (1 router per cluster) |
| Clocking scheme | Fully asynchronous (QDI) with GALS interfaces | Multi-synchronous with mesochronous interfaces |
| Flow control protocol | Send/accept asynchronous handshake | FIFO protocol (Write and WriteOk) |
| Clock tree | None | One per router |
| Physical implementation | Hard macro | Soft macro distributed on five modules |
| Long wires | Inter-router wires | Intra-cluster wires |
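As an illustration of the address-based X-First routing listed for DSPIN, here is a minimal Python sketch. It assumes (Y,X) coordinates as on the cluster labels and the N/S/E/W/L port names from the floor-plan; the direction in which Y grows and the function names are assumptions of this sketch, not taken from the DSPIN design.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (y, x), matching the (Y,X) cluster labels


def x_first_port(cur: Coord, dst: Coord) -> str:
    """Return the output port at router `cur` for a packet addressed to `dst`.

    The X offset is fully corrected before any move along Y.
    """
    cy, cx = cur
    dy, dx = dst
    if dx > cx:
        return "E"
    if dx < cx:
        return "W"
    if dy > cy:
        return "N"  # assumption: Y increases northward
    if dy < cy:
        return "S"
    return "L"      # destination reached: deliver on the local port


def x_first_path(src: Coord, dst: Coord) -> List[Coord]:
    """List the routers visited from `src` to `dst`, both included."""
    step = {"E": (0, 1), "W": (0, -1), "N": (1, 0), "S": (-1, 0)}
    path = [src]
    cur = src
    while (port := x_first_port(cur, dst)) != "L":
        sy, sx = step[port]
        cur = (cur[0] + sy, cur[1] + sx)
        path.append(cur)
    return path
```

Each router only needs the destination address and its own coordinates, which is why an 8-bit address is enough where source routing spends 18 bits on the whole path.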


slide-10
SLIDE 10

Packet format

[Figure: DSPIN packet and ANOC packet formats. Each packet is a first flit followed by further flits of 34 bits (generic width in DSPIN). The DSPIN first flit carries Y and X fields of 4 bits each (8 bits); the ANOC first flit carries hop fields H0 to H8 of 2 bits each (18 bits).]

  • Similar packet format and control bits
  • ANOC uses source routing (18 bits), allowing 9 hops
  • DSPIN uses address-based routing (8 bits)
  • Packet conversion module required: design of the Protocol_conversion module
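The two routing fields can be sketched as bit-packing helpers. This is only an illustration of the field widths given above (4 + 4 bits for DSPIN, nine 2-bit hops for ANOC); the bit ordering inside each field and all function names are assumptions, since the slide does not specify them.

```python
from typing import List, Tuple


def pack_dspin_address(y: int, x: int) -> int:
    """Pack two 4-bit coordinates into an 8-bit field.

    Assumption: Y sits in the high nibble, X in the low nibble.
    """
    if not (0 <= y < 16 and 0 <= x < 16):
        raise ValueError("Y and X must each fit in 4 bits")
    return (y << 4) | x


def unpack_dspin_address(addr: int) -> Tuple[int, int]:
    """Inverse of pack_dspin_address: return (y, x)."""
    return (addr >> 4) & 0xF, addr & 0xF


def pack_anoc_path(hops: List[int]) -> int:
    """Pack up to nine 2-bit hop codes H0..H8 into an 18-bit field.

    Assumption: H0 occupies the two least significant bits.
    """
    if len(hops) > 9:
        raise ValueError("source routing allows at most 9 hops")
    field = 0
    for i, hop in enumerate(hops):
        if not 0 <= hop < 4:
            raise ValueError("each hop code must fit in 2 bits")
        field |= hop << (2 * i)
    return field
```

With 4 bits per coordinate the address covers a mesh of up to 16 x 16 clusters, while the 18-bit path is what limits ANOC packets to 9 hops.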


slide-11
SLIDE 11

FAUST integration

[Figure: the two IP templates. ANOC IP template: IP and NIC (CLK_IP), synchronous SEND/ACCEPT to the GALS interface, asynchronous SEND/ACCEPT to the ANOC router. DSPIN IP template: IP and NIC (CLK_IP), synchronous SEND/ACCEPT to the Protocol_conversion module (with its LUT), synchronous READ/WRITE to the DSPIN router (CLK_NoC), mesochronous READ/WRITE on the router links.]

Protocol_conversion module:

  • Translates the routing algorithm using a LUT
  • Adapts the flow control signals (ANOC: Send/Accept, DSPIN: FIFO protocol)
  • Implements two bi-synchronous FIFOs

ANOC GALS interface:

  • Implements 4 synchronous↔asynchronous FIFOs

FAUST IPs and NICs are not modified


slide-12
SLIDE 12

Outline

  • Motivation
  • FAUST architecture
  • Migration of DSPIN into FAUST
  • DSPIN implementation
  • Networks comparison


slide-13
SLIDE 13

DSPIN implementation

Implementation flow: Synthesis, Floorplanning, Place, Optimize placement, Clock-tree, Route, Optimize

Synthesis

  • Hierarchical synthesis
  • Low Power CMOS ST 130nm technology
  • Standard cells
  • No asynchronous or custom cells

Timing constraints file:

  • Multi-cycle path (mesochronous interfaces)
  • False path (asynchronous interfaces)

Clock-tree

  • GALS compatible
  • Clock gating
  • Implemented in 4 steps

slide-14
SLIDE 14

DSPIN clock-tree

  • Mesochronous links
  • GALS compatible
  • Bi-synchronous FIFO [NOCS 2007]
  • Max skew: 50% of the clock period
  • Timing constraints: set_multi_cycle_path

[Figure: Clk_NoC distributed to Router (0,0), Router (0,1) and Router (0,2). Neighbouring routers use Clk and Clk', with a 180° phase shift and 30% skew between routers and 5% skew within a router. Step 1 adds the buffers/inverters, steps 2 and 3 cover the bottom tree (5% skew), step 4 covers the top tree (30% skew).]

Clock-tree implementation

  1. Add buffers/inverters
  2. Build bottom clock-tree (5% skew)
  3. Characterize bottom clock-tree
  4. Build top clock-tree (30% skew)
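The skew figures above are fractions of the clock period, so their absolute values depend on the NoC clock. The short sketch below only converts them to nanoseconds for a chosen frequency; it makes no claim about how the individual skews combine inside the bi-synchronous FIFO, and the function and key names are hypothetical.

```python
def skew_budget_ns(freq_mhz: float) -> dict:
    """Convert the clock-tree skew percentages into nanoseconds."""
    period_ns = 1000.0 / freq_mhz
    return {
        "period": period_ns,
        "bottom_tree": 0.05 * period_ns,  # 5% skew within a router
        "top_tree": 0.30 * period_ns,     # 30% skew between routers
        "fifo_max": 0.50 * period_ns,     # maximum skew of the mesochronous link
    }
```

At 250 MHz, for example, the period is 4 ns, so the three figures correspond to 0.2 ns, 1.2 ns and 2 ns.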

slide-15
SLIDE 15

FAUST floor-plan with DSPIN

  • Distributed router implementation
  • Soft macro approach
  • Higher floor-plan flexibility
  • NoC adapts to the SoC
  • Long wires are routed in a tree manner
  • Different router configurations are possible

[Figure: FAUST floor-plan with DSPIN. Each DSPIN router is split into N, S, E, W and L modules placed around the units: RAC, OFDM mod., CDMA Mod., NP1, Ala., Bit. Inter., Turbo Dec., Conv. Codec., CLK, ARM946, RAM1, RAM2, Ext. RAM Ctrl., Rotor, Channel Est., Ethernet, Conv. Dec., Equal., Frame sync., OFDM demod., CDMA Dem., Deinter., Demapp., DART. Placement density: 60-70% (reserved area).]


slide-16
SLIDE 16

FAUST floor-plan

[Figure: the two floor-plans side by side: FAUST with ANOC and FAUST with DSPIN.]

slide-17
SLIDE 17

Outline

  • Motivation
  • FAUST architecture
  • Migration of DSPIN into FAUST
  • DSPIN implementation
  • Networks comparison


slide-18
SLIDE 18

NoC Area

| CMOS 130nm | ANOC | DSPIN |
|---|---|---|
| Router | 0.211 mm² | 0.161 mm² |
| GALS interface | 0.070 mm² | 0.024 mm² |
| Clock tree | 0.000 mm² | 0.0016 mm² |
| Total | 0.281 mm² | 0.187 mm² |

  • ANOC is implemented as a hard macro
  • DSPIN is implemented as a soft macro
  • DSPIN is 33% smaller than ANOC


slide-19
SLIDE 19

NoC Throughput

| | ANOC | DSPIN |
|---|---|---|
| Throughput in worst-case conditions | ~ 160 Mflit/s | ≤ 289 Mflit/s |
| Throughput in nominal conditions | ~ 220 Mflit/s | ≤ 408 Mflit/s |

DSPIN throughput is deterministic with respect to the clock frequency (one flit per clock cycle)

Long-wire latency penalty on throughput:

  • DSPIN: the critical path crosses the long wires once
  • ANOC: the critical path crosses the long wires 4 times (4-phase protocol)
  • ANOC link pipelining is feasible

In a commercial circuit, DSPIN will be clocked close to the worst-case frequency (289 MHz) to improve the fabrication yield
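Because DSPIN moves one flit per clock cycle, its peak throughput follows directly from the clock frequency. The sketch below works through that arithmetic with the figures from this slide; the function names are hypothetical, and the 32-bit payload per flit is the value given in the architecture comparison.

```python
PAYLOAD_BITS = 32  # payload bits per 34-bit flit


def dspin_peak_mflits(clock_mhz: float) -> float:
    """Peak DSPIN throughput in Mflit/s: one flit per clock cycle."""
    return clock_mhz


def payload_bandwidth_gbps(mflits: float) -> float:
    """Payload bandwidth in Gbit/s for a given flit rate in Mflit/s."""
    return mflits * PAYLOAD_BITS / 1000.0


def relative_gain(dspin_mflits: float, anoc_mflits: float) -> float:
    """Fractional throughput advantage of DSPIN over ANOC."""
    return dspin_mflits / anoc_mflits - 1.0
```

Comparing DSPIN at its worst-case clock (289 Mflit/s) with ANOC in nominal conditions (about 220 Mflit/s) gives `relative_gain(289, 220)` of roughly 0.31, which appears to be the pairing behind the 31% figure quoted in the conclusion. Comparing like with like (worst case against worst case, or nominal against nominal) gives a larger gap of roughly 80 to 85%.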


slide-20
SLIDE 20

Packet latency (1)

  • Packet latency is computed through many routers
  • DSPIN has a deterministic packet latency with respect to the clock frequency
  • DSPIN has lower First and Last packet latencies
  • ANOC intermediate latency is lower than that of DSPIN, because DSPIN resynchronizes the data at each router

| | ANOC (F = 150 MHz) | DSPIN (F = 150 MHz) | ANOC (F = 250 MHz) | DSPIN (F = 250 MHz) |
|---|---|---|---|---|
| Intermediate router latency | 6.80 ns | 16.66 ns | 6.80 ns | 10.00 ns |
| First + Last latency | 60.00 ns | 56.66 ns | 47.00 ns | 34.00 ns |


slide-21
SLIDE 21

Packet latency (2)

| | ANOC (F = 150 MHz) | DSPIN (F = 150 MHz) | ANOC (F = 250 MHz) | DSPIN (F = 250 MHz) |
|---|---|---|---|---|
| Latency for a 5-hop path | 80.00 ns | 106.66 ns | 68.00 ns | 64.00 ns |
| Latency for a 9-hop path | 106.66 ns | 173.30 ns | 96.00 ns | 104.00 ns |

  • Packet latencies are similar for clock frequencies > 250 MHz
  • Intermediate packet latency is important, but the application should be mapped to optimize data locality (communicate with neighbor IPs rather than with faraway IPs)
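The path latencies can be related to the per-router figures with a simple additive model: the First + Last latency plus one intermediate latency for every other router on the path. The sketch below is inferred from the numbers on the two packet-latency slides, not taken from the authors; in particular the DSPIN cycle counts (8.5 cycles for First + Last, 2.5 cycles per intermediate router) are derived by dividing the tabulated latencies by the clock period.

```python
def path_latency_ns(first_last_ns: float, intermediate_ns: float, hops: int) -> float:
    """Additive model: First + Last latency plus (hops - 2) intermediate routers."""
    if hops < 2:
        raise ValueError("the model needs at least a first and a last router")
    return first_last_ns + (hops - 2) * intermediate_ns


def dspin_latency_ns(hops: int, freq_mhz: float) -> float:
    """DSPIN latency from cycle counts inferred from the tables (assumption)."""
    period_ns = 1000.0 / freq_mhz
    first_last_cycles = 8.5    # 34.00 ns at 250 MHz, 56.66 ns at 150 MHz
    intermediate_cycles = 2.5  # 10.00 ns at 250 MHz, 16.66 ns at 150 MHz
    return path_latency_ns(first_last_cycles * period_ns,
                           intermediate_cycles * period_ns, hops)
```

For DSPIN this reproduces the tabulated 5-hop and 9-hop values at both frequencies to within rounding. Applied to the ANOC figures, the same model lands within about 2 ns of the table (for example 80.4 ns against 80.00 ns for 5 hops at 150 MHz), so it is only an approximation there.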


slide-22
SLIDE 22

Power consumption

| | ANOC | DSPIN (F = 150 MHz) | DSPIN (F = 250 MHz) |
|---|---|---|---|
| Router | 2.07 mW | 2.89 mW | 4.85 mW |
| GALS interface | 1.62 mW | 0.56 mW | 0.81 mW |
| Clock tree | 0.00 mW | 2.44 mW | 4.73 mW |
| Total | 3.69 mW | 5.89 mW | 10.39 mW |

  • Power extraction after P&R
  • Functional packet traffic (OFDM demodulation)
  • Power consumption is mainly dominated by the FIFO data registers
  • DSPIN clock gating reduced the power consumption by 67%
  • The DSPIN clock-tree consumes as much power as the router itself
  • DSPIN clock gating needs to be improved
  • The GALS clock-tree consumes only 2.5% of the total clock-tree power
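The totals and the DSPIN-to-ANOC ratios can be checked from the per-component figures. A minimal sketch, with the dictionary keys chosen here purely for illustration:

```python
# Per-component power in mW: (router, GALS interface, clock tree).
POWER_MW = {
    "ANOC": (2.07, 1.62, 0.00),
    "DSPIN@150MHz": (2.89, 0.56, 2.44),
    "DSPIN@250MHz": (4.85, 0.81, 4.73),
}


def total_mw(config: str) -> float:
    """Sum of router, GALS interface and clock-tree power."""
    return sum(POWER_MW[config])


def ratio_to_anoc(config: str) -> float:
    """Total power of `config` relative to the ANOC total."""
    return total_mw(config) / total_mw("ANOC")


def clock_tree_share(config: str) -> float:
    """Fraction of the total power spent in the clock tree."""
    return POWER_MW[config][2] / total_mw(config)
```

The ratios come out at about 1.6 for DSPIN at 150 MHz and about 2.8 at 250 MHz, and the clock tree accounts for roughly 41 to 46% of the DSPIN total, close to the router's own share.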


slide-23
SLIDE 23

Conclusion

Physical implementation of the DSPIN Network-on-Chip on the FAUST platform

  • Comparison between ANOC and DSPIN at architecture level
  • Adaptation of the DSPIN architecture to manage stream-oriented communications
  • Implementation up to layout of the DSPIN network on the FAUST platform

=> DSPIN mesochronous links fully implemented with standard tools

Comparison between the DSPIN and ANOC NoCs (STMicroelectronics 130nm)

  • Area of DSPIN is 33% smaller than that of ANOC
  • Maximum sustained throughput of DSPIN is 31% higher than that of ANOC
  • ANOC has lower packet latency
  • DSPIN power consumption is 1.5 to 3 times higher than that of ANOC

ANOC is a good candidate for low-latency and low-power applications, while DSPIN is better suited to low-area and high-performance applications


slide-24
SLIDE 24

See you at my PhD defense on May 20th
