SLIDE 1

Lecture 2 (I): Pipelining & Retiming

Hsie-Chia Chang 張錫嘉 E-mail : hcchang@mail.nctu.edu.tw

Fall 2006

SLIDE 2

Optimized Application-Specific Integrated Systems

Outline

Pipelining of FIR Digital Filters

– Data-Broadcast Structures
– Fine-Grain Pipelining

Parallel Processing

Pipelining and Parallel Processing for Low Power

Retiming

– Definitions and Properties
– Solving Systems of Inequalities
– Retiming Techniques

Cutset Retiming & Pipelining
Retiming for Clock Period Minimization
Retiming for Register Minimization

SLIDE 3

Introduction

– If a real-time application requires a faster input (sample) rate, the effective critical path can be reduced by either pipelining or parallel processing

SLIDE 4

Pipelining & Parallel Processing (1/2)

Pipelining

– Reduce the effective critical path by introducing pipelining latches along the critical datapath

– Without any pipelining latches, the critical path can be reduced by

Parallel processing

– Increase the sampling rate by replicating hardware so that multiple inputs can be processed in parallel and multiple outputs can be produced at the same time

These techniques are applied to non-recursive computations

[Figure labels: continue sending; Tsample = TCLK; Tsample ≠ TCLK]

SLIDE 5

Pipelining & Parallel Processing (2/2)

Example 2:

SLIDE 6

Pipelining of FIR Digital Filters

Schedule of Events in the Pipelined FIR Filter

$T_{critical} = T_M + T_A$

SLIDE 7

Cutset Pipelining (1/2)

The speed is limited by the longest path between

– any two latches
– an input & a latch
– a latch & an output
– the input & the output

2-level pipelined structure

– The longest path can be reduced by suitably placing the pipelining latches in the architecture
– In this system, at any time, 2 consecutive outputs are computed in an interleaved manner
– Drawbacks

SLIDE 8

Cutset Pipelining (2/2)

Cutset

Feed-forward cutset

– We can arbitrarily place latches on a feed-forward cutset of any FIR filter structure without affecting the functionality of the algorithm

[Figure: a cutset separating subgraphs G1 and G2, with k delays (kD) added to each edge of the cutset]
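The feed-forward cutset property can be checked numerically. The following is a minimal sketch, assuming the 3-tap filter y(n) = a·x(n) + b·x(n−1) + c·x(n−2) as the example structure and zero initial state; the function names `fir_direct` and `fir_pipelined`, the latch names, and the choice of cutset are illustrative assumptions, not taken from the slides. One latch is placed on each edge of a cutset that separates the first adder from the last one, so the output should be the same sequence delayed by one clock cycle.

```python
def fir_direct(x, a, b, c):
    """Direct-form 3-tap FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    y = []
    x1 = x2 = 0  # delay line holding x(n-1) and x(n-2), zero initial state
    for xn in x:
        y.append(a * xn + b * x1 + c * x2)
        x1, x2 = xn, x1
    return y


def fir_pipelined(x, a, b, c):
    """Same filter with one latch on each edge of a feed-forward cutset."""
    y = []
    x1 = x2 = 0    # delay line holding x(n-1) and x(n-2)
    s1_latch = 0   # assumed pipelining latch on the first adder's output
    x2_latch = 0   # assumed pipelining latch on the edge feeding the last tap
    for xn in x:
        # Second stage: reads only latched values (one multiply, one add).
        y.append(s1_latch + c * x2_latch)
        # First stage: results are latched for the next clock cycle.
        s1_latch = a * xn + b * x1
        x2_latch = x2
        x1, x2 = xn, x1
    return y
```

With these assumptions, each stage contains at most one multiplication followed by one addition, and `fir_pipelined(x, a, b, c)` returns the output of `fir_direct` shifted by one sample, with a leading zero from the initial latch contents.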

SLIDE 9

Example 3.2.1

SLIDE 10

Data-Broadcast Structures

SLIDE 11

Fine-Grain Pipelining

SLIDE 12

Parallel Processing

Parallel processing is also referred to as block processing

– Block size = number of inputs processed in a clock cycle
– For a 3-tap FIR filter, the duplicated hardware can be shown as:

$y(n) = a\,x(n) + b\,x(n-1) + c\,x(n-2)$

In MIMO form,

$\begin{cases} y(3k) = a\,x(3k) + b\,x(3k-1) + c\,x(3k-2) \\ y(3k+1) = a\,x(3k+1) + b\,x(3k) + c\,x(3k-1) \\ y(3k+2) = a\,x(3k+2) + b\,x(3k+1) + c\,x(3k) \end{cases}$

[Figure labels: delay, block delay]
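The block equations can be compared against the sample-by-sample filter with a short simulation. This is a minimal sketch under stated assumptions: zero initial state, an input length that is a multiple of the block size 3, and hypothetical function names `fir_serial` and `fir_block3`.

```python
def fir_serial(x, a, b, c):
    """Sample-by-sample 3-tap FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    y = []
    for n in range(len(x)):
        x1 = x[n - 1] if n >= 1 else 0
        x2 = x[n - 2] if n >= 2 else 0
        y.append(a * x[n] + b * x1 + c * x2)
    return y


def fir_block3(x, a, b, c):
    """3-parallel version: three inputs in, three outputs out per clock cycle."""
    if len(x) % 3 != 0:
        raise ValueError("input length must be a multiple of the block size 3")
    y = []
    xm1 = xm2 = 0  # x(3k-1) and x(3k-2), held over from the previous block
    for k in range(len(x) // 3):
        x0, x1, x2 = x[3 * k], x[3 * k + 1], x[3 * k + 2]
        y.append(a * x0 + b * xm1 + c * xm2)  # y(3k)
        y.append(a * x1 + b * x0 + c * xm1)   # y(3k+1)
        y.append(a * x2 + b * x1 + c * x0)    # y(3k+2)
        xm1, xm2 = x2, x1                     # one block delay = 3 sample delays
    return y
```

Both functions produce the same output sequence; the block version simply evaluates the three equations once per loop iteration, which stands in for one clock cycle of the parallel hardware.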

SLIDE 13

Complete Parallel Processing Systems

– A serial-to-parallel converter
– A parallel-to-serial converter

SLIDE 14

Why use Parallel Processing?

Communication bounded

– When the critical path is less than $T_{communication}$, the I/O bound dominates and the system is communication bounded.
– Pipelining can be used only until the critical path is limited by the communication bound.
– Once this bound is reached, pipelining can no longer increase the speed

SLIDE 15

Combined Pipelining & Parallel Processing

– After combining M-level pipelining and L-level parallel processing,

SLIDE 16

CMOS Power Consumption (1/2)

$P_{total} = P_{dynamic} + P_{\text{short-circuit}} + P_{static}$

Short circuit

– current spikes

Static power

– leakage current

SLIDE 17

CMOS Power Consumption (2/2)

Based on a simple approximation and first-order analysis

– Propagation delay: $T_{pd} = \dfrac{C_{charge} \cdot V_0}{k \cdot (V_0 - V_t)^2}$

$C_{charge}$: the capacitance to be charged or discharged in a single clock cycle (along the critical path)
$V_0$, $V_t$: the supply voltage and the threshold voltage
$k$: a function of technology parameters

– Power consumption: $P = C_{total} \cdot V_0^2 \cdot f$

$C_{total}$: the total capacitance of the CMOS circuit
$f$: clock frequency of the circuit
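The two first-order formulas, T_pd = C_charge·V_0 / (k·(V_0 − V_t)²) and P = C_total·V_0²·f, can be written as small helper functions to see how supply voltage trades delay against power. This is a minimal sketch; the function names and any numbers passed to them are hypothetical, not values from the lecture.

```python
def propagation_delay(c_charge, v0, vt, k):
    """First-order delay: T_pd = C_charge * V0 / (k * (V0 - Vt)^2)."""
    if v0 <= vt:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return c_charge * v0 / (k * (v0 - vt) ** 2)


def dynamic_power(c_total, v0, f):
    """First-order power: P = C_total * V0^2 * f."""
    return c_total * v0 ** 2 * f
```

For example, with an assumed threshold of 0.6 V, lowering the supply from 5 V to 2.5 V cuts `dynamic_power` to one quarter at the same frequency, while `propagation_delay` grows, which is why a voltage reduction has to be paid for with pipelining or parallelism.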

SLIDE 18

Low Power Design

To reduce

– Capacitances

Transistor/Gate C
Load C
Interconnects
External

– Activity
– Frequency
– Power supply

Other issues

– Off-chip connections have high capacitive load
– System integration

SLIDE 19

Pipelining for Low Power (1/2)

For an M-level pipelined architecture,

– the critical path is reduced to 1/M and the capacitance to be charged/discharged in a single cycle ($C_{charge}$) is also reduced to 1/M

If the same clock speed is maintained ($f = 1/T_{pd}$),

– only 1/M of the non-pipelined capacitance needs to be charged or discharged, which suggests voltage reduction
– Suppose the voltage can be reduced to $\beta \cdot V_0$; the power consumption becomes

$P_{pipelined} = C_{total} \cdot \beta^2 \cdot V_0^2 \cdot f = \beta^2 \cdot P_{non\text{-}pipelined}$

SLIDE 20

Pipelining for Low Power (2/2)

– propagation delay of the original architecture: $T_{pd} = \dfrac{C_{charge} \cdot V_0}{k \cdot (V_0 - V_t)^2}$
– propagation delay of the pipelined architecture: $T_{pd} = \dfrac{(C_{charge}/M) \cdot \beta V_0}{k \cdot (\beta V_0 - V_t)^2}$
– setting the above two delays equal, the following quadratic equation can be obtained to solve for $\beta$

$M \cdot (\beta V_0 - V_t)^2 = \beta \cdot (V_0 - V_t)^2$
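The quadratic M·(βV_0 − V_t)² = β·(V_0 − V_t)² can be solved numerically for β. The sketch below is illustrative only: the function name `voltage_scaling_factor` and the values M = 2, V_0 = 5 V, V_t = 0.6 V used in the usage note are assumptions, not numbers from the slides. It expands the equation to M·V_0²·β² − (2·M·V_0·V_t + (V_0 − V_t)²)·β + M·V_t² = 0 and returns the larger root, because the two roots multiply to (V_t/V_0)² and only the larger one keeps β·V_0 above the threshold voltage.

```python
import math


def voltage_scaling_factor(m, v0, vt):
    """Solve m*(beta*v0 - vt)^2 = beta*(v0 - vt)^2 for the usable root beta."""
    if v0 <= vt:
        raise ValueError("supply voltage must exceed the threshold voltage")
    # Coefficients of the expanded quadratic a*beta^2 + b*beta + c = 0.
    a = m * v0 ** 2
    b = -(2 * m * v0 * vt + (v0 - vt) ** 2)
    c = m * vt ** 2
    disc = b * b - 4 * a * c  # equals d^2 * (4*m*v0*vt + d^2) with d = v0 - vt
    # The larger root is the one for which beta*v0 stays above vt.
    return (-b + math.sqrt(disc)) / (2 * a)
```

Under the assumed values, `voltage_scaling_factor(2, 5.0, 0.6)` is approximately 0.603, so the supply could drop to about 3 V and the power, scaling as β², to roughly 36% of the non-pipelined figure. With `m = 1` the function returns 1, meaning no voltage reduction is available.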

SLIDE 21

Example 3.4.1: Reduce Power by Pipelining

Consider the following two FIR filters.

– What is the supply voltage of the pipelined architecture if the clock periods are identical?
– What is the relative power consumption?

[Figure: the original FIR filter and its pipelined version, with input x(n), output y(n), delay elements D, and blocks labelled m1 and m2]

SLIDE 22

Solution

SLIDE 23

Parallel Processing for Low Power (1/2)

For an L-parallel architecture,

– the charging capacitance ($C_{charge}$) remains the same, but the total capacitance ($C_{total}$) is increased L times

To maintain the same sample rate,

– The clock frequency is reduced to $f/L = 1/(L \cdot T_{pd})$, where $f = 1/T_{pd}$ is the original clock frequency, which means $C_{charge}$ can be charged or discharged over a period L times longer.
– The supply voltage can be reduced to $\beta \cdot V_0$, and the power consumption becomes

$P_{parallel} = (L \cdot C_{total}) \cdot (\beta V_0)^2 \cdot \dfrac{f}{L} = \beta^2 \cdot P_{non\text{-}parallel}$

SLIDE 24

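Parallel Processing for Low Power (2/2)

– propagation delay of the original architecture: $T_{pd} = \dfrac{C_{charge} \cdot V_0}{k \cdot (V_0 - V_t)^2}$
– propagation delay of the parallel architecture, which may be L times longer: $L \cdot T_{pd} = \dfrac{C_{charge} \cdot \beta V_0}{k \cdot (\beta V_0 - V_t)^2}$
– eliminating $T_{pd}$ between these two expressions, the following quadratic equation can be obtained to solve for $\beta$

$L \cdot (\beta V_0 - V_t)^2 = \beta \cdot (V_0 - V_t)^2$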
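Written out in standard form, the quadratic for the L-parallel case reads as follows. This is only a sketch of the algebra, with $V_0$ the supply voltage and $V_t$ the threshold voltage as defined for the propagation delay; no numerical values are assumed.

```latex
\begin{aligned}
L\,(\beta V_0 - V_t)^2 &= \beta\,(V_0 - V_t)^2 \\
L V_0^2\,\beta^2 - \left[\,2 L V_0 V_t + (V_0 - V_t)^2\,\right]\beta + L V_t^2 &= 0 \\
\beta &= \frac{B \pm \sqrt{B^2 - 4 L^2 V_0^2 V_t^2}}{2 L V_0^2},
\qquad B = 2 L V_0 V_t + (V_0 - V_t)^2
\end{aligned}
```

The two roots multiply to $(V_t/V_0)^2$, so one of them gives $\beta V_0 < V_t$ and has to be discarded; the larger root is the usable voltage scaling factor.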

SLIDE 25

Example 3.4.2: Reduce Power by Parallel Processing

Consider the following two FIR filters, with their critical paths shown as dashed lines

– What is the supply voltage of the parallel architecture?
– What is the relative power consumption?

[Figure: the original FIR filter with input x(n) and output y(n), and its 2-parallel version with inputs x(2k), x(2k+1), outputs y(2k), y(2k+1), and delay elements D]

SLIDE 26

Solution

SLIDE 27

Example 3.4.3

Area-efficient architecture

SLIDE 28
