SLIDE 1

Lecture 2 (I): Pipelining & Retiming

Hsie-Chia Chang 張錫嘉 E-mail : hcchang@mail.nctu.edu.tw

Fall 2006

SLIDE 2

Optimized Application-Specific Integrated Systems

Outline

Pipelining of FIR Digital Filters

– Data-Broadcast Structures
– Fine-Grain Pipelining

Parallel Processing

Pipelining and Parallel Processing for Low Power

Retiming

– Definitions and Properties
– Solving Systems of Inequalities
– Retiming Techniques

  • Cutset Retiming & Pipelining
  • Retiming for Clock Period Minimization
  • Retiming for Register Minimization

SLIDE 3

Introduction

– If some real-time application requires a faster input rate, the critical path can be reduced by either pipelining or parallel processing

SLIDE 4

Pipelining & Parallel Processing (1/2)

Pipelining

– Reduce the effective critical path by introducing pipelining latches along the critical datapath

– Without any pipelining latches, the critical path can be reduced by

Parallel processing

– Increase the sampling rate by replicating hardware so that several inputs can be processed in parallel and several outputs can be produced at the same time

These techniques apply to non-recursive computations

[Figure labels: continue sending; Tsample = TCLK; Tsample ≠ TCLK]

SLIDE 5

Pipelining & Parallel Processing (2/2)

Example 2:

SLIDE 6

Pipelining of FIR Digital Filters

Schedule of Events in the Pipelined FIR Filter

$T_{\text{critical}} = T_M + T_A$
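
The schedule of a pipelined FIR filter can be illustrated with a small simulation. This is a minimal sketch, assuming a 3-tap filter y(n) = a·x(n) + b·x(n−1) + c·x(n−2) and a feed-forward cutset that latches the partial sum a·x(n) + b·x(n−1) together with the sample x(n−2). Taking TM as the multiplier delay and TA as the adder delay, each stage then contains one multiplication and one addition, which is consistent with Tcritical = TM + TA. The function names and the latch names p_reg and q_reg are illustrative and do not come from the slides.

```python
def fir_direct(x, a, b, c):
    """Non-pipelined 3-tap FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    x1 = x2 = 0  # delay line holding x(n-1) and x(n-2)
    out = []
    for xn in x:
        out.append(a * xn + b * x1 + c * x2)
        x2, x1 = x1, xn
    return out


def fir_pipelined(x, a, b, c):
    """2-level pipelined version with latches on a feed-forward cutset.

    Stage 1 computes a*x(n) + b*x(n-1); stage 2 multiplies the latched
    x(n-2) by c and adds it. The output is y(n-1): one extra cycle of latency.
    """
    x1 = x2 = 0  # delay line holding x(n-1) and x(n-2)
    p_reg = 0    # pipeline latch: partial sum from stage 1 (hypothetical name)
    q_reg = 0    # pipeline latch: delayed copy of x(n-2) (hypothetical name)
    out = []
    for xn in x:
        # Stage 2 works on the values latched during the previous cycle.
        out.append(p_reg + c * q_reg)
        # Stage 1 works on the current input; results are latched for next cycle.
        p_reg = a * xn + b * x1
        q_reg = x2
        x2, x1 = x1, xn
    return out


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 5]
    print(fir_direct(samples, 2, 3, 5))
    print(fir_pipelined(samples, 2, 3, 5))
```

With x = [1, 2, 3, 4, 5] and (a, b, c) = (2, 3, 5), the direct form gives [2, 7, 17, 27, 37] and the pipelined form gives [0, 2, 7, 17, 27]: the same samples, one clock cycle later. Pipelining shortens the critical path at the cost of extra latches and latency.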

SLIDE 7

Cutset Pipelining (1/2)

The speed is limited by the longest path between

– any two latches
– an input & a latch
– a latch & an output
– the input & the output

2-level pipelined structure

– The longest path can be reduced by suitably placing the pipelining latches in the architecture
– In this system, at any time, 2 consecutive outputs are computed in an interleaved manner
– Drawbacks

SLIDE 8

Cutset Pipelining (2/2)

Cutset

Feed-forward cutset

– We can arbitrarily place latches on a feed-forward cutset of any FIR filter structure without affecting the functionality of the algorithm

[Figure: feed-forward cutset between subgraphs G1 and G2, with +kD on each cutset edge]

SLIDE 9

Example 3.2.1

SLIDE 10

Data-Broadcast Structures

SLIDE 11

Fine-Grain Pipelining

SLIDE 12

Parallel Processing

Parallel processing is also referred to as block processing

– Block size = number of inputs processed in a clock cycle
– For a 3-tap FIR filter, the duplicated hardware can be shown as:

$y(n) = a\,x(n) + b\,x(n-1) + c\,x(n-2)$

In MIMO form,

$$
\begin{cases}
y(3k) = a\,x(3k) + b\,x(3k-1) + c\,x(3k-2) \\
y(3k+1) = a\,x(3k+1) + b\,x(3k) + c\,x(3k-1) \\
y(3k+2) = a\,x(3k+2) + b\,x(3k+1) + c\,x(3k)
\end{cases}
$$

[Figure labels: delay, block delay]
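
The block equations can be checked numerically. The following is a minimal sketch, assuming zero initial conditions (x(n) = 0 for n < 0) and an input length that is a multiple of the block size 3; the function names are illustrative. Each loop iteration stands for one clock cycle of the parallel system, in which three inputs are consumed and three outputs are produced.

```python
def fir_serial(x, a, b, c):
    """Sequential 3-tap FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    def at(i):
        return x[i] if i >= 0 else 0  # assumed zero initial conditions
    return [a * at(n) + b * at(n - 1) + c * at(n - 2) for n in range(len(x))]


def fir_3_parallel(x, a, b, c):
    """3-parallel (block size 3) form: one block of three outputs per cycle."""
    if len(x) % 3 != 0:
        raise ValueError("input length must be a multiple of the block size 3")

    def at(i):
        return x[i] if i >= 0 else 0  # assumed zero initial conditions

    y = []
    for k in range(len(x) // 3):  # k indexes clock cycles, not samples
        y0 = a * at(3 * k) + b * at(3 * k - 1) + c * at(3 * k - 2)
        y1 = a * at(3 * k + 1) + b * at(3 * k) + c * at(3 * k - 1)
        y2 = a * at(3 * k + 2) + b * at(3 * k + 1) + c * at(3 * k)
        y.extend([y0, y1, y2])
    return y


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 5, 6]
    print(fir_serial(samples, 2, 3, 5))
    print(fir_3_parallel(samples, 2, 3, 5))
```

Both functions return the same sequence. The parallel form only changes how many samples are handled per clock cycle, which is why Tsample differs from TCLK in a parallel system.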

SLIDE 13

Complete Parallel Processing Systems

– A serial-to-parallel converter
– A parallel-to-serial converter

SLIDE 14

Why Use Parallel Processing?

Communication bounded

– When the critical path is less than Tcommunication, the I/O bound dominates and this system is communication bounded.
– Pipelining can be used only to the extent such that the critical path is limited by the communication bound.
– Once this is reached, pipelining can no longer increase the speed

SLIDE 15

Combined Pipelining & Parallel Processing

– After combining M-level pipelining and L-level parallel processing,

SLIDE 16

CMOS Power Consumption (1/2)

$P_{\text{total}} = P_{\text{dynamic}} + P_{\text{short-circuit}} + P_{\text{static}}$

Short circuit

– current spikes

Static Power

– leakage current

SLIDE 17

CMOS Power Consumption (2/2)

Based on simple approximation & 1st-order analysis

– Propagation delay

$T_{pd} = \dfrac{C_{\text{charge}} \cdot V_0}{k \, (V_0 - V_t)^2}$

  • Ccharge: the capacitance to be charged or discharged in a single clock cycle (along the critical path)
  • V0, Vt: the supply voltage and the threshold voltage
  • k: a function of technology parameters

– Power consumption

$P = C_{\text{total}} \cdot V_0^2 \cdot f$

  • Ctotal: the total capacitance of the CMOS circuit
  • f: clock frequency of the circuit

SLIDE 18

Low Power Design

To reduce

– Capacitances

  • Transistor/Gate C
  • Load C
  • Interconnects
  • External

– Activity
– Frequency
– Power supply

Other issues

– Off-chip connections have high capacitive load
– System integration

SLIDE 19

Pipelining for Low Power (1/2)

For an M-level pipelined architecture,

– the critical path is reduced to 1/M and the capacitance to be charged/discharged in a single cycle (Ccharge) is also reduced to 1/M

If the same clock speed is maintained (f = 1/Tpd),

– only 1/M of the non-pipelined capacitance is required to be charged or discharged, which suggests voltage reduction
– Suppose the voltage can be reduced to βV0; the power consumption becomes

$P_{\text{pipelined}} = C_{\text{total}} \cdot \beta^2 V_0^2 \cdot f = \beta^2 \cdot P_{\text{non-pipelined}}$

SLIDE 20

Pipelining for Low Power (2/2)

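– propagation delay of the original architecture

$T_{pd} = \dfrac{C_{\text{charge}} \cdot V_0}{k \, (V_0 - V_t)^2}$

– propagation delay of the pipelined architecture

$T_{pd} = \dfrac{(C_{\text{charge}} / M) \cdot \beta V_0}{k \, (\beta V_0 - V_t)^2}$

– setting the above two equations equal, the following quadratic equation can be obtained to solve β

$M \, (\beta V_0 - V_t)^2 = \beta \, (V_0 - V_t)^2$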
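
The quadratic M(βV0 − Vt)² = β(V0 − Vt)² can be solved in closed form. Below is a minimal sketch; the function name and the numeric values (M = 2, V0 = 5 V, Vt = 0.6 V) are illustrative assumptions, not figures taken from the slides. Of the two roots, only the larger one keeps the scaled supply βV0 above the threshold voltage Vt, so that is the one returned.

```python
import math


def pipelined_voltage_scale(m, v0, vt):
    """Solve m*(beta*v0 - vt)**2 == beta*(v0 - vt)**2 for the usable beta.

    Expanding gives a*beta**2 + b*beta + c = 0 with the coefficients below.
    The smaller root puts beta*v0 below vt, so the larger root is returned.
    """
    if m < 1 or vt < 0 or v0 <= vt:
        raise ValueError("expected m >= 1 and v0 > vt >= 0")
    a = m * v0 ** 2
    b = -(2 * m * v0 * vt + (v0 - vt) ** 2)
    c = m * vt ** 2
    disc = b * b - 4 * a * c  # non-negative for valid inputs
    return (-b + math.sqrt(disc)) / (2 * a)


if __name__ == "__main__":
    # Illustrative values only: 2-level pipelining, 5 V supply, 0.6 V threshold.
    beta = pipelined_voltage_scale(2, 5.0, 0.6)
    print(f"beta = {beta:.4f}")
    print(f"scaled supply = {beta * 5.0:.3f} V")
    print(f"power ratio beta^2 = {beta ** 2:.3f}")
```

For these assumed values β is about 0.603, so the supply could drop to roughly 3.0 V and the power ratio β² is about 0.36 at the same clock speed.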

SLIDE 21

Example 3.4.1: Reduce Power by Pipelining

Consider the following two FIR filters.

– What is the supply voltage of the pipelined architecture if the clock periods are identical?
– What is the relative power consumption?

[Figure: the original FIR filter and its pipelined version; labels: x(n), y(n), D, m1, m2]

SLIDE 22

Solution

SLIDE 23

Parallel Processing for Low Power (1/2)

For an L-parallel architecture,

– the charge capacitance remains the same, but the total capacitance (Ctotal) is increased L times

To maintain the same sample rate,

– The clock frequency is reduced to 1/L of its original value, 1/(L·Tpd) = f/L, which means there is L times as long to charge or discharge Ccharge.
– The supply voltage can be reduced to βV0; the power consumption becomes

$P_{\text{parallel}} = (L \cdot C_{\text{total}}) \cdot \beta^2 V_0^2 \cdot \dfrac{f}{L} = \beta^2 \cdot P_{\text{non-parallel}}$

SLIDE 24

Parallel Processing for Low Power (2/2)

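– propagation delay of the original architecture

$T_{pd} = \dfrac{C_{\text{charge}} \cdot V_0}{k \, (V_0 - V_t)^2}$

– propagation delay of the parallel architecture

$L \cdot T_{pd} = \dfrac{C_{\text{charge}} \cdot \beta V_0}{k \, (\beta V_0 - V_t)^2}$

– combining these two propagation delays, the following quadratic equation can be obtained to solve β

$L \, (\beta V_0 - V_t)^2 = \beta \, (V_0 - V_t)^2$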
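
The same β can be found directly from the delay model, without expanding the quadratic. This is a minimal sketch that uses bisection (a technique swapped in here, not the slides' method) to find the supply scale at which the first-order delay Ccharge·V/(k(V − Vt)²) becomes exactly L times the original delay. The function names and the values L = 2, V0 = 5 V, Vt = 0.6 V are illustrative assumptions; Ccharge and k cancel out, so both are set to 1.

```python
def propagation_delay(v, vt, c_charge=1.0, k=1.0):
    """First-order CMOS delay model: T = C_charge * V / (k * (V - Vt)**2)."""
    return c_charge * v / (k * (v - vt) ** 2)


def parallel_voltage_scale(l, v0, vt, iterations=100):
    """Find beta such that delay(beta*v0) == l * delay(v0), by bisection.

    The delay falls monotonically as the supply rises above vt, so the
    target delay is bracketed between beta = vt/v0 and beta = 1.
    """
    if l < 1 or vt <= 0 or v0 <= vt:
        raise ValueError("expected l >= 1 and v0 > vt > 0")
    target = l * propagation_delay(v0, vt)
    lo = (vt / v0) * (1 + 1e-9)  # just above threshold: delay is huge
    hi = 1.0                     # full supply: delay <= target
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if propagation_delay(mid * v0, vt) > target:
            lo = mid  # too slow, the supply must be higher
        else:
            hi = mid  # fast enough, try a lower supply
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Illustrative values only: 2-parallel system, 5 V supply, 0.6 V threshold.
    beta = parallel_voltage_scale(2, 5.0, 0.6)
    print(f"beta = {beta:.4f}, power ratio beta^2 = {beta ** 2:.3f}")
```

For these assumed values β is again about 0.603, because the quadratic has the same form as in the pipelined case with L in place of M. In an actual design the charging capacitances of the two architectures usually differ, which changes the coefficients.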

SLIDE 25

Example 3.4.2: Reduce Power by Parallel

Consider the following two FIR filters, with their critical paths shown as dashed lines.

– What is the supply voltage of the parallel architecture?
– What is the relative power consumption?

[Figure: the original FIR filter and its 2-parallel version; labels: x(n), y(n), x(2k), x(2k+1), y(2k), y(2k+1), D]

SLIDE 26

Solution

SLIDE 27

Example 3.4.3

Area-efficient architecture

SLIDE 28

Summary

In pipelining & parallel processing,

– M-level pipelining,
– L-level parallel processing,
– Combining M-level pipelining & L-level parallel processing,

For low power design,

– Pipelining
– Parallel Processing
– Combining Pipelining and Parallel Processing