Lecture 2 (I): Pipelining & Retiming
Hsie-Chia Chang
E-mail: hcchang@mail.nctu.edu.tw
Fall 2006
Optimized Application-Specific Integrated Systems
Outline
Pipelining of FIR Digital Filters
– Data-Broadcast Structures
– Fine-Grain Pipelining
Parallel Processing
Pipelining and Parallel Processing for Low Power
Retiming
– Definitions and Properties
– Solving Systems of Inequalities
– Retiming Techniques
- Cutset Retiming & Pipelining
- Retiming for Clock Period Minimization
- Retiming for Register Minimization
Introduction
– If some real-time application requires a faster input rate, the critical path can be reduced by either pipelining or parallel processing
Pipelining & Parallel Processing (1/2)
Pipelining
– Reduce the effective critical path by introducing pipelining latches along the critical datapath
– Without any pipelining latches, the critical path can be reduced by
Parallel processing
– Increase the sampling rate by replicating hardware so that several inputs are processed in parallel and several outputs are produced at the same time
These techniques apply to non-recursive computations
Pipelined system: $T_{sample} = T_{CLK}$; parallel system: $T_{sample} \neq T_{CLK}$
Pipelining & Parallel Processing (2/2)
Example 2:
Pipelining of FIR Digital Filters
Schedule of Events in the Pipelined FIR Filter
$T_{critical} = T_M + T_A$
Cutset Pipelining (1/2)
The speed is limited by the longest path between
– any two latches
– an input & a latch
– a latch & an output
– the input & the output
2-level pipelined structure
– The longest path can be reduced by suitably placing the pipelining latches in the architecture
– In this system, at any time, 2 consecutive outputs are computed in an interleaved manner
– Drawbacks
Cutset Pipelining (2/2)
Cutset
Feed-forward cutset
– We can arbitrarily place latches on a feed-forward cutset of any FIR filter structure without affecting the functionality of the algorithm
[Figure: feed-forward cutset between subgraphs G1 and G2, with a delay of kD added on each edge of the cutset]
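The feed-forward cutset rule can be checked numerically. The sketch below is an illustration and is not taken from the slides: it models a 3-tap FIR filter with hypothetical coefficients, treats the multipliers as G1 and the adders as G2, and places k latches on every edge that crosses the cutset. The function names `fir_direct` and `fir_cutset_pipelined` are assumptions made for this example.

```python
def fir_direct(x, coeffs):
    """Reference FIR filter: y(n) = sum_i coeffs[i] * x(n - i), zero initial state."""
    y = []
    for n in range(len(x)):
        acc = 0
        for i, c in enumerate(coeffs):
            if n - i >= 0:
                acc += c * x[n - i]
        y.append(acc)
    return y


def fir_cutset_pipelined(x, coeffs, k=1):
    """Same filter with k >= 1 latches on each edge from the multipliers (G1) to the adders (G2)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # One k-deep latch chain per cutset edge, initialised to zero.
    latches = [[0] * k for _ in coeffs]
    y = []
    for n in range(len(x)):
        # G1: products computed in this clock cycle.
        products = [c * (x[n - i] if n - i >= 0 else 0) for i, c in enumerate(coeffs)]
        # G2: adds the values currently leaving the latch chains.
        y.append(sum(chain[-1] for chain in latches))
        # Clock edge: shift the new products into the latch chains.
        for chain, p in zip(latches, products):
            chain.insert(0, p)
            chain.pop()
    return y


x = [1, 2, 3, 4, 5, 6]
coeffs = [2, -1, 3]  # hypothetical a, b, c
k = 2
ref = fir_direct(x, coeffs)
pip = fir_cutset_pipelined(x, coeffs, k)

# The pipelined output is the reference output delayed by k samples.
print(pip[k:] == ref[:len(x) - k])
```

The only observable change is k samples of extra latency; the sequence of output values is unchanged, which is what "without affecting the functionality" means here.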
Example 3.2.1
Data-Broadcast Structures
Fine-Grain Pipelining
Parallel Processing
Parallel processing is also referred to as block processing
– Block size = number of inputs processed in a clock cycle
– For a 3-tap FIR filter
$$y(n) = a\,x(n) + b\,x(n-1) + c\,x(n-2)$$
the duplicated hardware (block size 3) can be written in MIMO form as:
$$
\begin{aligned}
y(3k)   &= a\,x(3k)   + b\,x(3k-1) + c\,x(3k-2) \\
y(3k+1) &= a\,x(3k+1) + b\,x(3k)   + c\,x(3k-1) \\
y(3k+2) &= a\,x(3k+2) + b\,x(3k+1) + c\,x(3k)
\end{aligned}
$$
[Figure: 3-parallel FIR filter; each delay element is a block delay]
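The block equations can be checked against the serial filter. The following sketch is illustrative only: the coefficient values and the function names `fir_serial` and `fir_3_parallel` are assumptions, and samples before n = 0 are taken to be zero.

```python
def fir_serial(x, a, b, c):
    """Serial form: y(n) = a*x(n) + b*x(n-1) + c*x(n-2), with x(n) = 0 for n < 0."""
    def xs(n):
        return x[n] if n >= 0 else 0
    return [a * xs(n) + b * xs(n - 1) + c * xs(n - 2) for n in range(len(x))]


def fir_3_parallel(x, a, b, c):
    """Block (MIMO) form: iteration k consumes 3 inputs and produces 3 outputs."""
    if len(x) % 3 != 0:
        raise ValueError("input length must be a multiple of the block size 3")

    def xs(n):
        return x[n] if n >= 0 else 0

    y = []
    for k in range(len(x) // 3):
        y0 = a * xs(3 * k) + b * xs(3 * k - 1) + c * xs(3 * k - 2)
        y1 = a * xs(3 * k + 1) + b * xs(3 * k) + c * xs(3 * k - 1)
        y2 = a * xs(3 * k + 2) + b * xs(3 * k + 1) + c * xs(3 * k)
        y.extend([y0, y1, y2])
    return y


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
serial = fir_serial(x, 2, -1, 3)      # hypothetical a, b, c
parallel = fir_3_parallel(x, 2, -1, 3)

# Both forms compute the same output sequence.
print(serial == parallel)
```

In hardware the three expressions inside the loop are evaluated by three copies of the filter in the same clock cycle, so one clock period covers three sample periods.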
Complete Parallel Processing Systems
– A serial-to-parallel converter
– A parallel-to-serial converter
Why Use Parallel Processing?
Communication bounded
– When the critical path is less than $T_{communication}$, the I/O bound dominates and the system is communication bounded.
– Pipelining can be used only to the extent that the critical path is limited by the communication bound.
– Once this bound is reached, pipelining can no longer increase the speed.
Combined Pipelining & Parallel Processing
– After combining M-level pipelining and L-level parallel processing,
CMOS Power Consumption (1/2)
$P_{total} = P_{dynamic} + P_{short\text{-}circuit} + P_{static}$
Short circuit
– current spikes
Static Power
– leakage current
CMOS Power Consumption (2/2)
Based on simple approximation & 1st-order analysis
– Propagation delay
$$T_{pd} = \frac{C_{charge}\,V_0}{k\,(V_0 - V_t)^2}$$
$C_{charge}$: the capacitance to be charged or discharged in a single clock cycle (along the critical path)
$V_0$, $V_t$: the supply voltage and the threshold voltage
$k$: a function of technology parameters
– Power consumption
$$P = C_{total}\,V_0^2\,f$$
$C_{total}$: the total capacitance of the CMOS circuit
$f$: clock frequency of the circuit
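The first-order delay and power models on this slide can be evaluated directly to see the trade-off that the low-power techniques exploit. The sketch below is an illustration under assumed numbers: the capacitances, threshold voltage, constant k, and clock frequency are hypothetical, and the function names are not from the slides.

```python
def propagation_delay(c_charge, v0, vt, k):
    """First-order delay: T_pd = C_charge * V0 / (k * (V0 - Vt)^2)."""
    if v0 <= vt:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return c_charge * v0 / (k * (v0 - vt) ** 2)


def dynamic_power(c_total, v0, f):
    """First-order power: P = C_total * V0^2 * f."""
    return c_total * v0 ** 2 * f


# Hypothetical values chosen only for illustration.
C_CHARGE = 1e-12   # farads
C_TOTAL = 1e-9     # farads
V_T = 0.6          # volts
K = 1e-4           # technology-dependent constant (assumed)
F_CLK = 20e6       # hertz

power_ratio = dynamic_power(C_TOTAL, 3.3, F_CLK) / dynamic_power(C_TOTAL, 5.0, F_CLK)
delay_ratio = (propagation_delay(C_CHARGE, 3.3, V_T, K)
               / propagation_delay(C_CHARGE, 5.0, V_T, K))

print(f"power at 3.3 V relative to 5 V: {power_ratio:.3f}")
print(f"delay at 3.3 V relative to 5 V: {delay_ratio:.3f}")
```

Lowering the supply voltage cuts power quadratically but lengthens the propagation delay, so a lower voltage is only usable if the architecture leaves timing slack.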
Low Power Design
To reduce
– Capacitances
- Transistor/Gate C
- Load C
- Interconnects
- External
– Activity
– Frequency
– Power supply
Other issues
– Off-chip connections have high capacitive load
– System integration
Pipelining for Low Power (1/2)
For an M-level pipelined architecture,
– the critical path is reduced to 1/M and the capacitance to be charged/discharged in a single cycle ($C_{charge}$) is also reduced to 1/M
If the same clock speed is maintained ($f = 1/T_{pd}$),
– only 1/M of the non-pipelined capacitance is required to be charged or discharged, which suggests voltage reduction
– Suppose the voltage can be reduced to $\beta V_0$; the power consumption becomes
$$P_{pipelined} = C_{total}\,(\beta V_0)^2\,f = \beta^2\,P_{non\text{-}pipelined}$$
Pipelining for Low Power (2/2)
– propagation delay of the original architecture
$$T_{pd} = \frac{C_{charge}\,V_0}{k\,(V_0 - V_t)^2}$$
– propagation delay of the pipelined architecture
$$T_{pd} = \frac{(C_{charge}/M)\,\beta V_0}{k\,(\beta V_0 - V_t)^2}$$
– setting the above two equations equal, the following quadratic equation can be obtained to solve β
$$M\,(\beta V_0 - V_t)^2 = \beta\,(V_0 - V_t)^2$$
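The quadratic in β can be solved in closed form. The sketch below expands it to $M V_0^2 \beta^2 - (2 M V_0 V_t + (V_0 - V_t)^2)\beta + M V_t^2 = 0$ and keeps the root for which the scaled supply stays above the threshold. The values M = 3, V0 = 5 V, and Vt = 0.6 V are hypothetical, and `voltage_scaling_factor` is a name assumed for this example.

```python
import math


def voltage_scaling_factor(m, v0, vt):
    """Solve m*(beta*v0 - vt)**2 = beta*(v0 - vt)**2 for the physically valid beta."""
    a = m * v0 ** 2
    b = -(2 * m * v0 * vt + (v0 - vt) ** 2)
    c = m * vt ** 2
    disc = b * b - 4 * a * c
    roots = [(-b + math.sqrt(disc)) / (2 * a), (-b - math.sqrt(disc)) / (2 * a)]
    # The scaled supply beta*v0 must stay above the threshold voltage.
    valid = [r for r in roots if r * v0 > vt]
    return max(valid)


M, V0, VT = 3, 5.0, 0.6  # hypothetical example values
beta = voltage_scaling_factor(M, V0, VT)
power_ratio = beta ** 2  # P_pipelined / P_non-pipelined

print(f"beta = {beta:.4f}, scaled supply = {beta * V0:.3f} V")
print(f"relative power = {power_ratio:.4f}")
```

The smaller root is discarded because it places the supply below the threshold voltage, where the delay model no longer applies.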
Example 3.4.1: Reduce Power by Pipelining
Consider the following two FIR filters.
– What is the supply voltage of the pipelined architecture if the clock periods are identical?
– What is the relative power consumption?
[Figure: the original FIR filter and its pipelined version, in which each multiplier is split into two parts m1 and m2]
Solution
Parallel Processing for Low Power (1/2)
For an L-parallel architecture,
– the charging capacitance ($C_{charge}$) remains the same, but the total capacitance ($C_{total}$) is increased L times
To maintain the same sample rate,
– the clock frequency is reduced by a factor of L ($f = 1/(L\,T_{pd})$), which means $C_{charge}$ has L times longer to charge or discharge
– the supply voltage can be reduced to $\beta V_0$, and the power consumption becomes
$$P_{parallel} = (L\,C_{total})\,(\beta V_0)^2\,\frac{f}{L} = \beta^2\,P_{non\text{-}parallel}$$
Parallel Processing for Low Power (2/2)
– propagation delay of the original architecture
$$T_{pd} = \frac{C_{charge}\,V_0}{k\,(V_0 - V_t)^2}$$
– propagation delay of the parallel architecture, which may be L times longer
$$L\,T_{pd} = \frac{C_{charge}\,\beta V_0}{k\,(\beta V_0 - V_t)^2}$$
– substituting the first expression into the second, the following quadratic equation can be obtained to solve β
$$L\,(\beta V_0 - V_t)^2 = \beta\,(V_0 - V_t)^2$$
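As a cross-check on the quadratic, β can also be found numerically from the delay model itself: lower the supply until the delay grows to L times the original. The sketch below uses bisection, which is a technique chosen for this illustration rather than one given in the slides. The values L = 2, V0 = 5 V, and Vt = 0.6 V are assumed, and the capacitance and k are set to 1 because they cancel in the ratio.

```python
def delay(v, vt, c_charge=1.0, k=1.0):
    """First-order delay model: C_charge * V / (k * (V - Vt)^2)."""
    return c_charge * v / (k * (v - vt) ** 2)


def parallel_beta(l, v0, vt, tol=1e-12):
    """Find beta in (vt/v0, 1] such that delay(beta*v0) = l * delay(v0)."""
    target = l * delay(v0, vt)
    # The delay decreases monotonically with voltage above vt, so bisection applies.
    lo, hi = vt / v0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if delay(mid * v0, vt) > target:
            lo = mid   # too slow at this voltage: raise the supply
        else:
            hi = mid   # fast enough: try a lower supply
    return hi


L_PAR, V0, VT = 2, 5.0, 0.6  # hypothetical example values
beta = parallel_beta(L_PAR, V0, VT)
power_ratio = beta ** 2  # P_parallel / P_non-parallel

print(f"beta = {beta:.4f}, scaled supply = {beta * V0:.3f} V")
print(f"relative power = {power_ratio:.4f}")
```

The returned β satisfies $L(\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$ to within the bisection tolerance, so the numerical and closed-form routes agree.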
Example 3.4.2: Reduce Power by Parallel Processing
Consider the following two FIR filters, with the critical paths shown as dashed lines
– What is the supply voltage of the parallel architecture?
– What is the relative power consumption?
[Figure: the original FIR filter with input x(n) and output y(n), and its 2-parallel version with inputs x(2k), x(2k+1) and outputs y(2k), y(2k+1)]
Solution
Example 3.4.3
Area-efficient architecture