[PPT] - CDA 4253 FPGA System Design: Optimization Techniques (Hao Zheng) - PowerPoint Presentation

SLIDE 1

CDA 4253 FPGA System Design: Optimization Techniques

Hao Zheng, Comp Sci & Eng, Univ of South Florida

SLIDE 2

Extracted from Advanced FPGA Design by Steve Kilts

SLIDE 3

Optimization for Performance

SLIDE 4

Performance Definitions

Throughput: the number of inputs processed per unit time.
Latency: the amount of time for an input to be processed.
Maximizing throughput and minimizing latency are in conflict.
Both require timing optimization:
Reduce the delay of the critical path.

SLIDE 5

Achieving High Throughput: Pipelining

Divide data processing into stages.
Process different data inputs in different stages simultaneously.

xpower = 1;
for (i = 0; i < 3; i++)
    xpower = x * xpower;

- Non-pipelined version

-- cnt, x, and xpower are taken to be integers, as the literals 3 and 0 suggest
process (clk)
begin
  if rising_edge(clk) then
    if start = '1' then
      cnt    <= 3;
      xpower <= 1;      -- matches "xpower = 1" in the loop above
      done   <= '0';
    elsif cnt > 0 then
      cnt    <= cnt - 1;
      xpower <= xpower * x;
    else
      done   <= '1';
    end if;
  end if;
end process;

Throughput: 1 data / 3 cycles = 0.33 data / cycle.
Latency: 3 cycles.
Critical path delay: 1 multiplier delay.

SLIDE 6

Achieving High Throughput: Pipelining

xpower = 1;
for (i = 0; i < 3; i++)
    xpower = x * xpower;

- Pipelined version

process (clk)
begin
  if rising_edge(clk) then
    -- stage 1
    if start = '1' then
      x1      <= x;
      xpower1 <= x;
    end if;
    done1 <= start;   -- assigned every cycle so done1 returns to '0' when start is low

    -- stage 2
    x2      <= x1;
    xpower2 <= xpower1 * x1;
    done2   <= done1;

    -- stage 3
    xpower <= xpower2 * x2;
    done   <= done2;
  end if;
end process;

Throughput: 1 data / cycle.
Latency: 3 cycles + register delays.
Critical path delay: 1 multiplier delay.
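
The slide code leaves out declarations. A self-contained sketch of the same three-stage pipeline might look like the following. The entity name cube_pipe, the generic W, the port names, and the choice to truncate each product back to W bits with resize are all assumptions made for illustration; the slides do not specify types or widths.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 3-stage pipeline computing x**3 (names and widths assumed).
entity cube_pipe is
  generic (W : positive := 8);
  port (
    clk    : in  std_logic;
    start  : in  std_logic;                -- marks a valid input on x
    x      : in  unsigned(W-1 downto 0);
    xpower : out unsigned(W-1 downto 0);   -- x**3, truncated to W bits
    done   : out std_logic                 -- marks a valid result on xpower
  );
end entity cube_pipe;

architecture rtl of cube_pipe is
  signal x1, x2           : unsigned(W-1 downto 0) := (others => '0');
  signal xpower1, xpower2 : unsigned(W-1 downto 0) := (others => '0');
  signal done1, done2     : std_logic := '0';
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- stage 1: capture the input
      x1      <= x;
      xpower1 <= x;
      done1   <= start;
      -- stage 2: x**2 (product is 2W bits wide, keep the low W bits)
      x2      <= x1;
      xpower2 <= resize(xpower1 * x1, W);
      done2   <= done1;
      -- stage 3: x**3
      xpower  <= resize(xpower2 * x2, W);
      done    <= done2;
    end if;
  end process;
end architecture rtl;
```

A new x can be presented on every clock cycle; done rises three clock edges after the corresponding start.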

SLIDE 7

Comparison

[Figure: iterative implementation and pipelined implementation]
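
To make the comparison concrete, consider processing $N$ inputs. This is a back-of-the-envelope sketch rather than a result from the slides; it assumes both versions run at the same clock period (each has a critical path of one multiplier) and that the pipeline accepts a new input every cycle.

```latex
\begin{align*}
T_{\text{iterative}}(N) &= 3N \ \text{cycles} \\
T_{\text{pipelined}}(N) &= 3 + (N - 1) = N + 2 \ \text{cycles} \\
\frac{T_{\text{iterative}}(N)}{T_{\text{pipelined}}(N)} &= \frac{3N}{N + 2} \longrightarrow 3 \quad \text{as } N \to \infty
\end{align*}
```

For example, $N = 100$ inputs take 300 cycles iteratively and 102 cycles when pipelined.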

SLIDE 8

Achieving High Throughput: Pipelining

Loop unrolling

[Figure: block C with input X, output Y, and a register (Reg)]

SLIDE 9

Achieving High Throughput: Pipelining

Loop unrolling

[Figure: blocks C0, C1, ..., Cn between input X and output Y, with registers (Reg)]

SLIDE 10

Achieving High Throughput: Pipelining

Divide data processing into stages.
Process different data inputs in different stages simultaneously.

[Figure: block from din to dout]

SLIDE 11

Achieving High Throughput: Pipelining

Divide data processing into stages.
Process different data inputs in different stages simultaneously.

[Figure: din -> stage 1 -> stage 2 -> ... -> stage n -> dout, with registers between stages]

Penalty: increase in area, as logic needs to be duplicated for different stages.

SLIDE 12

Reducing Latency

Closely related to reducing critical path delay.
Reducing pipeline registers reduces latency.

[Figure: din -> stage 1 -> stage 2 -> ... -> stage n -> dout, with registers between stages]

SLIDE 13

Reducing Latency

Closely related to reducing critical path delay.
Reducing pipeline registers reduces latency.

[Figure: din -> stage 1 -> stage 2 -> ... -> stage n -> dout, without registers between stages]

SLIDE 14

Timing Optimization

Maximal clock frequency is determined by the longest path delay in any combinational logic block.
Pipelining is one approach.

[Figure: a single block from din to dout, and the same logic split into stage 1, stage 2, ..., stage n with pipeline registers]
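
The link between path delay and clock frequency can be written out as below. The symbols $t_{\text{clk-to-q}}$, $t_{\text{logic,max}}$, and $t_{\text{setup}}$ are the usual register-to-register timing terms, not notation from the slides; clock skew is ignored, and the delay values in the example are made up purely for illustration.

```latex
\begin{align*}
T_{\text{clk}} &\ge t_{\text{clk-to-q}} + t_{\text{logic,max}} + t_{\text{setup}} \\
f_{\max} &= \frac{1}{t_{\text{clk-to-q}} + t_{\text{logic,max}} + t_{\text{setup}}} \\
\text{Unpipelined:}\quad f_{\max} &= \frac{1}{(0.5 + 9.0 + 0.5)\ \text{ns}} = 100\ \text{MHz} \\
\text{Two balanced stages:}\quad f_{\max} &= \frac{1}{(0.5 + 4.5 + 0.5)\ \text{ns}} \approx 182\ \text{MHz}
\end{align*}
```

The speedup is less than 2x because the register overhead is paid once per stage.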

SLIDE 15

Timing Optimization: Spatial Computing

Extract independent operations.
Execute independent operations in parallel.

X = A + B + C + D

Serial chain:

process (clk)
  -- X1 and X2 are process variables; their declarations are not shown on the slide
begin
  if rising_edge(clk) then
    X1 := A + B;
    X2 := X1 + C;
    X  <= X2 + D;
  end if;
end process;

Critical path delay: 3 adders

Parallel form:

process (clk)
  -- X1 and X2 are process variables; their declarations are not shown on the slide
begin
  if rising_edge(clk) then
    X1 := A + B;
    X2 := C + D;
    X  <= X1 + X2;
  end if;
end process;

Critical path delay: 2 adders
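
A self-contained sketch of the parallel form, with explicit types, could look like this. The entity name add4_tree, the generic W, and the decision to widen the result by two bits are assumptions for illustration only.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 4-input adder: X = A + B + C + D, written as a balanced tree.
entity add4_tree is
  generic (W : positive := 8);
  port (
    clk        : in  std_logic;
    a, b, c, d : in  unsigned(W-1 downto 0);
    x          : out unsigned(W+1 downto 0)   -- two extra bits avoid overflow
  );
end entity add4_tree;

architecture rtl of add4_tree is
begin
  process (clk)
    variable x1, x2 : unsigned(W downto 0);
  begin
    if rising_edge(clk) then
      x1 := resize(a, W+1) + resize(b, W+1);     -- independent of x2
      x2 := resize(c, W+1) + resize(d, W+1);     -- independent of x1
      x  <= resize(x1, W+2) + resize(x2, W+2);   -- critical path: 2 adders
    end if;
  end process;
end architecture rtl;
```

Because x1 and x2 do not depend on each other, the two first-level adders operate side by side and only the final adder is in series with them.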

SLIDE 16

Timing Optimization: Avoid Unwanted Priority

process (clk)
begin
  if rising_edge(clk) then
    if c(0) = '1' then
      r(0) <= din;
    elsif c(1) = '1' then
      r(1) <= din;
    elsif c(2) = '1' then
      r(2) <= din;
    elsif c(3) = '1' then
      r(3) <= din;
    end if;
  end if;
end process;

Critical path delay: 3-input AND gate + 4x1 MUX.

SLIDE 17

Timing Optimization: Avoid Unwanted Priority

Critical path delay: 3-input AND gate + 4x1 MUX.

SLIDE 18

Timing Optimization: Avoid Unwanted Priority

process (clk)
begin
  if rising_edge(clk) then
    if c(0) = '1' then
      r(0) <= din;
    end if;
    if c(1) = '1' then
      r(1) <= din;
    end if;
    if c(2) = '1' then
      r(2) <= din;
    end if;
    if c(3) = '1' then
      r(3) <= din;
    end if;
  end if;
end process;

Critical path delay: 2x1 MUX

SLIDE 19

Timing Optimization: Avoid Unwanted Priority

Critical path delay: 2x1 MUX
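
The priority-free form generalizes to any number of registers with a loop. The sketch below is one way to write it; the entity name reg_bank, the generic N, and the one-bit din are assumptions chosen to mirror the slide code.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical register bank with one independent write enable per register.
entity reg_bank is
  generic (N : positive := 4);
  port (
    clk : in  std_logic;
    c   : in  std_logic_vector(N-1 downto 0);  -- per-register write enables
    din : in  std_logic;
    r   : out std_logic_vector(N-1 downto 0)
  );
end entity reg_bank;

architecture rtl of reg_bank is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Each iteration is a separate "if", so no enable has priority
      -- over another; r(i) depends only on c(i) and din.
      for i in 0 to N-1 loop
        if c(i) = '1' then
          r(i) <= din;
        end if;
      end loop;
    end if;
  end process;
end architecture rtl;
```

Note that this is only equivalent to the if/elsif chain when at most one bit of c is set at a time; if several bits can be set together, the two descriptions behave differently.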

SLIDE 20

Timing Optimization: Register Balancing

Maximal clock frequency is determined by the longest path delay in any combinational logic block.

[Figure: two versions of din -> block 1 -> block 2 -> dout]

SLIDE 21

Timing Optimization: Register Balancing

Unbalanced:

process (clk)
begin
  if rising_edge(clk) then
    rA  <= A;
    rB  <= B;
    rC  <= C;
    sum <= rA + rB + rC;
  end if;
end process;

Balanced:

process (clk)
begin
  if rising_edge(clk) then
    sumAB <= A + B;
    rC    <= C;
    sum   <= sumAB + rC;
  end if;
end process;

SLIDE 22

Timing Optimization: Register Balancing

process (clk)
begin
  if rising_edge(clk) then
    rA  <= A;
    rB  <= B;
    rC  <= C;
    sum <= rA + rB + rC;
  end if;
end process;

SLIDE 23

Timing Optimization: Register Balancing

process (clk)
begin
  if rising_edge(clk) then
    sumAB <= A + B;
    rC    <= C;
    sum   <= sumAB + rC;
  end if;
end process;
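
The effect of balancing on the two versions of this process can be summarized by counting adder delays per register-to-register stage. Here $t_{\text{add}}$ is the delay of one two-input adder, a symbol introduced only for this sketch; it assumes each + maps to a separate two-input adder and ignores register and routing overhead.

```latex
\begin{align*}
\text{Unbalanced:}\quad & T_{\text{stage 1}} = 0, \quad T_{\text{stage 2}} = 2\,t_{\text{add}} \;\Rightarrow\; T_{\text{clk}} \ge 2\,t_{\text{add}} \\
\text{Balanced:}\quad & T_{\text{stage 1}} = t_{\text{add}}, \quad T_{\text{stage 2}} = t_{\text{add}} \;\Rightarrow\; T_{\text{clk}} \ge t_{\text{add}}
\end{align*}
```

Both versions have the same latency of two clock cycles; only the distribution of logic between the stages changes.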

SLIDE 24

Optimization for Area

SLIDE 25

Area Optimization: Resource Sharing

Rolling up the pipeline: share common resources at different times, a form of temporal computing.

[Figure: pipeline din -> stage 1 -> stage 2 -> ... -> stage n -> dout, and a single block from din to dout including all logic in stages 1 to n]

SLIDE 26

Area Optimization: Resource Sharing

Use registers to hold inputs.
Develop an FSM to select which inputs to process in each cycle.

X = A + B + C + D

[Figure: three adders combining A, B, C, D into X]

SLIDE 27

Area Optimization: Resource Sharing

Use registers to hold inputs.
Develop an FSM to select which inputs to process in each cycle.

X = A + B + C + D

[Figure: three adders combining A, B, C, D into X, and a single shared adder with control selecting among A, B, C, D to produce X]

A, B, C, D need to hold steady until X is processed.
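
One possible shape for the shared-adder version is sketched below. The entity name add4_shared, the start/done handshake, the state names, and the widths are all assumptions for illustration; the slides only show the block diagram.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical shared-adder version of X = A + B + C + D (one adder, reused).
entity add4_shared is
  generic (W : positive := 8);
  port (
    clk        : in  std_logic;
    start      : in  std_logic;
    a, b, c, d : in  unsigned(W-1 downto 0);  -- must hold steady until done
    x          : out unsigned(W+1 downto 0);
    done       : out std_logic
  );
end entity add4_shared;

architecture rtl of add4_shared is
  type state_t is (idle, add_b, add_c, add_d);
  signal state   : state_t := idle;
  signal acc     : unsigned(W+1 downto 0) := (others => '0');
  signal operand : unsigned(W-1 downto 0);
  signal sum     : unsigned(W+1 downto 0);
begin
  -- Control: the FSM state selects which input feeds the adder.
  with state select operand <=
    b               when add_b,
    c               when add_c,
    d               when add_d,
    (others => '0') when others;

  sum <= acc + operand;   -- the single shared adder

  process (clk)
  begin
    if rising_edge(clk) then
      done <= '0';
      case state is
        when idle =>
          if start = '1' then
            acc   <= resize(a, W+2);   -- load the first operand
            state <= add_b;
          end if;
        when add_b =>
          acc   <= sum;
          state <= add_c;
        when add_c =>
          acc   <= sum;
          state <= add_d;
        when add_d =>
          acc   <= sum;
          state <= idle;
          done  <= '1';                -- x is valid while done is high
      end case;
    end if;
  end process;

  x <= acc;
end architecture rtl;
```

The trade-off is the one the slides describe: one adder instead of three, at the cost of four clock edges per result and a small FSM plus input multiplexer.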

SLIDE 28

Area Optimization: Resource Sharing

Merge duplicate components together

SLIDE 29

Area Optimization: Resource Sharing

Merge duplicate components together: saves an 8-bit counter

SLIDE 30

Impact of Reset on Area (Xilinx Specific)

These coding guidelines:
- Minimize slice logic utilization.
- Maximize circuit performance.
- Utilize device resources such as block RAM components and DSP blocks.

Do not set or reset Registers asynchronously.
- Control set remapping becomes impossible.
- Sequential functionality in device resources such as block RAM components and DSP blocks can be set or reset synchronously only.
- You will be unable to leverage device resources, or they will be configured sub-optimally.
- Use synchronous initialization instead.

Use Asynchronous to Synchronous if your own coding guidelines require Registers to be set or reset asynchronously. This allows you to assess the benefits of using synchronous set/reset.

Do not describe Flip-Flops with both a set and a reset.
- No Flip-Flop primitives feature both a set and a reset, whether synchronous or asynchronous.
- If not rejected by the software, Flip-Flop primitives featuring both a set and a reset may adversely affect area and performance.

Do not describe Flip-Flops with both an asynchronous reset and an asynchronous set. XST rejects such Flip-Flops rather than retargeting them to a costly equivalent model.

Avoid operational set/reset logic whenever possible. There may be other, less expensive, ways to achieve the desired effect, such as taking advantage of the circuit global reset by defining an initial contents.

Always describe the clock enable, set, and reset control inputs of Flip-Flop primitives as active-High. If they are described as active-Low, the resulting inverter logic will penalize circuit performance.

Pack I/O Registers Into IOBs
Register Duplication
Equivalent Register Removal
Register Balancing
Asynchronous to Synchronous

For other ways to control implementation of Flip-Flops and Registers, see Mapping Logic to LUTs.

SLIDE 31

Reset or No Reset?

process (clk)
begin
  if rising_edge(clk) then
    if rst = '0' then
      sr <= (others => '0');
    else
      sr <= sr(14 downto 0) & din;   -- shift left, din enters at bit 0
    end if;
  end if;
end process;

SLIDE 32

Reset or No Reset?

process (clk)
begin
  if rising_edge(clk) then
    sr <= sr(14 downto 0) & din;   -- shift left, din enters at bit 0
  end if;
end process;

SLIDE 33

Reset or No Reset?

Table 2.1 Resource Utilization for Shift Register Implementations

Implementation      Slices   Slice flip-flops
Resets defined      9        16
No resets defined   1        1
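
As a sketch of the no-reset style, the shift register below gets its power-up contents from the signal's initial value instead of from reset logic. The entity name shift16 and the dout port are assumptions for illustration. Whether the tool honors the initial value and maps the design to a compact shift-register primitive depends on the device and the synthesis tool, so this should be checked against the tool's documentation.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical 16-bit shift register with no reset: the power-up contents
-- come from the signal's initial value instead of reset logic.
entity shift16 is
  port (
    clk  : in  std_logic;
    din  : in  std_logic;
    dout : out std_logic
  );
end entity shift16;

architecture rtl of shift16 is
  signal sr : std_logic_vector(15 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      sr <= sr(14 downto 0) & din;   -- shift left, din enters at bit 0
    end if;
  end process;

  dout <= sr(15);                    -- oldest bit leaves at bit 15
end architecture rtl;
```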

SLIDE 34

Resetting Block RAM

Block RAM only supports synchronous reset.
Suppose that Mem is a 256x16b RAM.
Implementations of Mem with synchronous and asynchronous reset on Xilinx Virtex-4:

Implementation       Slices   Slice flip-flops   4-input LUTs   BRAMs
Asynchronous reset   3415     4112               2388           -
Synchronous reset    -        -                  -              1

The VHDL model should match the features offered by FPGA building blocks in order for those blocks to be instantiated in the implementation.
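
A memory description that stays within what block RAM offers might look like the sketch below: synchronous write, synchronous read, and contents initialized in the declaration. The entity name mem256x16 and the port names are assumptions; this follows a commonly used inference style, but whether a given tool maps it to block RAM should be confirmed in that tool's synthesis guide.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 256x16 single-port RAM with synchronous write and read.
entity mem256x16 is
  port (
    clk  : in  std_logic;
    we   : in  std_logic;
    addr : in  unsigned(7 downto 0);
    din  : in  std_logic_vector(15 downto 0);
    dout : out std_logic_vector(15 downto 0)
  );
end entity mem256x16;

architecture rtl of mem256x16 is
  type ram_t is array (0 to 255) of std_logic_vector(15 downto 0);
  -- Contents are initialized in the declaration, not by reset logic.
  signal mem : ram_t := (others => (others => '0'));
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        mem(to_integer(addr)) <= din;
      end if;
      dout <= mem(to_integer(addr));   -- registered (synchronous) read
    end if;
  end process;
end architecture rtl;
```

Clearing every word of the array from an asynchronous reset, by contrast, has no block RAM equivalent and pushes the storage into flip-flops and LUTs, which is the effect the table above reports.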

SLIDE 35

Utilizing Set/Reset FF Pins

Figure 2.11: Simple synchronous logic with OR gate.
Figure 2.12: OR gate implemented with set pin.
Figure 2.13: Simple synchronous logic with AND gate.
Figure 2.14: AND gate implemented with CLR pin.

SLIDE 36

Utilizing Set/Reset FF Pins - Example

process (clk, reset)
begin
  if reset = '0' then
    Dat <= '0';
  elsif rising_edge(clk) then
    Dat <= iDat1 or iDat2;
  end if;
end process;

Figure 2.15: Simple asynchronous reset.

SLIDE 37

Utilizing Set/Reset FF Pins - Example

process (clk)
begin
  if rising_edge(clk) then
    Dat <= iDat1 or iDat2;
  end if;
end process;

Figure 2.16: Optimization without reset.

SLIDE 38

Optimization for Power

SLIDE 39

Power Reduction Techniques

In general, FPGAs are power hungry.
Power consumption is determined by

$P = V^2 \cdot C \cdot f$

where V is voltage, C is load capacitance, and f is switching frequency.

In FPGAs, V is usually fixed; C depends on the number of switching gates and the length of the wires connecting all gates.

To reduce power,
turn off gates not actively used,
have multiple clock domains,
reduce f.
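
Because power is linear in $f$ in this model, halving the switching frequency at fixed $V$ and $C$ halves the power. The numeric values below are made up for illustration only and do not describe any particular device.

```latex
\begin{align*}
P_1 &= V^2 \cdot C \cdot f = (1.0\ \text{V})^2 \cdot (1\ \text{nF}) \cdot (100\ \text{MHz}) = 0.1\ \text{W} \\
P_2 &= V^2 \cdot C \cdot \frac{f}{2} = (1.0\ \text{V})^2 \cdot (1\ \text{nF}) \cdot (50\ \text{MHz}) = 0.05\ \text{W} \\
\frac{P_2}{P_1} &= \frac{1}{2}
\end{align*}
```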

SLIDE 40

Dual-Edge Triggered FFs

A design that is active on both clock edges can reduce clock frequency by 50%.

[Figure: two pipelines from din to dout, labeled Example 1 and Example 2; legend: positively triggered, negatively triggered]

SLIDE 41

Dual-Edge Triggered FFs - Example

-- stages 0 and 2 update on the rising edge
process (clk)
begin
  if rising_edge(clk) then
    reg(0) <= din;
    reg(2) <= reg(1);
  end if;
end process;

-- stages 1 and 3 update on the falling edge
process (clk)
begin
  if falling_edge(clk) then
    reg(1) <= reg(0);
    reg(3) <= reg(2);
  end if;
end process;

Synthesizable using Vivado 2016.2