Resource Optimal Design of Large Multipliers for FPGAs

Martin Kumm*, Johannes Kappauf*, Matei Istoan† and Peter Zipf* (*University of Kassel, Germany; †University Lyon, France), 24th IEEE Symposium on Computer Arithmetic


slide-1
SLIDE 1

Resource Optimal Design of Large Multipliers for FPGAs

Martin Kumm*, Johannes Kappauf*, Matei Istoan† and Peter Zipf*

*University of Kassel, Germany
†University Lyon, France

24th IEEE Symposium on Computer Arithmetic, 25.07.2017

slide-2
SLIDE 2

Motivation

  • Multiplication is a fundamental arithmetic operation
  • Embedded multipliers available in the FPGA fabric are limited in size (& quantity)
  • Larger multipliers can be decomposed into smaller multipliers realized by DSP blocks or logic resources
  • Question of interest: How to do the decomposition in a (resource) optimal way?

slide-3
SLIDE 3

Outline

  • 1. How to formulate the problem as a tiling problem?
  • 2. What do the tiles look like?
  • 3. How to solve the problem?

slide-5
SLIDE 5

Multiplier Decomposition

A large multiplier can be decomposed into several smaller multipliers:

\[
A \times B = (A_H 2^{n} + A_L)(B_H 2^{m} + B_L)
           = \underbrace{A_H B_H}_{M_4} 2^{n+m}
           + \underbrace{A_H B_L}_{M_3} 2^{n}
           + \underbrace{A_L B_H}_{M_2} 2^{m}
           + \underbrace{A_L B_L}_{M_1}
\]

slide-6
SLIDE 6

Multiplier Tiling

The multiplier can be represented graphically as an X×Y board that is tiled by smaller multipliers, represented as rectangles [de Dinechin 2009]. The required left shift of each tile is obtained from the sum of its tile coordinates (x, y).

[Figure: 32×32 board tiled with n = m = 16 bit multipliers M1 to M4; x increases to the left, y increases upward; axis ticks at 16 and 32]

\[
A \times B = (A_H 2^{16} + A_L)(B_H 2^{16} + B_L)
           = \underbrace{A_H B_H}_{M_4} 2^{32}
           + \underbrace{A_H B_L}_{M_3} 2^{16}
           + \underbrace{A_L B_H}_{M_2} 2^{16}
           + \underbrace{A_L B_L}_{M_1}
\]
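The shift rule can be checked numerically. The following is a minimal sketch, assuming unsigned operands and assuming that x indexes the bits of A and y the bits of B; the function names `split` and `tiled_product` are made up for this illustration and do not come from the slides.

```python
def split(value, low_bits):
    """Split an unsigned integer into (high, low) parts at bit position low_bits."""
    return value >> low_bits, value & ((1 << low_bits) - 1)


def tiled_product(a, b, n=16, m=16):
    """Multiply a and b from four smaller products, one per tile."""
    a_hi, a_lo = split(a, n)
    b_hi, b_lo = split(b, m)
    # Each entry is (x, y, partial product); (x, y) is the tile origin on the board.
    tiles = [
        (0, 0, a_lo * b_lo),  # M1
        (0, m, a_lo * b_hi),  # M2
        (n, 0, a_hi * b_lo),  # M3
        (n, m, a_hi * b_hi),  # M4
    ]
    # The left shift of every tile is the sum of its coordinates.
    return sum(product << (x + y) for x, y, product in tiles)


if __name__ == "__main__":
    a, b = 0xDEADBEEF, 0x12345678
    print(tiled_product(a, b) == a * b)
```

The same summation works for any number of tiles, as long as each partial product is built from the operand bits that its tile covers.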

slide-7
SLIDE 7

Multiplier Tiling

A valid multiplier tiling is defined as follows:

  • The board must be completely covered, without overlaps between the tiles
  • Overlaps with the border of the board are allowed

[Figure: tiling of a 53×53 multiplier [de Dinechin 2009]; x increases to the left, y increases upward; axis ticks at 17, 24, 34, 41 and 58, board edge at 53]
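The two validity rules can be written as a small check. This is only a sketch under the assumption that every tile is an axis-aligned rectangle given as a hypothetical tuple `(x0, y0, width, height)`; the slides do not prescribe a data format.

```python
def is_valid_tiling(X, Y, tiles):
    """Return True if every board cell is covered by exactly one tile.

    Tiles may extend beyond the board border; the overhanging part is ignored.
    """
    count = [[0] * Y for _ in range(X)]
    for x0, y0, width, height in tiles:
        # Clip the rectangle to the board before counting coverage.
        for x in range(max(x0, 0), min(x0 + width, X)):
            for y in range(max(y0, 0), min(y0 + height, Y)):
                count[x][y] += 1
    return all(c == 1 for column in count for c in column)


if __name__ == "__main__":
    # Four 16x16 tiles on a 32x32 board.
    quad = [(0, 0, 16, 16), (16, 0, 16, 16), (0, 16, 16, 16), (16, 16, 16, 16)]
    print(is_valid_tiling(32, 32, quad))
    # Two 24x17 tiles on a 24x24 board; the second one overhangs the border.
    print(is_valid_tiling(24, 24, [(0, 0, 24, 17), (0, 17, 24, 17)]))
```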

slide-8
SLIDE 8

Outline

  • 1. How to formulate the problem as a tiling problem?
  • 2. What do the tiles look like?
  • 3. How to solve the problem?

slide-9
SLIDE 9

Logic-based Tiles

Several LUT-based multipliers can be used:

  • 3×3 mult., which can be mapped to six 6-input LUTs (LUT6) [Brunie 2013]
  • 2×3 mult., which can be mapped to three LUT6 (realizing five LUT5) [Kumm 2015]
  • 1×2 mult., which uses a single LUT6 (realizing two LUT5)

In addition, LUT/carry-chain multipliers are used:

  • Single row of an FPGA-optimized Baugh-Wooley multiplier [Parandeh-Afshar 2011]

slide-10
SLIDE 10

Shapes of the Logic-based Tiles

[Figure: shapes of the logic-based tiles: (a) 3 × 3, (b) 3 × 2 / 2 × 3, (c) 2 × 1 / 1 × 2, (d) k × 2, (e) 2 × k]

slide-11
SLIDE 11

LUT Requirements in the Compressor Tree

[Figure: #LUTs versus input bits (#bits) for multi-input addition and the x3 operation, together with the linear fit 0.65 × #bits]

slide-12
SLIDE 12

Logic-based Multipliers

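The cost of a tile is composed of the LUTs of the multiplier itself (#LUT_m) and the compressor-tree LUTs needed for its output word of w_s bits:

\[
\mathrm{cost}_s = \#\mathrm{LUT}_m + 0.65\, w_s
\]

To get the "quality" of a multiplier, an efficiency metric is defined as the benefit/cost ratio:

\[
E_s = \frac{\mathrm{area}_s}{\mathrm{cost}_s}
\]

| Shape | Tile area | Word size (w_s) | #LUT_m | Total cost (cost_s) | Efficiency (E_s) |
|-------|-----------|-----------------|--------|---------------------|------------------|
| 1 × 1 | 1         | 1               | 1      | 1.65                | 0.625            |
| 1 × 2 | 2         | 2               | 1      | 2.3                 | 0.87             |
| 2 × 3 | 6         | 5               | 3      | 6.25                | 0.96             |
| 3 × 3 | 9         | 6               | 6      | 9.9                 | 0.91             |
| 2 × k | 2k        | k + 2           | k + 1  | 1.65k + 2.3         | 2k / (1.65k + 2.3) (= 1.21 for k → ∞) |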
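The cost and efficiency figures can be recomputed from the two formulas. The sketch below uses the 0.65 LUTs per compressor-tree bit from the slides; the function names and the dictionary layout are assumptions made for this example.

```python
LUTS_PER_COMPRESSOR_BIT = 0.65  # slope of the compressor-tree fit from the slides


def cost(luts_mult, word_size):
    """Total LUT cost: multiplier LUTs plus compressor-tree share of the output word."""
    return luts_mult + LUTS_PER_COMPRESSOR_BIT * word_size


def efficiency(area, luts_mult, word_size):
    """Benefit/cost ratio: covered board area per LUT."""
    return area / cost(luts_mult, word_size)


def efficiency_2xk(k):
    """Efficiency of the 2 x k tile: area 2k, k + 1 LUTs, word size k + 2."""
    return efficiency(2 * k, k + 1, k + 2)


if __name__ == "__main__":
    # name: (tile area, #LUT_m, word size)
    tiles = {"1x2": (2, 1, 2), "2x3": (6, 3, 5), "3x3": (9, 6, 6)}
    for name, (area, luts, ws) in tiles.items():
        print(name, round(cost(luts, ws), 2), round(efficiency(area, luts, ws), 2))
    for k in (4, 16, 64):
        print("2x%d" % k, round(efficiency_2xk(k), 3))
```

For growing k the 2 × k efficiency approaches 2/1.65, which is the 1.21 limit quoted on the slide. Applied to the 1 × 1 tile, the same formula gives about 0.61 rather than the printed 0.625.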

slide-13
SLIDE 13

DSP-based Tiles

  • Xilinx DSP blocks contain 18×25 bit (signed) / 17×24 bit (unsigned) multipliers
  • They contain additional post-adders
  • These can be used to add a multiplier result that was already obtained
  • This reduces the size of the compressor tree
  • Graphically, this can be represented as a so-called super-tile [Banescu 2010]

slide-14
SLIDE 14

Super-Tiles of Xilinx FPGAs

[Figure: twelve super-tile shapes, panels (a) to (l)]

slide-15
SLIDE 15

Outline

  • 1. How to formulate the problem as a tiling problem?
  • 2. What do the tiles look like?
  • 3. How to solve the problem?

slide-20
SLIDE 20

Formalizing the Problem

| Constant/Variable | Meaning |
|-------------------|---------|
| x, y ∈ N_0 | Coordinates |
| X, Y ∈ N_0 | Outer bounds of the multiplier to be designed |
| M_{x,y} ∈ {0, 1} | Shape of the multiplier to be designed; true when (x, y) is within the area of the multiplier |
| 𝒮 | Set of small multipliers with different shape |
| S = \|𝒮\| | Number of available smaller multipliers |
| s ∈ {0, 1, ..., S − 1} | Shape index of smaller multiplier |
| m^s_{x,y} ∈ {0, 1} | Boolean constant describing each small multiplier; true when (x, y) is within the area of the multiplier of shape s |
| cost_s ∈ R | Cost of a small multiplier of shape s |
| d^s_{x,y} ∈ {0, 1} | Decision variable, which is true when a multiplier of shape s is placed at coordinate (x, y) |

slide-21
SLIDE 21

Specification of a Tile

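Setting

\[
m^0_{0,0} = m^0_{0,1} = m^0_{0,2} = m^0_{1,0} = m^0_{1,1} = 1
\]

with all other m's zero would define the following tile:

[Figure: tile of shape 0 covering y = 0 to 2 in column x = 0 and y = 0 to 1 in column x = 1; x increases to the left, y increases upward]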

slide-22
SLIDE 22

ILP Formulation

The multiplier tiling problem can be formulated as an integer linear program (ILP) as follows:

\[
\text{minimize} \quad \sum_{s=0}^{S-1} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} \mathrm{cost}_s \, d^s_{x,y}
\]

\[
\text{subject to} \quad \sum_{s=0}^{S-1} \sum_{x'=0}^{X-1} \sum_{y'=0}^{Y-1} m^s_{x-x',\,y-y'} \, d^s_{x',y'} = 1
\quad \text{for } 0 \le x \le X,\ 0 \le y \le Y \text{ with } M_{x,y} = 1
\]

The ILP problem can be solved using standard solvers.
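For very small boards, the objective and the exact-cover constraint can be explored without an ILP solver. The following sketch replaces the solver with an exhaustive search, which is not the method used in the talk. It assumes a fully rectangular board (M_{x,y} = 1 everywhere), and all names and the tile encoding as sets of (dx, dy) offsets are hypothetical.

```python
from itertools import product


def covered_cells(shape, x0, y0, X, Y):
    """Board cells covered when a tile is placed at (x0, y0); overhang is clipped."""
    return frozenset((x0 + dx, y0 + dy) for dx, dy in shape
                     if x0 + dx < X and y0 + dy < Y)


def solve_tiling(X, Y, shapes, costs):
    """Exhaustive search for the cheapest exact cover of an X x Y board.

    shapes[s] is the set of offsets with m^s = 1, costs[s] is cost_s.
    Returns (best_cost, [(s, x0, y0), ...]) where each entry is a tile with d = 1.
    """
    board = frozenset(product(range(X), range(Y)))
    by_cell = {}  # cell -> placements (s, x0, y0, cells) that cover this cell
    for s, shape in enumerate(shapes):
        for x0, y0 in product(range(X), range(Y)):
            cells = covered_cells(shape, x0, y0, X, Y)
            for cell in cells:
                by_cell.setdefault(cell, []).append((s, x0, y0, cells))

    best = {"cost": float("inf"), "tiles": None}

    def search(free, cost_so_far, chosen):
        if cost_so_far >= best["cost"]:
            return  # cannot improve on the best solution found so far
        if not free:
            best["cost"], best["tiles"] = cost_so_far, list(chosen)
            return
        cell = min(free)  # this cell must be covered by exactly one tile
        for s, x0, y0, cells in by_cell.get(cell, []):
            if cells <= free:  # no overlap with tiles placed earlier
                chosen.append((s, x0, y0))
                search(free - cells, cost_so_far + costs[s], chosen)
                chosen.pop()

    search(board, 0.0, [])
    return best["cost"], best["tiles"]


if __name__ == "__main__":
    shapes = [{(0, 0)}, {(0, 0), (0, 1)}]  # 1x1 tile and 1x2 tile
    costs = [1.65, 2.3]
    print(solve_tiling(2, 2, shapes, costs))
```

The search time grows exponentially with the board size, so this is only useful for checking a formulation on toy instances before handing it to a real solver.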

slide-23
SLIDE 23

ILP Formulation

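Graphical representation of the left-hand side of the ILP constraint:

[Figure: tile of shape 0 placed at (x', y') = (1, 2) on a board with axis ticks 1 to 5; x increases to the left, y increases upward. The cells are annotated with the following products:]

\[
m^0_{0,3} d^0_{1,2} = 0, \quad
m^0_{0,2} d^0_{1,2} = 1, \quad
m^0_{0,1} d^0_{1,2} = 1, \quad
m^0_{0,0} d^0_{1,2} = 1, \quad
m^0_{1,1} d^0_{1,2} = 1, \quad
m^0_{1,0} d^0_{1,2} = 1
\]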
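The annotated products can be reproduced by evaluating the left-hand side of the constraint cell by cell. This sketch assumes the five-cell tile from the tile specification slide as shape 0 and a single placement with d^0_{1,2} = 1; the helper names are illustrative only.

```python
# Offsets (dx, dy) for which m^0 is 1, taken from the tile specification slide.
SHAPE_0 = {(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)}


def m(shape, dx, dy):
    """Boolean tile constant m_{dx,dy}: 1 inside the tile, 0 elsewhere."""
    return 1 if (dx, dy) in shape else 0


def constraint_lhs(x, y, placements, shapes):
    """Sum of m^s_{x-x', y-y'} over all placed tiles (s, x', y'), i.e. those with d = 1."""
    return sum(m(shapes[s], x - xp, y - yp) for s, xp, yp in placements)


if __name__ == "__main__":
    shapes = [SHAPE_0]
    placements = [(0, 1, 2)]  # shape 0 placed at (x', y') = (1, 2)
    for y in range(5, 1, -1):
        print(y, [constraint_lhs(x, y, placements, shapes) for x in (2, 1)])
```

A tiling is feasible when this sum equals 1 for every cell of the board, which is what the ILP constraint demands.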

slide-24
SLIDE 24

Additional DSP Constraint

  • The cost of DSP blocks is hard to compare with the cost of LUTs
  • It is better to constrain the DSP count for a given application
  • A single additional constraint can be used to specify the number of DSPs (#DSP):

\[
\sum_{s=0}^{S-1} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} D_s \, d^s_{x,y} = \#\mathrm{DSP}
\]

where D_s specifies the number of DSPs in multiplier shape s.

slide-25
SLIDE 25

Results

Four important cases were considered:

  • 24×24 (single precision)
  • 32×32
  • 53×53 (double precision)
  • 64×64

Each was evaluated for varying DSP counts, up to a DSP-only implementation.

slide-26
SLIDE 26

Resulting Tilings 24/32 Bit

[Figure: resulting tilings for 24 × 24 with 0, 1 and 2 DSPs, and for 32 × 32 with 0, 1, 2, 3 and 4 DSPs]

slide-27
SLIDE 27

Resulting Tilings 24/32 Bit

Same tilings as on slide 26, with the annotation: Baugh-Wooley multiplier [Parandeh-Afshar 2011]

slide-28
SLIDE 28

Resulting Tilings 24/32 Bit

Same tilings as on slide 26, with the annotation: 2×k and 1×2 tiles perform best for LUT-based multiplication

slide-29
SLIDE 29

Resulting Tilings 24/32 Bit

Same tilings as on slide 26, with the annotation: efficient solution utilizing two super-tiles

slide-30
SLIDE 30

Resulting Tilings 53 Bit

[Figure: resulting tilings for 53 × 53 with 5, 6, 7, 8 and 9 DSPs]

slide-31
SLIDE 31

Resulting Tilings 53 Bit

Same tilings as on slide 30, with the annotations:

  • Pinwheel inside of a pinwheel
  • The logic multiplier consumes 1/4 of the area compared to the previous hand-optimized design [de Dinechin 2009]

slide-32
SLIDE 32

Resulting Tilings 64 Bit

[Figure: resulting tilings for 64 × 64 with 7, 8, 9, 10 and 11 DSPs]

slide-33
SLIDE 33

Optimization & Synthesis Results

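| Mult.   | Method         | #DSP | Area (geom.) logic mult. | Slices | Slice red. | fclk [MHz] |
|---------|----------------|------|--------------------------|--------|------------|------------|
| 24 × 24 | [Brunie 2013]  | 1    | 216                      | 65     |            | 212.4      |
|         | proposed       | 1    | 168                      | 58     | 10.8%      | 287.4      |
|         | [Brunie 2013]  | 2    | 0                        | 0      |            | 418.9      |
|         | proposed       | 2    | 0                        | 0      | 0.0%       | 418.9      |
| 32 × 32 | [Banescu 2010] | 0    | 1024                     | 339    |            | 275.8      |
|         | proposed       | 0    | 1024                     | 276    | 18.6%      | 304.4      |
|         | [Brunie 2013]  | 1    | 648                      | 205    |            | 192.8      |
|         | [Banescu 2010] | 1    | 616                      | 234    |            | 352.6      |
|         | proposed       | 1    | 616                      | 180    | 12.2%      | 302.5      |
|         | [Brunie 2013]  | 2    | 288                      | 94     |            | 270.1      |
|         | proposed       | 2    | 256                      | 82     | 12.8%      | 338.0      |
|         | [Brunie 2013]  | 3    | 135                      | 75     |            | 194.0      |
|         | [Banescu 2010] | 3    | 176                      | 75     |            | 426.6      |
|         | proposed       | 3    | 64                       | 44     | 41.3%      | 314.5      |
|         | [Brunie 2013]  | 4    | 0                        | 17     |            | 314.7      |
|         | [Banescu 2010] | 4    | 40                       | 38     |            | 379.4      |
|         | proposed       | 4    | 0                        | 13     | 23.5%      | 181.7      |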
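The slice reduction column is consistent with a comparison against the best previous design at the same DSP count. That reading is inferred from the numbers and is not stated on the slide, so the sketch below should be taken as an assumption; the function name is made up for this example.

```python
def slice_reduction(previous_slices, proposed_slices):
    """Percentage of slices saved relative to the smallest previous design."""
    best_previous = min(previous_slices)
    return 100.0 * (best_previous - proposed_slices) / best_previous


if __name__ == "__main__":
    # 24 x 24, 1 DSP: [Brunie 2013] uses 65 slices, proposed uses 58.
    print(round(slice_reduction([65], 58), 1))
    # 32 x 32, 1 DSP: previous designs use 205 and 234 slices, proposed uses 180.
    print(round(slice_reduction([205, 234], 180), 1))
    # 32 x 32, 3 DSPs: previous designs use 75 slices each, proposed uses 44.
    print(round(slice_reduction([75, 75], 44), 1))
```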

slide-34
SLIDE 34

Optimization & Synthesis Results

Same table as on slide 33, with the annotation: fewer slices because of a better logic-based multiplier/compressor tree

slide-35
SLIDE 35

Optimization & Synthesis Results

Same table as on slide 33, with the annotation: fewer slices because of better super-tile usage

slide-36
SLIDE 36

Optimization & Synthesis Results

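| Mult.   | Method         | #DSP | Area (geom.) logic mult. | Slices | Slice red. | fclk [MHz] |
|---------|----------------|------|--------------------------|--------|------------|------------|
| 53 × 53 | [Banescu 2010] | 5    | 1029                     | 350    |            | 298.2      |
|         | proposed       | 5    | 769                      | 295    | 15.7%      | 313.2      |
|         | [Brunie 2013]  | 6    | 468                      | 196    |            | 214.1      |
|         | [Banescu 2010] | 6    | 721                      | 220    |            | 298.2      |
|         | proposed       | 6    | 361                      | 180    | 8.2%       | 263.2      |
|         | [Banescu 2010] | 7    | 313                      | 223    |            | 378.9      |
|         | proposed       | 7    | 193                      | 137    | 38.6%      | 290.2      |
|         | [Banescu 2010] | 8    | 265                      | 145    |            | 356.4      |
|         | proposed       | 8    | 25                       | 81     | 44.1%      | 272.7      |
|         | [Brunie 2013]  | 9    | 162                      | 125    |            | 195.6      |
|         | [Banescu 2010] | 9    | 215                      | 174    |            | 255.8      |
|         | proposed       | 9    | 0                        | 72     | 42.4%      | 348.8      |
| 64 × 64 | [Banescu 2010] | 7    | 1504                     | 614    |            | 245.0      |
|         | proposed       | 7    | 1191                     | 430    | 30.0%      | 270.5      |
|         | [Brunie 2013]  | 8    | 1188                     | 420    |            | 194.2      |
|         | [Banescu 2010] | 8    | 1096                     | 449    |            | 280.7      |
|         | proposed       | 8    | 652                      | 348    | 17.1%      | 261.2      |
|         | [Banescu 2010] | 9    | 864                      | 413    |            | 262.9      |
|         | proposed       | 9    | 475                      | 217    | 47.5%      | 249.6      |
|         | [Banescu 2010] | 10   | 592                      | 341    |            | 250.7      |
|         | proposed       | 10   | 187                      | 179    | 47.5%      | 267.7      |
|         | [Brunie 2013]  | 11   | 270                      | 196    |            | 162.8      |
|         | [Banescu 2010] | 11   | 592                      | 268    |            | 225.3      |
|         | proposed       | 11   | 0                        | 108    | 44.9%      | 265.4      |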

slide-37
SLIDE 37

Optimization & Synthesis Results

Same table as on slide 36, with the annotation: DSP-only solutions with fewer DSPs found

slide-38
SLIDE 38

Conclusion

  • A method was proposed to optimally solve the multiplier tiling problem using ILP
  • The method allows trading DSP resources against logic resources
  • The problem is tractable for practical multiplier sizes
  • Combined with carefully selected logic-based multipliers and DSP super-tiles, significant resource reductions could be achieved

slide-39
SLIDE 39

Thank You!

References

[de Dinechin 2009] Large Multipliers with Fewer DSP Blocks, FPL 2009
[Kumm 2015] An Efficient Softcore Multiplier Architecture for Xilinx FPGAs, ARITH 2015
[Parandeh-Afshar 2011] Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs, FPL 2011
[Banescu 2010] Multipliers for Floating-Point Double Precision and Beyond on FPGAs, SIGARCH 2010
[Brunie 2013] Arithmetic Core Generation Using Bit Heaps, FPL 2013

slide-41
SLIDE 41

Resulting LUT Cost

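24 × 24 (single precision floating point)

| #DSP     | 2    | 1      | 0      |
|----------|------|--------|--------|
| LUT cost | 31.2 | 179.95 | 502.8  |
| ∆LUT     | –    | 148.75 | 322.85 |
| CPU [s]  | 22.7 | 129    | 8      |

32 × 32 (unsigned)

| #DSP     | 4     | 3     | 2     | 1      | 0      |
|----------|-------|-------|-------|--------|--------|
| LUT cost | 57.85 | 119.2 | 256.8 | 567.95 | 881.6  |
| ∆LUT     | –     | 61.35 | 137.6 | 311.15 | 313.65 |
| CPU [s]  | 146   | 320   | 187   | 382    | 19     |

53 × 53 (double precision floating point)

| #DSP     | 9     | 8      | 7      | 6     | 5     |
|----------|-------|--------|--------|-------|-------|
| LUT cost | 144.3 | 164.45 | 307    | 450.5 | 759.7 |
| ∆LUT     | –     | 20.15  | 142.55 | 143.5 | 309.2 |
| CPU [s]  | 1433  | 701    | 4331   | 2112  | 27215 |

64 × 64 (unsigned)

| #DSP     | 11     | 10     | 9     | 8      | 7       |
|----------|--------|--------|-------|--------|---------|
| LUT cost | 198.25 | 354.8  | 570.7 | 862.5  | 1192.35 |
| ∆LUT     | –      | 156.55 | 215.9 | 291.15 | 329.9   |
| CPU [s]  | 43031  | 81149  | 21382 | 54001  | TO      |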
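The ∆LUT row reads as the LUT cost added each time one DSP fewer is used, that is, the difference between neighbouring LUT cost entries. This interpretation is inferred from the numbers; the sketch below checks it on the 32 × 32 row, with a hypothetical helper name.

```python
def lut_deltas(lut_costs):
    """Differences between consecutive LUT costs, ordered from most to fewest DSPs."""
    return [round(after - before, 2) for before, after in zip(lut_costs, lut_costs[1:])]


if __name__ == "__main__":
    # 32 x 32 (unsigned), LUT cost for 4, 3, 2, 1 and 0 DSPs.
    costs_32 = [57.85, 119.2, 256.8, 567.95, 881.6]
    print(lut_deltas(costs_32))
```

Read this way, each entry shows how many LUTs one additional DSP block saves at that point of the trade-off.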

slide-42
SLIDE 42

Efficiency Comparison

[Figure: efficiency E (0.6 to 1.2) versus tile area (10 to 70) for the 2 × k, 1 × 1, 1 × 2, 2 × 3 and 3 × 3 tiles]

slide-43
SLIDE 43

Problem Shapes Considered

[Figure: (a) multi-input addition of 10 numbers with 10 bits each; (b) x3 operation for an input word size of 6 bits]

slide-44
SLIDE 44

DSP-based Tiles

[Figure: block diagram of the Xilinx DSP48E1 block, showing the 25 × 18 multiplier, the dual A, D and pre-adder stage, the dual B register, the 17-bit shift paths and the 48-bit P output]

slide-45
SLIDE 45

Previous Work

  • A Baugh-Wooley-like multiplier that can be efficiently mapped to FPGAs was proposed in [Parandeh-Afshar 2011]
  • Two partial products are generated and added using the carry chain
  • A compression tree for the already reduced partial products is still necessary

[Figure: four LUTs feeding the carry logic]

slide-46
SLIDE 46

Previous Work

Same content as on slide 45, with the additional annotation: full adder

slide-47
SLIDE 47

Literature

[Walters 2014] Partial-Product Generation and Addition for Multiplication in FPGAs with 6-Input LUTs, ASILOMAR 2014
[Kumm 2015] An Efficient Softcore Multiplier Architecture for Xilinx FPGAs, ARITH 2015
[Walters 2016] Array Multipliers for High Throughput in Xilinx FPGAs with 6-Input LUTs, Computers, MDPI
[Parandeh-Afshar 2011] Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs, FPL 2011
[de Dinechin 2009] Large Multipliers with Fewer DSP Blocks, FPL 2009
[Banescu 2010] Multipliers for Floating-Point Double Precision and Beyond on FPGAs, SIGARCH 2010
[Brunie 2013] Arithmetic Core Generation Using Bit Heaps, FPL 2013