Peak Performance Model for a Custom Precision Floating-Point Dot Product on FPGAs


SLIDE 1

Outline Motivation Architecture Experiments Dot-product performance model Conclusions Future work

Peak Performance Model for a Custom Precision Floating-Point Dot Product on FPGAs

UCHPC - UnConventional High Performance Computing Workshop, Euro-Par 2010

Manfred Mücke, Bernd Lesser, Wilfried N. Gansterer
{manfred.muecke|bernd.lesser|wilfried.gansterer}@univie.ac.at
Research Lab Computational Technologies and Applications, University of Vienna
http://rlcta.univie.ac.at

August 30th, 2010

Bernd Lesser RLCTA Peak Performance Model for a Custom Precision Floating-Point Dot Product on FPGAs

SLIDE 2


1 Motivation
2 Architecture
3 Experiments
4 Dot-product performance model
5 Conclusions
6 Future work

SLIDE 3

Motivation

Accelerating scientific applications, for instance:
- accelerating linear solvers
- accelerating matrix operations
- ...

A central part of many scientific computing applications: the dot-product operation.

Our work deals with the performance analysis of custom-precision dot-product architectures on FPGAs.

SLIDE 4

Why on FPGAs?

There are applications that do not require double-precision data types:
- Keep the double-precision range (11-bit exponent)
- Reduce the mantissa (mantissa bit width ≤ 52)

On CPUs / GPUs: a speedup can only be achieved if the mantissa bit width is
- 23 bits (single precision) or
- 10 bits (half precision)

On FPGAs: FPGAs are the only hardware platform that can benefit from bit-width reduction at a fine-grained level.

Lower precision translates directly into increased parallelism → throughput → SPEEDUP

Larger FPGAs translate into increased parallelism → throughput → SPEEDUP
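To get a feeling for what "keep the 11-bit exponent, reduce the mantissa" means numerically, the following sketch emulates a p-bit mantissa in software by zeroing the low mantissa bits of an IEEE 754 double. The function name `truncate_mantissa` is hypothetical, and truncation (rather than round-to-nearest) is an assumption made for brevity; the slides do not specify a rounding mode. It is intended for finite inputs only.

```python
import struct

def truncate_mantissa(x: float, p: int) -> float:
    """Keep sign and 11-bit exponent of a double, keep only the top p mantissa bits.

    Assumption: plain truncation of the discarded bits (no rounding).
    """
    if not 0 <= p <= 52:
        raise ValueError("mantissa bit width p must be in 0..52")
    # Reinterpret the double as a 64-bit unsigned integer.
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    # Clear the (52 - p) least significant mantissa bits.
    mask = (0xFFFFFFFFFFFFFFFF << (52 - p)) & 0xFFFFFFFFFFFFFFFF
    return struct.unpack("<d", struct.pack("<Q", bits & mask))[0]

if __name__ == "__main__":
    for p in (52, 23, 10, 4):
        print(p, truncate_mantissa(0.1, p))
```

Running the loop shows how the representable value of 0.1 drifts as p shrinks, which is the accuracy cost that an application has to tolerate in exchange for the extra parallelism.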

SLIDE 5

Dot-product:

Our observation: the maximum size of a parallel floating-point dot-product on FPGAs scales superlinearly with decreasing mantissa bit width.

Question: How much more performance can we gain?

Goal: Give a quantitative model for the performance improvement as a function of the mantissa bit width.

SLIDE 6

Architecture

Canonical dot-product for real-valued input vectors a, b:

$$\langle a, b \rangle = a^T b = \sum_{i=1}^{n} a_i \cdot b_i$$

There are different possibilities to implement a dot-product in hardware.

Our choice: binary-tree based dot-product architecture. Thus:
- m parallel multipliers
- m − 1 adders
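The binary-tree structure can be mimicked in software to check the operator counts. This is a minimal sketch, not the hardware description used in the talk; the function name `tree_dot` and the choice to pass an unpaired element through to the next level are assumptions.

```python
def tree_dot(a, b):
    """Dot product via one multiplier level and a binary adder tree.

    Returns (result, number_of_adders, tree_depth).
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    # Level 0: m parallel multipliers.
    level = [x * y for x, y in zip(a, b)]
    adders = 0
    depth = 0
    while len(level) > 1:
        # Add neighbouring pairs; each pair costs one adder.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        adders += len(nxt)
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd element is forwarded unchanged
        level = nxt
        depth += 1
    return level[0], adders, depth

if __name__ == "__main__":
    print(tree_dot([1, 2, 3, 4], [5, 6, 7, 8]))
```

For m input pairs the loop always uses m − 1 additions, matching the adder count stated on the slide.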


[Figure: binary-tree dot-product architecture with m parallel multipliers (∗) feeding m − 1 adders (+) that produce the result]

SLIDE 7


Splitting (arbitrarily long) vectors:

$$\langle a, b \rangle = \sum_{i=1}^{n} a_i \cdot b_i = \sum_{j=0}^{\lfloor n/m \rfloor - 1} \sum_{i=1}^{m} a_{i+j \cdot m} \cdot b_{i+j \cdot m} + \sum_{i=\lfloor n/m \rfloor \cdot m + 1}^{n} a_i \cdot b_i$$

We investigate: a custom dot-product operator accepting a maximum input vector length m, for different floating-point mantissa bit widths.
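The splitting formula can be checked with a short sketch that processes ⌊n/m⌋ full blocks of length m and then the remainder. The name `blocked_dot` is hypothetical, and Python's 0-based indexing shifts the slide's 1-based indices by one.

```python
def blocked_dot(a, b, m):
    """Dot product of length-n vectors computed in blocks of at most m pairs."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    if m < 1:
        raise ValueError("block size m must be at least 1")
    n = len(a)
    full_blocks = n // m  # floor(n / m)
    total = 0
    # Double sum: j runs over full blocks, i over the m pairs inside a block.
    for j in range(full_blocks):
        total += sum(a[i + j * m] * b[i + j * m] for i in range(m))
    # Remainder: the last n - floor(n / m) * m pairs.
    total += sum(a[i] * b[i] for i in range(full_blocks * m, n))
    return total

if __name__ == "__main__":
    print(blocked_dot([1, 2, 3, 4, 5], [1, 1, 1, 1, 1], 2))
```

Each inner sum corresponds to one invocation of the m-input hardware operator; the outer accumulation is outside the scope of the operator studied here.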


[Figure: input vectors a1 ... an and b1 ... bn split into blocks a1 ... am and b1 ... bm that are fed to the dot-product operator, which is marked as our focus]

SLIDE 8


Given an FPGA of a certain size, we want to know: peak performance as a function of the mantissa bit width used.

Dot-product architecture: peak performance depends on
- the number of parallel multipliers m_max
- the maximum frequency f_max

Thus, we need to:
- implement the architecture for each mantissa bit width
- measure its hardware resource usage

SLIDE 9

Experiments

Implementation issues:
- We implemented a generic dot-product architecture for arbitrary vector lengths
- Standard IEEE 754 floating-point format
- Arbitrary-precision floating-point modules:
  - chosen library: FPLibrary (Arénaire project, ENS Lyon), http://www.ens-lyon.fr/LIP/Arenaire/Ware/FPLibrary/
  - combinatorial operators used

Measurement issues:
- Synthesis tool used: Quartus II (Altera)
- Automated measurements using the Tcl scripting language:
  - set generics
  - synthesize the implementation
  - record hardware resource usage
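One way such an automated sweep could locate the largest dot-product size that still fits the device is a doubling search followed by bisection. This is only a sketch under assumptions: the slides do not describe the search strategy, the predicate `fits` stands in for a real synthesis run, and the per-operator costs in `toy_fits` are made-up numbers, not measurements. Only the logic-element budget (68,416 for the EP2C70) comes from the slides.

```python
def largest_fitting_size(fits, m_limit=1024):
    """Largest m in 1..m_limit with fits(m) True, assuming fits is monotone in m."""
    if not fits(1):
        return 0
    lo, hi = 1, 2
    # Doubling phase: grow until the design no longer fits or the limit is hit.
    while hi <= m_limit and fits(hi):
        lo, hi = hi, hi * 2
    hi = min(hi, m_limit + 1)  # treat everything above m_limit as not fitting
    # Bisection phase: fits(lo) holds, hi does not fit (or is out of range).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo

def toy_fits(m, budget=68416, mult_cost=100, add_cost=60):
    """Hypothetical stand-in for 'synthesize and check resource usage'."""
    return m * mult_cost + (m - 1) * add_cost <= budget

if __name__ == "__main__":
    print(largest_fitting_size(toy_fits))
```

In a real flow `fits` would set the generics, launch synthesis and parse the resource report, so each call is expensive; the logarithmic number of calls is the reason for choosing bisection in this sketch.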

SLIDE 10

Our implementation:

- Accepts generic parameters: mantissa bit width, exponent bit width, m
- m parallel multipliers (accepts 2m input operands)
- Binary adder tree of depth $\lceil \log_2 m \rceil$
- Stages pipelined (registers)
- Total latency: $\lceil \log_2 m \rceil + 3$

Peak performance:

$$P = (2m - 1) \cdot f_{max} \quad [\text{Flop/s}]$$
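The two formulas on this slide are easy to evaluate directly. In the sketch below the function names are hypothetical, and the example values (m = 4 at 50 MHz) are illustrative only, not measurements from the talk. The factor 2m − 1 counts m multiplications plus m − 1 additions completed per clock cycle once the pipeline is full.

```python
def tree_depth(m: int) -> int:
    """ceil(log2(m)) for m >= 1, computed exactly with integer arithmetic."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return (m - 1).bit_length()

def total_latency(m: int) -> int:
    """Total latency as given on the slide: ceil(log2(m)) + 3."""
    return tree_depth(m) + 3

def peak_performance(m: int, f_max_hz: float) -> float:
    """Peak performance P = (2m - 1) * f_max in Flop/s."""
    return (2 * m - 1) * f_max_hz

if __name__ == "__main__":
    m, f_max = 4, 50e6  # illustrative values only
    print(total_latency(m), peak_performance(m, f_max) / 1e9, "GFlop/s")
```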


[Figure: example with four multipliers (a0·b0, a1·b1, a2·b2, a3·b3) feeding three adders that produce the result]

SLIDE 11


Methodology:
- First: we perform measurements on the largest Cyclone II FPGA device (EP2C70)
- Then: develop a model that best approximates these original measurements
- Finally: verify the model class with the measurements obtained from two more recent devices

FPGA device   FPGA family   Logic elements   DSP blocks [9x9-bit blocks]   Emb. memory [kbits]
EP2C70        Cyclone II    68,416           300                           1,125
EP3C80        Cyclone III   81,264           488                           2,745
EP3SL70       Stratix III   67,500           576                           2,214

Table: Hardware resources of used FPGAs.

SLIDE 12

FPGA peak performance vs. mantissa bit width
- Measure the maximum dot-product size m_max and the maximum frequency f_max
- Mantissa sizes: 52 down to 4
- Calculate peak performance P = (2m − 1) · f_max [Flop/s]
- Observation: peak performance grows exponentially as the mantissa bit width decreases


[Figure: Dot Product Peak Performance. EP2C70 peak performance [GFlop/s] vs. mantissa bit width [Bits], 4 to 52]

[Figure: Maximum Dot Product Size. EP2C70 maximum dot-product size (maximum input operand pairs) and EP2C70 f_max (maximum clock frequency [MHz]) vs. mantissa bit width [Bits], 4 to 52]

SLIDE 13


Dot-product performance model

Model:

Fit: fractional polynomial of the form $P(p) = c_1 + c_2 \cdot p^{c_3}$ with $c_1, c_2, c_3 \in \mathbb{Q}$, where p is the mantissa bit width.

EP2C70: $P(p) = -7.37 + 32.16 \cdot p^{-0.35}$
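A model of this form can be fitted with very little machinery: for a fixed exponent c3 the model is linear in c1 and c2, so ordinary least squares gives them in closed form, and c3 can be scanned over a grid. The slides do not say which fitting procedure was used, so this is an assumed approach. The data below are synthetic, generated from the EP2C70 constants on the slide rather than taken from the measurements, and serve only to show that the procedure recovers known constants.

```python
def model(p, c1, c2, c3):
    """Fractional polynomial P(p) = c1 + c2 * p**c3."""
    return c1 + c2 * p ** c3

def fit_fractional_polynomial(ps, values, c3_grid):
    """Scan c3 over a grid; for each c3 solve least squares for c1 and c2."""
    n = len(ps)
    mean_y = sum(values) / n
    best = None
    for c3 in c3_grid:
        xs = [p ** c3 for p in ps]
        mean_x = sum(xs) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        if sxx == 0.0:
            continue  # constant regressor (e.g. c3 == 0): slope undefined
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
        c2 = sxy / sxx
        c1 = mean_y - c2 * mean_x
        sse = sum((c1 + c2 * x - y) ** 2 for x, y in zip(xs, values))
        if best is None or sse < best[0]:
            best = (sse, c1, c2, c3)
    if best is None:
        raise ValueError("no usable exponent in c3_grid")
    return best[1], best[2], best[3]

# Synthetic data (assumption): mantissa widths 4, 6, ..., 52 and the slide's constants.
ps = list(range(4, 53, 2))
synthetic = [model(p, -7.37, 32.16, -0.35) for p in ps]
c3_grid = [-k / 100 for k in range(1, 101)]  # -0.01 ... -1.00
c1, c2, c3 = fit_fractional_polynomial(ps, synthetic, c3_grid)

if __name__ == "__main__":
    print(round(c1, 2), round(c2, 2), c3)
```

With real measurements the grid resolution limits the accuracy of c3; a finer grid or a nonlinear optimiser could refine it.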


$$\text{Error}_{abs} = P - \hat{P} \quad [\text{Flop/s}]$$

$$\text{Error}_{rel} = \frac{P - \hat{P}}{P} \cdot 100 \quad [\%]$$

$\hat{P}$ := modelled value
$P$ := measured value, $P = (2m - 1) \cdot f_{max}$ [Flop/s]
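The two error measures translate directly into code. The function names are hypothetical and the numbers in the usage line are illustrative, not values from the talk.

```python
def absolute_error(measured: float, modelled: float) -> float:
    """Error_abs = P - P_hat, in the same unit as the inputs."""
    return measured - modelled

def relative_error(measured: float, modelled: float) -> float:
    """Error_rel = (P - P_hat) / P * 100, in percent."""
    if measured == 0:
        raise ValueError("relative error is undefined for a measured value of 0")
    return (measured - modelled) / measured * 100

if __name__ == "__main__":
    # Illustrative values only: measured 8.0 GFlop/s, modelled 6.0 GFlop/s.
    print(absolute_error(8.0, 6.0), relative_error(8.0, 6.0))
```

Both errors are signed, so a positive value means the model underestimates the measurement.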

[Figure: Dot Product Peak Performance Model. Measured EP2C70 peak performance [GFlop/s] and the fit $P_{EP2C70}(p) = -7.37 + 32.16 \cdot p^{-0.35}$, with panels for relative error [%] and absolute error [GFlop/s], vs. mantissa bit width [Bits], 4 to 52]

SLIDE 14


Verify observations on more recent FPGA devices (families):
- Given appropriate constants, peak performance as a function of mantissa bit width can be modeled quite accurately
- Maximum absolute error: 1 GFlop/s
- Average relative error: ≈ 5-7 %


[Figure: Dot Product Peak Performance Model. Measured peak performance [GFlop/s] and fits for EP2C70, EP3C80 and EP3SL70, with panels for relative error [%] and absolute error [GFlop/s], vs. mantissa bit width [Bits], 4 to 52. Fits: $P_{EP2C70}(p) = -7.37 + 32.16 \cdot p^{-0.35}$, $P_{EP3C80}(p) = -10.31 + 43.29 \cdot p^{-0.33}$, $P_{EP3SL70}(p) = -19.68 + 60.90 \cdot p^{-0.26}$]

SLIDE 15


Conclusions

FPGA      Perf. model [GFlop/s]                 Device family
EP2C70    $-7.37 + 32.16 \cdot p^{-0.35}$       Cyclone II (EP2C5, EP2C8, EP2C15, EP2C20, EP2C35, EP2C50, EP2C70)
EP3C80    $-10.31 + 43.29 \cdot p^{-0.33}$      Cyclone III (EP3C5, EP3C10, EP3C16, EP3C25, EP3C40, EP3C55, EP3C80, EP3C120)
EP3SL70   $-19.68 + 60.90 \cdot p^{-0.26}$      Stratix III (EP3SL50, EP3SL70, EP3SL110, EP3SL150, EP3SL200, EP3SE260, EP3SL340)

Table: Performance Model Overview

We have shown:
- The performance benefit of reduced precision can be reliably quantified for the devices considered and for comparable settings
- The model is very simple and requires practically no runtime to compute
- It can serve as a tool for designers wishing to explore the design space
- It could stimulate work identifying the minimal required precision of dot-products in given applications
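As an illustration of the design-space exploration mentioned above, the three fitted models can be evaluated for any mantissa bit width in the measured range. The constants are taken from the table; the names `MODELS`, `predicted_peak` and `speedup` are hypothetical, and restricting p to 4..52 is an assumption based on the range of mantissa sizes measured.

```python
# (c1, c2, c3) of P(p) = c1 + c2 * p**c3 in GFlop/s, from the overview table.
MODELS = {
    "EP2C70": (-7.37, 32.16, -0.35),
    "EP3C80": (-10.31, 43.29, -0.33),
    "EP3SL70": (-19.68, 60.90, -0.26),
}

def predicted_peak(device: str, p: int) -> float:
    """Modelled peak performance in GFlop/s for mantissa bit width p."""
    if not 4 <= p <= 52:
        raise ValueError("model was fitted for mantissa bit widths 4..52 only")
    c1, c2, c3 = MODELS[device]
    return c1 + c2 * p ** c3

def speedup(device: str, p_from: int, p_to: int) -> float:
    """Modelled performance ratio when moving from p_from to p_to mantissa bits."""
    return predicted_peak(device, p_to) / predicted_peak(device, p_from)

if __name__ == "__main__":
    for device in MODELS:
        print(device, [round(predicted_peak(device, p), 2) for p in (52, 23, 10, 4)])
```

A designer could, for example, call `speedup("EP2C70", 52, 23)` to estimate the modelled gain of dropping from a double-precision to a single-precision mantissa width on that device.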

SLIDE 16

Future work

- Robustness of our results (synthesis settings)
- Behaviour of different floating-point libraries
- Explore the impact of pipelining

SLIDE 17

Thank you for your attention! Questions?

Research Lab Computational Technologies and Applications University of Vienna http://rlcta.univie.ac.at
