Computing just right: Application-specific arithmetic
Florent de Dinechin
Outline
Anti-introduction: the arithmetic you want in a processor
Operator parameterization
FPGAs computing Just Right: Application-specific arithmetic 2
The quotient digit should have been zero, so we should have subtracted 0; this is easy to fix.
digit-by-number product, subtraction, finding the next quotient digit
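The three steps above can be sketched as binary restoring division. This is a minimal illustration, not the slides' own code: the function name `restoring_divide` and its parameters are hypothetical. In radix 2 the digit-by-number product is either 0 or the divisor itself, and a negative trial remainder means the guessed digit should have been zero.

```python
def restoring_divide(dividend: int, divisor: int, nbits: int) -> tuple[int, int]:
    """Binary restoring division (hypothetical helper, for illustration only).

    Returns (quotient, remainder) for an unsigned dividend of nbits bits.
    """
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    if not 0 <= dividend < (1 << nbits):
        raise ValueError("dividend does not fit in nbits bits")

    quotient = 0
    remainder = 0
    for i in range(nbits - 1, -1, -1):
        # Bring down the next dividend bit.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        # Guess quotient digit 1: the digit-by-number product is 1 * divisor.
        trial = remainder - 1 * divisor
        if trial < 0:
            # The digit should have been 0, i.e. we should have subtracted 0:
            # keep the previous remainder (this is the "restore" step).
            digit = 0
        else:
            digit = 1
            remainder = trial
        quotient = (quotient << 1) | digit
    return quotient, remainder


print(restoring_divide(100, 7, 8))
```

Each loop iteration produces one quotient bit, so an nbits-wide division takes nbits subtract-and-compare steps.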
Newton-Raphson, Goldschmidt, ...
Polynomial approximation (Taylor-like), ...
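As an illustration of the first family, here is a minimal sketch of the Newton-Raphson iteration for a reciprocal, x ← x·(2 − d·x), which uses only multiplications and subtractions. The function name `nr_reciprocal`, the linear starting value and the iteration count are assumptions chosen for this example; they are not taken from the slides.

```python
import math


def nr_reciprocal(d: float, iterations: int = 4) -> float:
    """Approximate 1/d with Newton-Raphson (hypothetical helper)."""
    if d <= 0.0:
        raise ValueError("this sketch only handles positive inputs")
    # Scale d to m in [0.5, 1) so that one starting formula covers all inputs.
    m, e = math.frexp(d)  # d == m * 2**e
    # Classical linear starting value for m in [0.5, 1]:
    # the initial relative error is at most 1/17.
    x = 48.0 / 17.0 - (32.0 / 17.0) * m
    for _ in range(iterations):
        # Each step roughly squares the relative error.
        x = x * (2.0 - m * x)
    # Undo the scaling: 1/d == (1/m) * 2**(-e).
    return math.ldexp(x, -e)


print(nr_reciprocal(3.0))
```

Because the error is squared at every step, four iterations are enough here to reach double-precision accuracy, up to the rounding of the last operations.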
256 ?
There will probably never be a “multiply by log(2)” instruction in a general-purpose processor.
[Figure: architecture of the floating-point exponential. Blocks: shift to fixed-point, ×1/log(2), ×log(2), e^A, e^Z − Z − 1, normalize/round. Signals: X (fields SX, EX, FX), E, A, Z, Y, R. Bit widths are annotated in terms of wE, wF, g and k, e.g. wE + 1, 1 + wF + g, wF + g + 1 − 2k, wE + wF + g + 1.]
In this exponential, some signals are 12 bits wide, others 69 bits.
dimensions of DSP and RAM blocks, LUT cluster size, ...
[Figure: the exponential architecture with concrete signal widths (27, 17, 9, 9, 17, 9), using an 18 Kbit dual-port ROM and a DSP block.]
[Figure: floating-point sum-of-squares architecture. Blocks: unpack, squarer (×3), sort (×2), shifter (×2), add, normalize/pack. Signals: X, Y, Z; exponents EX, EY, EZ, EB, EC; mantissas MX, MY, MZ; squares MA2, MB2, MC2; result R. Bit widths: 1 + wF, 2 + wF + g, 4 + wF + g, wE + wF + g.]
Error analysis needed ... context-specific implicit knowledge
at the application level, but also when designing compound components.
Example: the multiplier by log(2):
◮ small input (12 bits for FP64)
◮ large output (69 bits for FP64)
With all five partial products:
          xxxxx
        × 11001
          xxxxx
         00000
        00000
       xxxxx
      xxxxx
    .yyyyyyyyyy
With the zero rows removed:
          xxxxx
        × 11001
          xxxxx
       xxxxx
      xxxxx
    .yyyyyyyyyy
Two competitive, well-researched techniques; tens of publications (well beyond what synthesis tools would optimize out; details later).
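The text does not name the two techniques at this point. As a plain illustration of why a multiplier by a constant is cheaper than a generic one, the sketch below uses the textbook shift-and-add approach: only the nonzero bits of the constant contribute a row. The function name `constant_multiply` is hypothetical.

```python
def constant_multiply(x: int, constant: int) -> tuple[int, int]:
    """Multiply x by a known constant using shifts and additions only.

    Returns (product, number_of_additions). Hypothetical helper for
    illustration; real generators use more elaborate optimizations.
    """
    if constant < 0:
        raise ValueError("this sketch only handles non-negative constants")
    # One shifted copy of x per nonzero bit of the constant.
    rows = [x << i for i in range(constant.bit_length()) if (constant >> i) & 1]
    additions = max(len(rows) - 1, 0)
    return sum(rows), additions


# 0b11001 has three nonzero bits: x*25 = x + (x << 3) + (x << 4).
print(constant_multiply(21, 0b11001))
```

For the constant 11001 this needs two additions instead of the four required to sum five partial products.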
... and even more efficient than multiplying by 1/3 (technique shown later).
Here, we use a completely different algorithm.
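The algorithm itself is not spelled out at this point. One algorithm that fits the description, shown here purely as an assumption, is long division by 3 driven by a small lookup table: the dividend is consumed four bits at a time, and a table indexed by (previous remainder, next digit) returns the quotient digit and the new remainder. The names `TABLE`, `RADIX_BITS` and `divide_by_3` are hypothetical.

```python
RADIX_BITS = 4  # process the dividend one hexadecimal digit at a time

# Table indexed by (remainder in 0..2, digit in 0..15).
# Each entry holds (quotient digit, new remainder) for the value 16*r + d.
TABLE = {
    (r, d): divmod((r << RADIX_BITS) | d, 3)
    for r in range(3)
    for d in range(1 << RADIX_BITS)
}


def divide_by_3(x: int, ndigits: int) -> tuple[int, int]:
    """Divide x by 3 with table lookups only (illustrative sketch)."""
    if not 0 <= x < (1 << (RADIX_BITS * ndigits)):
        raise ValueError("x does not fit in ndigits hexadecimal digits")
    quotient = 0
    remainder = 0
    for i in range(ndigits - 1, -1, -1):
        digit = (x >> (RADIX_BITS * i)) & ((1 << RADIX_BITS) - 1)
        q_digit, remainder = TABLE[(remainder, digit)]
        quotient = (quotient << RADIX_BITS) | q_digit
    return quotient, remainder


print(divide_by_3(1000, 3))
```

Since the remainder is always below 3, every quotient digit fits in four bits and the table has only 48 entries; no multiplication is involved.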
Each digit-by-digit product is computed twice in a squarer:
       2321
     × 2321
       2321
      4642
     6963
    4642
    5387041
Keeping only one copy of each digit-by-digit product (the off-diagonal ones then count double):
       2321
     × 2321
       2321
      464
     69
    4
    5387041
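The saving can be checked with a short sketch that squares a number digit by digit, computing each cross product d_i·d_j (i < j) once and doubling it. The function name `square_by_digits` is hypothetical.

```python
def square_by_digits(x: int, base: int = 10) -> tuple[int, int]:
    """Square x from its digits, computing each digit product only once.

    Returns (square, number_of_digit_products). Illustrative sketch.
    """
    if x < 0 or base < 2:
        raise ValueError("expects x >= 0 and base >= 2")
    digits = []
    while x:
        digits.append(x % base)  # least significant digit first
        x //= base

    total = 0
    products = 0
    for i, di in enumerate(digits):
        # Diagonal term: d_i * d_i appears once.
        total += di * di * base ** (2 * i)
        products += 1
        for j in range(i + 1, len(digits)):
            # Off-diagonal term: d_i * d_j equals d_j * d_i, so double it.
            total += 2 * di * digits[j] * base ** (i + j)
            products += 1
    return total, products


print(square_by_digits(2321))
```

For a 4-digit operand this uses 10 digit products instead of the 16 of a generic multiplier; for n digits the count is n(n+1)/2 instead of n².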
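Standard multiplier, all partial-product bits computed:
         .10101
       × .11001
          10101
         00000
        00000
       10101
      10101
    .1000001101   rounded to .10000
Truncated multiplier, the three rightmost columns left out:
         .10101
       × .11001
          10
         000
        0000
       10101
      10101
    .1000001      rounded to .10000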
Same accuracy with truncated(n+1) as with standard(n), at almost half the cost.
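A minimal sketch of a truncated multiplier follows, assuming unsigned n-bit operands and a hypothetical helper `truncated_product` that simply skips every partial-product bit whose weight falls in the dropped columns. Practical designs usually also add a small correction constant, which is omitted here.

```python
def truncated_product(a: int, b: int, n: int, dropped: int) -> tuple[int, int]:
    """Sum only the partial-product bits of weight 2**dropped and above.

    Returns (approximate_product, number_of_bit_products_computed).
    Illustrative sketch; dropped = 0 gives the exact product.
    """
    total = 0
    cells = 0
    for i in range(n):          # bit index in a
        for j in range(n):      # bit index in b
            if i + j >= dropped:
                cells += 1
                total += (((a >> i) & 1) * ((b >> j) & 1)) << (i + j)
    return total, cells


def round_to_top_bits(product: int, keep_from: int) -> int:
    """Round to nearest by adding half a unit before dropping low bits."""
    return (product + (1 << (keep_from - 1))) >> keep_from


a, b, n = 0b10101, 0b11001, 5
exact, exact_cells = truncated_product(a, b, n, dropped=0)
approx, approx_cells = truncated_product(a, b, n, dropped=3)
print(bin(exact), exact_cells, bin(approx), approx_cells)
print(round_to_top_bits(exact, n), round_to_top_bits(approx, n))
```

On this small example the truncated array computes 19 bit products instead of 25, and both results round to the same 5-bit value. The saving approaches one half as the operand size grows and the number of dropped columns approaches n.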
This happens in sum of squares, etc – or when physics tells you!
[Figure: dual-path floating-point adder. Inputs x and y, result z. Exponent difference ex − ey selects between the close path (|mx − my|, 1-bit shift, LZC/shift by λ) and the far path (shift, sticky bit); then prenormalization (2-bit shift), and rounding, normalization and exception handling. Datapath widths: p, p + 1, 2p + 2.]
... when the physics tells you so (to be detailed later)
half the hardware required
5 rounding errors in the floating-point version; (x^2 + y^2) + z^2 is asymmetrical.
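The asymmetry can be observed directly in double precision. In the sketch below, the function names and the test value `small` are chosen for this illustration only. The floating-point version performs three rounded squarings and two rounded additions, and its result depends on which argument comes first, while the exact sum does not.

```python
from fractions import Fraction


def fp_sum_of_squares(x: float, y: float, z: float) -> float:
    """(x^2 + y^2) + z^2 in floating point: 3 multiplications + 2 additions,
    hence 5 roundings."""
    return (x * x + y * y) + z * z


def exact_sum_of_squares(x: float, y: float, z: float) -> Fraction:
    """The same quantity computed exactly, for reference."""
    return Fraction(x) ** 2 + Fraction(y) ** 2 + Fraction(z) ** 2


# small**2 is below half an ulp of 1.0, but 2 * small**2 is above it.
small = 1.25 * 2.0 ** -27

large_first = fp_sum_of_squares(1.0, small, small)
large_last = fp_sum_of_squares(small, small, 1.0)

print(large_first, large_last, large_first == large_last)
print(exact_sum_of_squares(1.0, small, small) == exact_sum_of_squares(small, small, 1.0))
```

When the large term comes first, each small square is rounded away separately; when it comes last, the two small squares are first added together and survive the final rounding.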
Never compute 1 bit more accurately than needed!
[Figure: the exponential architecture annotated with its building blocks: constant multipliers, precomputed ROM, generic polynomial evaluator, truncated multiplier. Signal widths are parameterized by wE, wF, g and k.]
Need a generator.
written in C++, outputting VHDL
“obscure branches”
integer, fixed-point, floating-point, logarithm number system
all operators fully parameterized
flexible pipeline for all operators
Interface: never output bits that are not numerically meaningful
Inside: never compute bits that are not useful to the final result
few well-typed inputs and outputs
no memory or side effect
◮ (even filters are defined by a transfer function)
... mathematically specified in terms of a rounding function
e.g. IEEE-754 FP standard: operator(x) = rounding(operation(x))
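This specification style can be made concrete with a short sketch for addition: compute the mathematical result exactly, then apply the rounding function once. The name `correctly_rounded_add` is hypothetical; the sketch assumes CPython's conversion from `Fraction` to `float`, which rounds to the nearest double, and it ignores overflow and special values.

```python
from fractions import Fraction


def correctly_rounded_add(x: float, y: float) -> float:
    """operator(x, y) = rounding(operation(x, y)) for addition (sketch)."""
    exact = Fraction(x) + Fraction(y)  # operation: exact rational sum
    return float(exact)                # rounding: nearest double


# The hardware adder must return exactly this value.
print(correctly_rounded_add(0.1, 0.2), 0.1 + 0.2)
```

Because the specification is purely mathematical, any implementation can be tested against it bit for bit.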