[PPT] - ECE232: Hardware Organization and Design Lecture 9: Floating Point PowerPoint Presentation

SLIDE 1

Adapted from Computer Organization and Design, Patterson & Hennessy, UCB

ECE232: Hardware Organization and Design

Lecture 9: Floating Point

SLIDE 2

ECE232: Floating Point 2

Floating Point

Representation for non-integral numbers
Including very small and very large numbers
Like scientific notation
$-2.34 \times 10^{56}$ (normalized)
$+0.002 \times 10^{-4}$ (not normalized)
$+987.02 \times 10^{9}$ (not normalized)
In binary
$\pm 1.xxxxxxx_2 \times 2^{yyyy}$
Types float and double in C

SLIDE 3

Floating Point Numbers

The largest 32-bit unsigned integer is

1111 1111 1111 1111 1111 1111 1111 1111 = 4,294,967,295

What if we want to encode the approximate age of the earth in years?

4,600,000,000 or $4.6 \times 10^{9}$

Or the weight in kg of one a.m.u. (atomic mass unit)?

0.00000000000000000000000000166 or $1.66 \times 10^{-27}$

Neither of these values can be encoded in a 32-bit integer.

SLIDE 4

Exponential Notation

The following are equivalent representations of 1,234:

$123{,}400.0 \times 10^{-2}$
$12{,}340.0 \times 10^{-1}$
$1{,}234.0 \times 10^{0}$
$123.4 \times 10^{1}$
$12.34 \times 10^{2}$
$1.234 \times 10^{3}$
$0.1234 \times 10^{4}$
$0.01234 \times 10^{5}$

The representations differ in where the decimal point sits: the “point” “floats” to the left or right, with a matching adjustment in the exponent.

SLIDE 5

Parts of a Floating Point Number

$0.9876 \times 10^{-3}$

Labeled parts: sign of mantissa, location of decimal point, mantissa (0.9876), exponent (3), sign of exponent (−), base (10)

Mantissa is also called Significand

SLIDE 6

Single Precision Format

Note that the exponent has no explicit sign bit
Base?

32 bits, from most to least significant:
S: Sign of mantissa (1 bit)
E: Exponent (8 bits)
M: Mantissa (23 bits)
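The three fields can be pulled out of a real single-precision value with shifts and masks. This is a minimal Python sketch; the helper name `float32_fields` is made up for illustration, and the standard `struct` module is used only to get at the raw 32-bit pattern.

```python
import struct

def float32_fields(x):
    """Split x (rounded to single precision) into its S, E and M bit fields."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31              # 1 sign bit
    e = (bits >> 23) & 0xFF     # 8 exponent bits
    m = bits & 0x7FFFFF         # 23 mantissa bits
    return s, e, m

s, e, m = float32_fields(14.5)
print(s, format(e, "08b"), format(m, "023b"))
# 0 10000010 11010000000000000000000
```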

SLIDE 7

Normalization

The mantissa M is a normalized fraction
Has an implied binary point on the left
Has an implied (hidden) “1” to the left of the binary point
E.g.,
Fraction 10100000000000000000000
represents $1.101_2 = 1.625_{10}$
The significand = 1.f is in the range [1, 2 − ulp]
ulp – unit in the last position

$F = (-1)^{S} \times 1.f \times 2^{E - \text{Bias}}$

SLIDE 8

IEEE Floating-Point Format

S: sign bit (0 ⇒ non-negative, 1 ⇒ negative)
Normalize significand: 1.0 ≤ |significand| < 2.0
Always has a leading pre-binary-point 1 bit, so no need to represent it explicitly (hidden bit)

Significand is Fraction with the “1.” restored
Exponent: excess representation: actual exponent + Bias
Ensures exponent is unsigned
Single: Bias = 127; Double: Bias = 1023

Field layout, from most to least significant:
S (1 bit) | Exponent (single: 8 bits, double: 11 bits) | Fraction (single: 23 bits, double: 52 bits)

$x = (-1)^{S} \times (1 + \text{Fraction}) \times 2^{(\text{Exponent} - \text{Bias})}$
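The value formula for this format, x = (−1)^S × (1 + Fraction) × 2^(Exponent − Bias), can be evaluated directly from the three fields. The sketch below assumes a normalized number (exponent field neither all 0s nor all 1s); the function name `decode` and its parameters are illustrative choices, not part of any standard API.

```python
def decode(sign, exponent, fraction, exp_bits=8, frac_bits=23):
    """Evaluate (-1)^S * (1 + Fraction) * 2^(Exponent - Bias) for a normalized number."""
    bias = (1 << (exp_bits - 1)) - 1                 # 127 for single, 1023 for double
    significand = 1 + fraction / (1 << frac_bits)    # restore the hidden "1."
    return (-1) ** sign * significand * 2.0 ** (exponent - bias)

# Single precision: S = 1, Exponent = 129, Fraction = 01000...0
print(decode(1, 129, 0b01 << 21))   # -5.0
```

Passing `exp_bits=11, frac_bits=52` evaluates the same formula for double precision.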

SLIDE 9

Single-Precision Range

Exponents 00000000 and 11111111 reserved
Smallest value
Exponent: 00000001
⇒ actual exponent = 1 − 127 = −126
Fraction: 000…00 ⇒ significand = 1.0
$\pm 1.0 \times 2^{-126} \approx \pm 1.2 \times 10^{-38}$
Largest value
Exponent: 11111110
⇒ actual exponent = 254 − 127 = +127
Fraction: 111…11 ⇒ significand ≈ 2.0
$\pm 2.0 \times 2^{+127} \approx \pm 3.4 \times 10^{+38}$
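The two range limits can be checked by building the bit patterns and comparing them with the arithmetic above. A short Python sketch follows; `from_bits` is a hypothetical helper, and the exact largest significand is 2 − 2^−23 rather than the rounded 2.0 used in the estimate.

```python
import math
import struct

def from_bits(bits):
    """Reinterpret a 32-bit integer pattern as a single-precision float."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

smallest = from_bits(0x00800000)   # exponent 00000001, fraction all zeros
largest = from_bits(0x7F7FFFFF)    # exponent 11111110, fraction all ones

print(smallest == math.ldexp(1.0, -126))                # True
print(largest == math.ldexp(2.0 - 2.0 ** -23, 127))     # True
print(f"{smallest:.3e} {largest:.3e}")                  # 1.175e-38 3.403e+38
```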

SLIDE 10

Floating-Point Example

Represent –0.75
$-0.75 = (-1)^{1} \times 1.1_2 \times 2^{-1}$
S = 1
Fraction = $1000\ldots00_2$
Exponent = −1 + Bias
Single: $-1 + 127 = 126 = 01111110_2$
Double: $-1 + 1023 = 1022 = 01111111110_2$
Single: 1 01111110 1000…00
Double: 1 01111111110 1000…00
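The encoding of −0.75 can be confirmed by asking the machine for the stored bits. This is a small Python sketch using the standard `struct` module; the helper names `bits32` and `bits64` are illustrative.

```python
import struct

def bits32(x):
    """Single-precision bit pattern of x as a 32-character string."""
    (b,) = struct.unpack(">I", struct.pack(">f", x))
    return format(b, "032b")

def bits64(x):
    """Double-precision bit pattern of x as a 64-character string."""
    (b,) = struct.unpack(">Q", struct.pack(">d", x))
    return format(b, "064b")

s = bits32(-0.75)
print(s[0], s[1:9], s[9:])      # sign, 8-bit exponent, 23-bit fraction
d = bits64(-0.75)
print(d[0], d[1:12], d[12:])    # sign, 11-bit exponent, 52-bit fraction
```

The first line printed is `1 01111110 10000000000000000000000`, matching the single-precision result worked out by hand.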

SLIDE 11

Floating-Point Example

What number is represented by the single-precision float

1 10000001 01000…00 ?

S = 1
Fraction = $01000\ldots00_2$
Exponent = $10000001_2 = 129$
$x = (-1)^{1} \times (1 + 0.01_2) \times 2^{(129 - 127)}$

$= (-1) \times 1.25 \times 2^{2} = -5.0$

SLIDE 12

Floating-Point Addition

Consider a 4-digit decimal example
$9.999 \times 10^{1} + 1.610 \times 10^{-1}$
1. Align decimal points
Shift number with smaller exponent
$9.999 \times 10^{1} + 0.016 \times 10^{1}$
2. Add significands
$9.999 \times 10^{1} + 0.016 \times 10^{1} = 10.015 \times 10^{1}$
3. Normalize result & check for over/underflow
$1.0015 \times 10^{2}$
4. Round and renormalize if necessary
$1.002 \times 10^{2}$
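The 4-digit decimal addition can be reproduced with Python's standard `decimal` module by limiting the working precision to four significant digits. One caveat on this sketch: `decimal` adds the operands exactly and rounds once at the end, whereas the hand calculation drops digits while aligning. For these operands both routes give the same answer.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# A context that keeps only 4 significant decimal digits per result.
ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)

a = Decimal("9.999E+1")
b = Decimal("1.610E-1")
total = ctx.add(a, b)    # exact sum 100.1510, rounded to 4 digits
print(total)             # 100.2, i.e. 1.002 x 10^2
```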

SLIDE 13

Floating-Point Addition

Now consider a 4-digit binary example
$1.000_2 \times 2^{-1} + (-1.110_2 \times 2^{-2})$ (0.5 + −0.4375)
1. Align binary points
Shift number with smaller exponent
$1.000_2 \times 2^{-1} + (-0.111_2 \times 2^{-1})$
2. Add significands
$1.000_2 \times 2^{-1} + (-0.111_2 \times 2^{-1}) = 0.001_2 \times 2^{-1}$
3. Normalize result & check for over/underflow
$1.000_2 \times 2^{-4}$, with no over/underflow
4. Round and renormalize if necessary
$1.000_2 \times 2^{-4}$ (no change) = 0.0625
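Python's `float.hex()` prints a value as a hexadecimal significand and a power-of-two exponent, which makes it easy to check the binary example. Note that Python floats are double precision, so the significands show 13 hex digits rather than 4 bits; the operands here are small enough to be exact either way.

```python
a = 0.5        # 1.000_2 x 2^-1
b = -0.4375    # -1.110_2 x 2^-2

print(a.hex())          # 0x1.0000000000000p-1
print(b.hex())          # -0x1.c000000000000p-2  (hex c = binary 1100)
print((a + b).hex())    # 0x1.0000000000000p-4
print(a + b)            # 0.0625
```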

SLIDE 14

Steps in Addition/Subtraction

Step 1: Calculate the difference d of the two exponents: d = |E1 − E2|
Step 2: Shift the significand of the smaller number d positions to the right
Step 3: Add the aligned significands and set the exponent of the result to the exponent of the larger operand
Step 4: Normalize the resultant significand and adjust the exponent if necessary
Step 5: Round the resultant significand and adjust the exponent if necessary

Source: I. Koren, Computer Arithmetic Algorithms, 2nd Edition, 2002
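The five steps can be sketched in Python on a toy format. Everything below is an illustrative assumption rather than the cited algorithm's actual implementation: operands are `(sign, significand, exponent)` triples, the significand is a p-bit integer whose top bit is the explicit leading 1, rounding is to nearest with ties to even, and overflow and underflow checks are omitted.

```python
def fp_add(a, b, p=4):
    """Add two (sign, significand, exponent) triples with p-bit significands.

    A triple's value is (-1)**sign * significand * 2**(exponent - (p - 1)).
    """
    (s1, m1, e1), (s2, m2, e2) = a, b
    if e1 < e2:                        # keep the larger exponent in operand 1
        (s1, m1, e1), (s2, m2, e2) = (s2, m2, e2), (s1, m1, e1)
    d = e1 - e2                        # Step 1: exponent difference
    # Step 2: align. Operand 2 ends up d places to the right of operand 1;
    # keeping d extra low-order bits means the shift discards nothing.
    m1 <<= d
    # Step 3: add the signed, aligned significands; the exponent starts at e1.
    total = (-m1 if s1 else m1) + (-m2 if s2 else m2)
    if total == 0:
        return (0, 0, 0)
    sign, mag = int(total < 0), abs(total)
    n = mag.bit_length()
    exp = e1 + n - p - d               # Step 4: exponent after normalizing
    shift = n - p
    if shift <= 0:
        sig = mag << -shift            # fewer than p bits: shift left, exact
    else:                              # Step 5: round to nearest, ties to even
        sig, rem = divmod(mag, 1 << shift)
        half = 1 << (shift - 1)
        if rem > half or (rem == half and sig & 1):
            sig += 1
        if sig == 1 << p:              # rounding carried out: renormalize
            sig >>= 1
            exp += 1
    return (sign, sig, exp)

def to_float(t, p=4):
    """Convert a triple back to a Python float for checking."""
    s, m, e = t
    return (-1) ** s * m * 2.0 ** (e - (p - 1))

# 1.000_2 x 2^-1 + (-1.110_2 x 2^-2)
result = fp_add((0, 0b1000, -1), (1, 0b1110, -2))
print(result, to_float(result))   # (0, 8, -4) 0.0625
```

Real hardware keeps only a few guard bits during alignment instead of d extra bits; the unbounded integers here trade realism for a shorter sketch.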

SLIDE 15

Example: Single precision

0 10000010 11010000000000000000000

Sign: 0 = positive mantissa
Exponent: 10000010 = 130; 130 − 127 = 3
Mantissa: 1101000…0 ⇒ $1.1101_2$

$+1.1101_2 \times 2^{3} = 1110.1_2 = 14.5_{10}$

SLIDE 16

Converting to IEEE format

Example - decimal number: $-3.154 \times 10^{0}$
What is the sign?
What is the exponent?
What is the mantissa?

Converting Mixed Numbers – Decimal to Binary

$456.78_{10} = 4 \times 10^{2} + 5 \times 10^{1} + 6 \times 10^{0} + 7 \times 10^{-1} + 8 \times 10^{-2}$
$1011.11_2 = 1 \times 2^{3} + 0 \times 2^{2} + 1 \times 2^{1} + 1 \times 2^{0} + 1 \times 2^{-1} + 1 \times 2^{-2}$
$= 8 + 0 + 2 + 1 + 1/2 + 1/4 = 11 + 0.5 + 0.25 = 11.75_{10}$
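The positional expansion of a binary mixed number such as 1011.11 can be written as a short loop over the digits on each side of the point. A minimal Python sketch, with `binary_to_decimal` as a made-up helper name:

```python
def binary_to_decimal(text):
    """Evaluate a binary string such as '1011.11' by summing bit * 2^position."""
    whole, _, frac = text.partition(".")
    value = 0.0
    for i, bit in enumerate(reversed(whole)):   # positions 0, 1, 2, ... left of the point
        value += int(bit) * 2 ** i
    for i, bit in enumerate(frac, start=1):     # positions -1, -2, ... right of the point
        value += int(bit) * 2 ** -i
    return value

print(binary_to_decimal("1011.11"))   # 11.75
```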

SLIDE 17

How to convert whole Decimal to Binary

Successive division by 2
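$57143_{10} = 1101111100110111_2$

Start at the bottom row and halve upward; the remainders, read top to bottom, are the binary digits.

Remainder   Value
1           1
1           3
0           6
1           13
1           27
1           55
1           111
1           223
0           446
0           892
1           1785
1           3571
0           7142
1           14285
1           28571
1           57143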
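Successive division by 2 is a short loop: each division yields one remainder bit, and the bits come out least significant first. A minimal Python sketch (`to_binary` is an illustrative name):

```python
def to_binary(n):
    """Convert a non-negative integer to a binary string by repeated division by 2."""
    bits = []
    while n > 0:
        n, r = divmod(n, 2)    # quotient feeds the next step, remainder is a bit
        bits.append(str(r))
    # Remainders were produced least significant first, so reverse them.
    return "".join(reversed(bits)) or "0"

print(to_binary(57143))   # 1101111100110111
```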

SLIDE 18

Converting fractional Decimal to Binary

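Successive multiplication by 2, starting from 0.154 (the integer part of each product is the next bit; only the fractional part carries forward):

Step   Product   Bit
1      0.308     0
2      0.616     0
3      1.232     1
4      0.464     0
5      0.928     0
6      1.856     1
7      1.712     1
8      1.424     1
9      0.848     0
10     1.696     1
11     1.392     1
12     0.784     0
13     1.568     1
14     1.136     1
15     0.272     0
16     0.544     0
17     1.088     1
18     0.176     0
19     0.352     0
20     0.704     0
21     1.408     1
22     0.816     0
23     1.632     1

Decimal 0.154 ≈ .0010 0111 0110 1100 1000 101 in binary (first 23 fraction bits)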
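Successive multiplication by 2 can be sketched in Python as follows. The function name `fraction_bits` is hypothetical, and `fractions.Fraction` is used so that 0.154 is held exactly, as it is in the hand calculation, instead of as an already-rounded binary float.

```python
from fractions import Fraction

def fraction_bits(x, n):
    """Return the first n binary fraction bits of x, where 0 <= x < 1."""
    bits = []
    for _ in range(n):
        x *= 2
        bit = int(x >= 1)      # the integer part of the product is the next bit
        bits.append(str(bit))
        x -= bit               # keep only the fractional part
    return "".join(bits)

print(fraction_bits(Fraction(154, 1000), 23))
# 00100111011011001000101
```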

SLIDE 19

Floating Point Special Representations

$F = (-1)^{S} \times 1.f \times 2^{E - 127}, \qquad 1 \le 1.f < 2$

There are two zeroes, ±0, and two infinities, ±∞
NaN (Not-a-Number) may have a sign and has a non-zero fraction; used for program diagnostics
NaNs and infinities have all 1s in the Exp field, E = 255.

F + ∞ = ∞, F/∞ = 0

Source: I. Koren, Computer Arithmetic Algorithms, 2nd Edition, 2002
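These special values are easy to poke at from Python, whose floats follow IEEE 754 on common platforms. A brief sketch; `exp_field` is an illustrative helper that stores the value in single precision and returns its 8-bit exponent field.

```python
import math
import struct

def exp_field(x):
    """Return the 8-bit exponent field of x stored as a single-precision float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return (bits >> 23) & 0xFF

F = 1.5
print(F + math.inf, F / math.inf)                  # inf 0.0
print(0.0 == -0.0)                                 # True: the two zeroes compare equal
print(math.copysign(1.0, -0.0))                    # -1.0: but their sign bits differ
print(exp_field(math.inf), exp_field(math.nan))    # 255 255
```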

SLIDE 20

Floating Point Special Representations

Single Precision          Double Precision          Object represented
Exponent    Fraction      Exponent    Fraction
0           0             0           0             zero
0           nonzero       0           nonzero       ± denormalized number
1-254       Anything      1-2046      Anything      ± floating point number
255         0             2047        0             ± infinity
255         nonzero       2047        nonzero       NaN (not a number)

For 1 ≤ E ≤ 254 (single precision): $F = (-1)^{S} \times 1.f \times 2^{E - 127}$, with $1 \le 1.f < 2$
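The single-precision classification rules reduce to two tests on the exponent and fraction fields. A minimal Python sketch, with `classify` as a made-up name and the value first rounded to single precision by `struct`:

```python
import math
import struct

def classify(x):
    """Name the kind of single-precision object that x encodes."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    e = (bits >> 23) & 0xFF     # exponent field
    f = bits & 0x7FFFFF         # fraction field
    if e == 0:
        return "zero" if f == 0 else "denormalized"
    if e == 255:
        return "infinity" if f == 0 else "NaN"
    return "normalized"         # 1 <= e <= 254

for x in (0.0, 1e-40, 1.0, -math.inf, math.nan):
    print(x, classify(x))
```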

SLIDE 21

Smallest & Largest Numbers

The smallest non-zero positive and largest non-zero negative normalized numbers (represented by 1 in the Exp field and 0…0 in the Fraction field) are
$\pm 2^{-126} \approx \pm 1.175494351 \times 10^{-38}$
The smallest non-zero positive and largest non-zero negative denormalized numbers (represented by all 0s in the Exp field and 0…01 in the Fraction field) are
$\pm 2^{-149} \approx \pm 1.4012985 \times 10^{-45}$
The largest finite positive and smallest finite negative numbers (represented by 254 in the Exp field and 1…1 in the Fraction field) are
$\pm (2 - 2^{-23}) \times 2^{127} \approx \pm 3.40 \times 10^{38}$

SLIDE 22

FP Adder Hardware

[Figure: FP adder hardware block diagram, annotated with Step 1 through Step 4]

SLIDE 23

Single Precision Summary

Type                          Exponent    Mantissa                        Value
Zero                          0000 0000   000 0000 0000 0000 0000 0000    0
One                           0111 1111   000 0000 0000 0000 0000 0000    1
Denormalized number           0000 0000   100 0000 0000 0000 0000 0000    $5.9 \times 10^{-39}$
Largest normalized number     1111 1110   111 1111 1111 1111 1111 1111    $3.4 \times 10^{38}$
Smallest normalized number    0000 0001   000 0000 0000 0000 0000 0000    $1.18 \times 10^{-38}$
Infinity                      1111 1111   000 0000 0000 0000 0000 0000    Infinity
NaN                           1111 1111   010 0000 0000 0000 0000 0000    NaN
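Each row of the single-precision summary can be checked by assembling the exponent and mantissa bit strings into a 32-bit pattern and reinterpreting it as a float. In this Python sketch the helper name `value` and the default sign bit of 0 are assumptions made for illustration.

```python
import math
import struct

def value(exponent, mantissa, sign="0"):
    """Build a float from sign/exponent/mantissa bit strings (spaces allowed)."""
    bits = int((sign + exponent + mantissa).replace(" ", ""), 2)
    return struct.unpack(">f", struct.pack(">I", bits))[0]

rows = [
    ("Zero",                       "0000 0000", "000 0000 0000 0000 0000 0000"),
    ("One",                        "0111 1111", "000 0000 0000 0000 0000 0000"),
    ("Denormalized number",        "0000 0000", "100 0000 0000 0000 0000 0000"),
    ("Largest normalized number",  "1111 1110", "111 1111 1111 1111 1111 1111"),
    ("Smallest normalized number", "0000 0001", "000 0000 0000 0000 0000 0000"),
    ("Infinity",                   "1111 1111", "000 0000 0000 0000 0000 0000"),
    ("NaN",                        "1111 1111", "010 0000 0000 0000 0000 0000"),
]

for name, e, m in rows:
    print(f"{name:28s} {value(e, m):.3g}")
```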

SLIDE 24

Summary

Floating point numbers represent very large and very small numbers, including fractions
Number formats differ from 2’s complement
Requires some memorization
Addition requires aligning, adding, and then normalizing and rounding
Do examples!
The best way to learn floating point operations