ECE232: Hardware Organization and Design, Lecture 9: Floating Point



SLIDE 1

Adapted from Computer Organization and Design, Patterson & Hennessy, UCB

ECE232: Hardware Organization and Design

Lecture 9: Floating Point

SLIDE 2

Floating Point

  • Representation for non-integral numbers
  • Including very small and very large numbers
  • Like scientific notation
  • –2.34 × 10^56 (normalized)
  • +0.002 × 10^–4 (not normalized)
  • +987.02 × 10^9 (not normalized)
  • In binary
  • ±1.xxxxxxx₂ × 2^yyyy
  • Types float and double in C

SLIDE 3

Floating Point Numbers

  • The largest 32-bit unsigned integer is

1111 1111 1111 1111 1111 1111 1111 1111 = 4,294,967,295

  • What if we want to encode the approximate age of the earth?

4,600,000,000 or 4.6 × 10^9

  • Or the weight in kg of one a.m.u. (atomic mass unit)?

0.0000000000000000000000000166 or about 1.66 × 10^–27

  • Neither value can be encoded in a 32-bit integer: the first is too large and the second is a tiny fraction.

SLIDE 4

Exponential Notation

  • The following are equivalent representations of 1,234

123,400.0 × 10^–2
12,340.0 × 10^–1
1,234.0 × 10^0
123.4 × 10^1
12.34 × 10^2
1.234 × 10^3
0.1234 × 10^4
0.01234 × 10^5

The representations differ in that the decimal point “floats” to the left or right, with the appropriate adjustment in the exponent.

SLIDE 5

Parts of a Floating Point Number

  • 0.9876 × 10^–3

Sign of mantissa: the leading sign (positive when omitted)
Location of decimal point: between the 0 and the 9876
Mantissa: 0.9876
Base: 10
Sign of exponent: –
Exponent: 3

The mantissa is also called the significand.

SLIDE 6

Single Precision Format

  • Note that the exponent has no explicit sign bit
  • Base?

S: Sign of mantissa (1 bit) | E: Exponent (8 bits) | M: Mantissa (23 bits)

Total: 32 bits
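The three fields can be pulled out of a stored value with a few shifts and masks. The following is a minimal sketch in Python rather than C; the function name `fields32` is hypothetical, and it assumes the standard library `struct` module packs a Python float into the 32-bit single-precision layout described on this slide.

```python
import struct


def fields32(x):
    """Return (S, E, M) for x stored as a 32-bit single-precision value."""
    # Pack as a big-endian float, then reinterpret the same 4 bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31            # 1 sign bit
    e = (bits >> 23) & 0xFF   # 8 exponent bits
    m = bits & 0x7FFFFF       # 23 mantissa (fraction) bits
    return s, e, m


s, e, m = fields32(14.5)
print(s, format(e, "08b"), format(m, "023b"))
```

Printing the exponent and mantissa in binary makes it easy to compare the result with hand-worked bit patterns.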

SLIDE 7

Normalization

  • The mantissa M is a normalized fraction
  • Has an implied binary point on the left
  • Has an implied (hidden) “1” to the left of the binary point
  • E.g.,
  • Fraction = 10100000000000000000000
  • Represents 1.101₂ = 1.625₁₀
  • The significand 1.f is in the range [1, 2 – ulp]
  • ulp: unit in the last position

F = (–1)^S × 1.f × 2^(E – Bias)

SLIDE 8

IEEE Floating-Point Format

  • S: sign bit (0 ⇒ non-negative, 1 ⇒ negative)
  • Normalize significand: 1.0 ≤ |significand| < 2.0
  • Always has a leading pre-binary-point 1 bit, so no need to represent it explicitly (hidden bit)
  • Significand is Fraction with the “1.” restored
  • Exponent: excess representation: actual exponent + Bias
  • Ensures exponent is unsigned
  • Single: Bias = 127; Double: Bias = 1023

S (1 bit) | Exponent (single: 8 bits, double: 11 bits) | Fraction (single: 23 bits, double: 52 bits)

x = (–1)^S × (1 + Fraction) × 2^(Exponent – Bias)

SLIDE 9

Single-Precision Range

  • Exponents 00000000 and 11111111 reserved
  • Smallest value
  • Exponent: 00000001 ⇒ actual exponent = 1 – 127 = –126
  • Fraction: 000…00 ⇒ significand = 1.0
  • ±1.0 × 2^–126 ≈ ±1.2 × 10^–38
  • Largest value
  • Exponent: 11111110 ⇒ actual exponent = 254 – 127 = +127
  • Fraction: 111…11 ⇒ significand ≈ 2.0
  • ±2.0 × 2^+127 ≈ ±3.4 × 10^+38

SLIDE 10

Floating-Point Example

  • Represent –0.75
  • –0.75 = (–1)^1 × 1.1₂ × 2^–1
  • S = 1
  • Fraction = 1000…00₂
  • Exponent = –1 + Bias
  • Single: –1 + 127 = 126 = 01111110₂
  • Double: –1 + 1023 = 1022 = 01111111110₂
  • Single: 1 01111110 1000…00
  • Double: 1 01111111110 1000…00
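The encoding of –0.75 can be checked mechanically. This is a small sketch, assuming Python's `struct` module as a stand-in for the hardware formats; the helper names `bits_single` and `bits_double` are made up for illustration.

```python
import struct


def bits_single(x):
    """Return the 32-bit single-precision pattern of x as a string of 0s and 1s."""
    (b,) = struct.unpack(">I", struct.pack(">f", x))
    return format(b, "032b")


def bits_double(x):
    """Return the 64-bit double-precision pattern of x as a string of 0s and 1s."""
    (b,) = struct.unpack(">Q", struct.pack(">d", x))
    return format(b, "064b")


s = bits_single(-0.75)
d = bits_double(-0.75)
# Split each pattern into sign, exponent, and fraction fields.
print(s[0], s[1:9], s[9:])
print(d[0], d[1:12], d[12:])
```

The two printed lines should show the same sign and fraction start as the hand-derived patterns, with an 8-bit exponent for single and an 11-bit exponent for double.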

SLIDE 11

Floating-Point Example

  • What number is represented by the single-precision float

1 10000001 01000…00

  • S = 1
  • Fraction = 01000…00₂
  • Exponent = 10000001₂ = 129
  • x = (–1)^1 × (1 + .01₂) × 2^(129 – 127)

= (–1) × 1.25 × 2^2 = –5.0
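The same decoding can be written as a direct evaluation of x = (–1)^S × (1 + Fraction) × 2^(Exponent – 127). The sketch below is illustrative only: `decode_single` is a hypothetical helper, it handles normalized numbers only, and it takes the bit pattern as a string so the fields stay visible.

```python
def decode_single(bit_string):
    """Evaluate a normalized single-precision pattern given as a string of bits."""
    bits = bit_string.replace(" ", "")
    if len(bits) != 32:
        raise ValueError("expected 32 bits")
    s = int(bits[0], 2)
    exponent = int(bits[1:9], 2)
    # The 23 fraction bits are weighted 2^-1, 2^-2, ..., 2^-23.
    fraction = int(bits[9:], 2) / 2**23
    if not 1 <= exponent <= 254:
        raise ValueError("exponents 0 and 255 are reserved")
    return (-1) ** s * (1 + fraction) * 2.0 ** (exponent - 127)


pattern = "1 10000001 01" + "0" * 21
print(decode_single(pattern))
```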

SLIDE 12

Floating-Point Addition

  • Consider a 4-digit decimal example
  • 9.999 × 10^1 + 1.610 × 10^–1
  • 1. Align decimal points
  • Shift number with smaller exponent
  • 9.999 × 10^1 + 0.016 × 10^1
  • 2. Add significands
  • 9.999 × 10^1 + 0.016 × 10^1 = 10.015 × 10^1
  • 3. Normalize result & check for over/underflow
  • 1.0015 × 10^2
  • 4. Round and renormalize if necessary
  • 1.002 × 10^2
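The 4-digit decimal example can be reproduced with Python's `decimal` module by limiting the precision to four significant digits. This is only a sketch of the idea: `decimal` computes the exact sum and then rounds once, whereas the slide drops digits while aligning, and the round-half-even mode is an assumption since the slide does not name a rounding rule. For these operands both routes give the same result.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# A context that keeps 4 significant digits, like the 4-digit example.
ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)

a = Decimal("9.999E+1")
b = Decimal("1.610E-1")
total = ctx.add(a, b)   # exact sum 100.151, rounded to 4 digits

print(total)
```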

SLIDE 13

Floating-Point Addition

  • Now consider a 4-digit binary example
  • 1.000₂ × 2^–1 + –1.110₂ × 2^–2 (0.5 + –0.4375)
  • 1. Align binary points
  • Shift number with smaller exponent
  • 1.000₂ × 2^–1 + –0.111₂ × 2^–1
  • 2. Add significands
  • 1.000₂ × 2^–1 + –0.111₂ × 2^–1 = 0.001₂ × 2^–1
  • 3. Normalize result & check for over/underflow
  • 1.000₂ × 2^–4, with no over/underflow
  • 4. Round and renormalize if necessary
  • 1.000₂ × 2^–4 (no change) = 0.0625

SLIDE 14

Steps in Addition/Subtraction

  • Step 1: Calculate the difference d of the two exponents: d = |E1 – E2|
  • Step 2: Shift the significand of the smaller number d positions to the right
  • Step 3: Add the aligned significands and set the exponent of the result to the exponent of the larger operand
  • Step 4: Normalize the resultant significand and adjust the exponent if necessary
  • Step 5: Round the resultant significand and adjust the exponent if necessary

Source: I. Koren, Computer Arithmetic Algorithms, 2nd Edition, 2002
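The five steps can be traced in software on small significands. The sketch below is an illustration under stated assumptions, not the hardware algorithm from the cited text: `fp_add` and `to_float` are hypothetical names, a number is a (signed integer significand, exponent) pair with 3 fraction bits (so 1.000₂ is stored as 8), three guard bits are kept during alignment, rounding is round-to-nearest-even, and overflow and underflow checks are omitted.

```python
def fp_add(a, b, frac_bits=3, guard_bits=3):
    """Add two numbers given as (significand, exponent) pairs."""
    (sig_a, exp_a), (sig_b, exp_b) = a, b
    if exp_a < exp_b:  # make a the operand with the larger exponent
        (sig_a, exp_a), (sig_b, exp_b) = (sig_b, exp_b), (sig_a, exp_a)

    # Step 1: difference of the two exponents.
    d = exp_a - exp_b

    # Step 2: shift the smaller operand right by d positions.
    # Guard bits keep a few of the digits that would otherwise be shifted out.
    sig_a <<= guard_bits
    sign_b = -1 if sig_b < 0 else 1
    sig_b = sign_b * ((abs(sig_b) << guard_bits) >> d)

    # Step 3: add; the result takes the larger exponent.
    total = sig_a + sig_b
    exp = exp_a
    if total == 0:
        return (0, 0)
    sign = -1 if total < 0 else 1
    mag = abs(total)

    # Step 4: normalize so that 1.0 <= significand < 2.0.
    one = 1 << (frac_bits + guard_bits)
    while mag >= 2 * one:
        mag >>= 1
        exp += 1
    while mag < one:
        mag <<= 1
        exp -= 1

    # Step 5: round to nearest even while dropping the guard bits,
    # then renormalize if rounding carried up to 2.0.
    q, r = divmod(mag, 1 << guard_bits)
    half = 1 << (guard_bits - 1)
    if r > half or (r == half and q & 1):
        q += 1
    if q >= 2 << frac_bits:
        q >>= 1
        exp += 1
    return (sign * q, exp)


def to_float(number, frac_bits=3):
    """Convert a (significand, exponent) pair to a Python float for checking."""
    sig, exp = number
    return sig / 2**frac_bits * 2.0**exp


# 1.000 x 2^-1 plus -1.110 x 2^-2, i.e. 0.5 + -0.4375
result = fp_add((8, -1), (-14, -2))
print(result, to_float(result))
```

Keeping the sign separate from the magnitude during the shift avoids Python's floor behavior when right-shifting negative integers.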

SLIDE 15

Example: Single precision

0 10000010 11010000000000000000000

  • Sign: 0 = positive mantissa
  • Exponent: 10000010₂ = 130, and 130 – 127 = 3
  • Fraction: 1101000…0, so the significand is 1.1101₂

+1.1101₂ × 2^3 = 1110.1₂ = 14.5₁₀

SLIDE 16

Converting to IEEE format

  • Example - decimal number: –3.154 × 10^0
  • What is the sign?
  • What is the exponent?
  • What is the mantissa?

Converting Mixed Numbers – Decimal to Binary

456.78₁₀ = 4 × 10^2 + 5 × 10^1 + 6 × 10^0 + 7 × 10^–1 + 8 × 10^–2

1011.11₂ = 1 × 2^3 + 0 × 2^2 + 1 × 2^1 + 1 × 2^0 + 1 × 2^–1 + 1 × 2^–2
= 8 + 0 + 2 + 1 + 1/2 + 1/4
= 11 + 0.5 + 0.25 = 11.75₁₀
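The positional expansions above follow one rule: each digit is multiplied by the base raised to its position. A short sketch, using a hypothetical helper `positional_value` and exact rational arithmetic from the standard `fractions` module so that no rounding creeps in:

```python
from fractions import Fraction


def positional_value(digits, base):
    """Sum digit × base^position for a numeral such as '1011.11'."""
    whole, _, frac = digits.partition(".")
    value = Fraction(0)
    # Digits left of the point have weights base^0, base^1, ...
    for i, ch in enumerate(reversed(whole)):
        value += int(ch, base) * Fraction(base) ** i
    # Digits right of the point have weights base^-1, base^-2, ...
    for i, ch in enumerate(frac, start=1):
        value += int(ch, base) * Fraction(base) ** -i
    return value


print(float(positional_value("1011.11", 2)))
print(float(positional_value("456.78", 10)))
```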

SLIDE 17

How to convert whole Decimal to Binary

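  • Successive division by 2
  • 57143₁₀ = 1101111100110111₂

Remainder   Value
1           1
1           3
0           6
1           13
1           27
1           55
1           111
1           223
0           446
0           892
1           1785
1           3571
0           7142
1           14285
1           28571
1           57143

Start at the bottom with 57143 and halve repeatedly, moving up; each row records the remainder of dividing its value by 2. Read from top to bottom, the remainders give 1101111100110111.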
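Successive division by 2 translates directly into a loop. A minimal sketch, with `to_binary` as an illustrative name:

```python
def to_binary(n):
    """Convert a non-negative integer to a binary string by repeated division by 2."""
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, r = divmod(n, 2)       # quotient carries on, remainder is the next bit
        remainders.append(str(r))
    # The first remainder is the least significant bit, so reverse at the end.
    return "".join(reversed(remainders))


print(to_binary(57143))
```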

SLIDE 18

Converting fractional Decimal to Binary

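Successive multiplication by 2, starting from 0.154. A product of 1 or more yields a 1 bit, and only its fractional part is carried to the next step.

Step   Product   Bit
1      0.308     0
2      0.616     0
3      1.232     1
4      0.464     0
5      0.928     0
6      1.856     1
7      1.712     1
8      1.424     1
9      0.848     0
10     1.696     1
11     1.392     1
12     0.784     0
13     1.568     1
14     1.136     1
15     0.272     0
16     0.544     0
17     1.088     1
18     0.176     0
19     0.352     0
20     0.704     0
21     1.408     1
22     0.816     0
23     1.632     1

Decimal 0.154 = .0010 0111 0110 1100 1000 101… in binary (first 23 bits)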
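The successive-multiplication procedure can be sketched as follows. The helper name `fraction_bits` is hypothetical, and the input is passed as a decimal string and held as an exact `Fraction` so that the bits match the hand computation rather than a binary approximation of 0.154.

```python
from fractions import Fraction


def fraction_bits(value, n_bits):
    """Return the first n_bits binary digits of a fraction 0 <= value < 1."""
    frac = Fraction(value)        # e.g. Fraction("0.154") is exactly 154/1000
    bits = []
    for _ in range(n_bits):
        frac *= 2
        if frac >= 1:             # integer part 1: emit a 1 and keep the remainder
            bits.append("1")
            frac -= 1
        else:
            bits.append("0")
    return "".join(bits)


print(fraction_bits("0.154", 23))
```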

SLIDE 19

Floating Point Special Representations

F = (–1)^S × 1.f × 2^(E – 127), with 1 ≤ 1.f < 2

  • There are two zeros, ±0, and two infinities, ±∞
  • NaN (Not-a-Number) may have a sign and has a non-zero fraction; it is used for program diagnostics
  • NaNs and infinities have all 1s in the Exp field, E = 255
  • For finite F: F + ∞ = ∞, F/∞ = 0

Source: I. Koren, Computer Arithmetic Algorithms, 2nd Edition, 2002

SLIDE 20

Floating Point Special Representations

Single Precision        Double Precision        Object represented
Exponent   Fraction     Exponent   Fraction
0          0            0          0            ± zero
0          nonzero      0          nonzero      ± denormalized number
1-254      Anything     1-2046     Anything     ± floating point number
255        0            2047       0            ± infinity
255        nonzero      2047       nonzero      NaN (not a number)

For normalized single-precision numbers:

F = (–1)^S × 1.f × 2^(E – 127), with 1 ≤ 1.f < 2 and 1 ≤ E ≤ 254
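The single-precision half of the table amounts to a two-level test on the exponent and fraction fields. A sketch, assuming the pattern is supplied as a 32-bit integer; `classify_single` and its return strings are illustrative names, not part of any library.

```python
def classify_single(bits):
    """Classify a 32-bit single-precision pattern by its exponent and fraction fields."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == 255:
        return "infinity" if fraction == 0 else "NaN"
    return "normalized"   # exponent in 1..254, any fraction


for pattern in (0x00000000, 0x00400000, 0x3F800000, 0x7F800000, 0x7FA00000):
    print(format(pattern, "032b"), classify_single(pattern))
```

The sign bit is ignored, which matches the ± entries in the table.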

SLIDE 21

Smallest & Largest Numbers

  • The smallest non-zero positive and largest non-zero negative normalized numbers (represented by 1 in the Exp field and 0…0 in the Fraction field) are
  • ±2^−126 ≈ ±1.175494351 × 10^−38
  • The smallest non-zero positive and largest non-zero negative denormalized numbers (represented by all 0s in the Exp field and 0…01 in the Fraction field) are
  • ±2^−149 ≈ ±1.4012985 × 10^−45
  • The largest finite positive and smallest finite negative numbers (represented by 254 in the Exp field and 1…1 in the Fraction field) are
  • ±(2 − 2^−23) × 2^127 ≈ ±3.40 × 10^38
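These limits can be confirmed by building the three bit patterns and reinterpreting them as single-precision values. A minimal sketch, assuming Python's `struct` module; `single_from_bits` is a hypothetical helper name.

```python
import struct


def single_from_bits(bits):
    """Reinterpret a 32-bit integer pattern as a single-precision value."""
    (x,) = struct.unpack(">f", struct.pack(">I", bits))
    return x


smallest_normal = single_from_bits(0x00800000)    # Exp = 1,   Fraction = 0…0
smallest_denormal = single_from_bits(0x00000001)  # Exp = 0,   Fraction = 0…01
largest_finite = single_from_bits(0x7F7FFFFF)     # Exp = 254, Fraction = 1…1

print(smallest_normal, smallest_denormal, largest_finite)
```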

SLIDE 22

FP Adder Hardware

[Figure: FP adder hardware block diagram, labeled Step 1, Step 2, Step 3, Step 4]

SLIDE 23

Single Precision Summary

Type                         Exponent    Mantissa                        Value
Zero                         0000 0000   000 0000 0000 0000 0000 0000    0.0
One                          0111 1111   000 0000 0000 0000 0000 0000    1
Denormalized number          0000 0000   100 0000 0000 0000 0000 0000    5.9 × 10^–39
Largest normalized number    1111 1110   111 1111 1111 1111 1111 1111    3.4 × 10^38
Smallest normalized number   0000 0001   000 0000 0000 0000 0000 0000    1.18 × 10^–38
Infinity                     1111 1111   000 0000 0000 0000 0000 0000    Infinity
NaN                          1111 1111   010 0000 0000 0000 0000 0000    NaN

SLIDE 24

Summary

  • Floating point numbers can represent very large and very small numbers, including fractions
  • Number formats differ from 2’s complement
  • Requires some memorization
  • Addition requires aligning, adding, and then renormalizing
  • Do examples!
  • The best way to learn floating point operations