Understanding how these floating point numbers work?


I'm having a little difficulty understanding how floating point numbers work, specifically the representations below (please correct my mistakes):

  1. Representing 0: zero is represented by all 0 bits in the exponent field (8 bits in single precision and 11 in double precision). If the exponent bits are all zeros, is the value still zero even if the mantissa is not all zeros?

  2. Wikipedia shows that zero is represented by (−1)^{signbit} × 2^{−126} × 0.significandbits. Why is it 2^{−126} when the lowest exponent value we can reach is 2^{−127}?

  3. Representing denormal numbers: I suppose denormal numbers use this format as well: (−1)^{signbit} × 2^{−126} × 0.significandbits. They are used to represent values smaller than the smallest normal number. I'm guessing that number is 2^{−127}, but if denormal numbers are represented this way, wouldn't they still represent larger values than normal ones?

  4. Normalised numbers: (−1)^{signbit} × 2^{exponentbits−127} × 1.significandbits. I'm supposing the exponent bits hold an unsigned value from 0 to 255, since they are not stored in two's complement form.

  5. Plus/minus infinity: represented by all 1 bits in the exponent field. Again, does a non-zero mantissa matter if we use this representation to signify infinity?


Solution

  • Per IEEE 754-2008:

    • NaN: If the exponent field is all ones and the significand field is not zero, the floating-point datum is a NaN, regardless of the sign field. Preferably, a quiet NaN (QNaN) has the leading bit of the significand field set to 1 and a signaling NaN has it set to 0, but this is not required.
    • Infinite: If the exponent field is all ones and the significand field is zero, the datum is (−1)^s • ∞, where s is the sign field. (I.e., +∞ if the sign is 0 and −∞ if the sign is 1.)
    • Normal: If the exponent field is neither all zeros nor all ones, the datum is (−1)^s • (1 + f • 2^{−q}) • 2^{e − bias}, where s is the sign field, f is the significand field, q is the number of bits in the significand field, e is the exponent field, and bias is the exponent bias (127 for 32-bit floating-point).
    • Subnormal: If the exponent field is all zeros, and the significand field is not, the datum is (−1)^s • (0 + f • 2^{−q}) • 2^{1 − bias}. Note the two differences from normal: 0 is added to the significand instead of 1, and 1 is used for the exponent (before subtracting bias). This means subnormals have the same exponent as the smallest normals but are decreased by reducing the significand.
    • Zero: If the exponent field is all zeros, and the significand field is also all zeros, the datum is (−1)^s • 0. (Note that IEEE 754 distinguishes +0 and −0.)
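The five cases above can be sketched as a small decoder for 32-bit patterns. This is only an illustration under stated assumptions: the names `decode_binary32`, `hardware_value`, `BIAS` and `Q` are made up for this example, and the arithmetic is done in Python's double-precision floats, which can hold every 32-bit value exactly.

```python
import struct

BIAS = 127   # exponent bias for 32-bit floating-point
Q = 23       # number of bits in the significand field

def decode_binary32(bits):
    """Classify a 32-bit pattern and compute its value from the three fields."""
    s = (bits >> 31) & 0x1       # sign field
    e = (bits >> 23) & 0xFF      # exponent field (unsigned, 0 to 255)
    f = bits & 0x7FFFFF          # significand field
    sign = -1.0 if s else 1.0
    if e == 0xFF:
        if f != 0:
            return "NaN", float("nan")
        return "infinite", sign * float("inf")
    if e == 0:
        # Subnormal formula: 0 is added instead of 1, exponent is 1 - bias.
        kind = "subnormal" if f != 0 else "zero"
        return kind, sign * (0 + f * 2.0**-Q) * 2.0**(1 - BIAS)
    return "normal", sign * (1 + f * 2.0**-Q) * 2.0**(e - BIAS)

def hardware_value(bits):
    """Reinterpret the same 32 bits as a float via struct, for comparison."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]
```

For example, `decode_binary32(0x3F800000)` classifies the pattern as normal with value 1.0, and for any non-NaN pattern the computed value should agree with `hardware_value` applied to the same bits.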

    The exponent used with subnormals is 1 rather than 0 so that the numbers change from (normal) 1.000…000 • 2^{1−127} to (subnormal) 0.111…111 • 2^{1−127}. If 0 were used, there would be a jump to 0.0111…1111 • 2^{1−127}.
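A quick way to see this continuity is to compare the smallest normal number with the largest subnormal. The sketch below assumes 32-bit floats and uses a hypothetical helper `f32` that reinterprets a bit pattern through Python's `struct` module; `hypothetical` is the value the largest subnormal would have if exponent 0 were used instead of 1.

```python
import struct

def f32(bits):
    """Reinterpret a 32-bit pattern as a float (hypothetical helper)."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

smallest_normal = f32(0x00800000)    # exponent field 1, significand field 0
largest_subnormal = f32(0x007FFFFF)  # exponent field 0, significand field all ones

# Step across the normal/subnormal boundary.
gap = smallest_normal - largest_subnormal

# Step between two adjacent subnormals, for comparison.
subnormal_spacing = f32(0x00000002) - f32(0x00000001)

# What the largest subnormal would be if the exponent were 0 - bias.
hypothetical = (0x7FFFFF * 2.0**-23) * 2.0**(0 - 127)
jump = smallest_normal - hypothetical
```

With the real encoding, `gap` equals `subnormal_spacing` (both are 2^{−149}), so the values are evenly spaced across the boundary. With the hypothetical exponent, `jump` is a little more than half of the smallest normal number, which is the discontinuity described above.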

    The formula for the values of subnormals works for zeros too. So zeros do not actually need to be listed separately above.
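As a small check of that remark, the subnormal formula can be evaluated with a significand field of zero. The function name `subnormal_formula` and its default parameters (23 significand bits, bias 127) are assumptions chosen for the 32-bit case.

```python
def subnormal_formula(s, f, q=23, bias=127):
    """Value given by the subnormal formula for sign field s and significand field f."""
    return (-1.0)**s * (0 + f * 2.0**-q) * 2.0**(1 - bias)

positive_zero = subnormal_formula(0, 0)
negative_zero = subnormal_formula(1, 0)
smallest_subnormal = subnormal_formula(0, 1)
```

Both zero results compare equal to 0.0, and the sign field still carries through, so `negative_zero` is the distinct −0 value that IEEE 754 defines.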