Why do IEEE 754 floating-point numbers use a sign bit of 1 for negative numbers?


The typical reason given for using a biased exponent (also known as offset binary) in floating-point numbers is that it makes comparisons easier.

By arranging the fields so that the sign bit occupies the most significant bit position, the biased exponent the middle bits, and the significand the least significant bits, the resulting values are ordered properly whether they are interpreted as floating-point values or as integers. The purpose of this is to enable high-speed comparisons between floating-point numbers using fixed-point hardware.
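
To make that layout concrete, here is a minimal C sketch (assuming float is an IEEE 754 binary32 type; the helper name bits_of is just for illustration) that extracts the three fields and checks that, for positive values, numeric order matches the order of the unsigned bit patterns:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Reinterpret a float's bits as a 32-bit unsigned integer. */
    static uint32_t bits_of(float f)
    {
        uint32_t u;
        memcpy(&u, &f, sizeof u);
        return u;
    }

    int main(void)
    {
        uint32_t u = bits_of(1.5f);
        uint32_t sign        = u >> 31;          /* most significant bit   */
        uint32_t biased_exp  = (u >> 23) & 0xFF; /* next 8 bits            */
        uint32_t significand = u & 0x7FFFFF;     /* low 23 bits (fraction) */
        printf("sign=%u exp=%u frac=0x%06X\n", sign, biased_exp, significand);

        /* For positive finite floats, numeric order matches unsigned bit order. */
        printf("%d\n", bits_of(1.0f) < bits_of(2.0f));   /* prints 1 */
        printf("%d\n", bits_of(2.0f) < bits_of(100.0f)); /* prints 1 */
        return 0;
    }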

However, because the sign bit of IEEE 754 floating-point numbers is 1 for negative numbers and 0 for positive numbers, the integer representation of a negative floating-point number is greater than that of any positive floating-point number. If the convention were reversed, this would not be the case: every positive floating-point number interpreted as an unsigned integer would be greater than every negative one.
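
A quick self-contained illustration of this, under the same binary32 assumption:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint32_t bits_of(float f)
    {
        uint32_t u;
        memcpy(&u, &f, sizeof u);
        return u;
    }

    int main(void)
    {
        /* The sign bit of -1.0f is set, so its bit pattern, read as an
           unsigned integer, exceeds that of any positive float. */
        printf("0x%08X 0x%08X\n", bits_of(-1.0f), bits_of(1.0f)); /* 0xBF800000 0x3F800000 */
        printf("%d\n", bits_of(-1.0f) > bits_of(1.0f));           /* prints 1 */
        return 0;
    }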

I understand this wouldn't completely trivialize comparisons, because NaN != NaN, which must be handled separately (although whether that behaviour is even desirable is itself debatable, as discussed in that question). Regardless, it seems strange that this is the reason given for using a biased exponent when it is seemingly defeated by the sign convention chosen for the sign-and-magnitude representation.
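
For example (same assumptions as above, plus a platform that defines the NAN macro), an integer comparison of two identical NaN bit patterns reports equality even though the floating-point comparison does not:

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint32_t bits_of(float f)
    {
        uint32_t u;
        memcpy(&u, &f, sizeof u);
        return u;
    }

    int main(void)
    {
        float a = NAN, b = NAN;
        printf("%d\n", a == b);                   /* 0: IEEE 754 says NaN != NaN */
        printf("%d\n", bits_of(a) == bits_of(b)); /* 1: identical bit patterns   */
        return 0;
    }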

There is more discussion on the questions "Why do we bias the exponent of a floating-point number?" and "Why IEEE floating point number calculate exponent using a biased form?" The accepted answer to the first even mentions this:

The IEEE 754 encodings have a convenient property that an order comparison can be performed between two positive non-NaN numbers by simply comparing the corresponding bit strings lexicographically, or equivalently, by interpreting those bit strings as unsigned integers and comparing those integers. This works across the entire floating-point range from +0.0 to +Infinity (and then it's a simple matter to extend the comparison to take sign into account).

I can imagine two reasons: first, using a sign bit of 1 for negative values allows IEEE 754 floating-point numbers to be defined in the form (-1)^s × 1.f × 2^(e-b); and second, the floating-point number corresponding to a bit string of all 0s is equal to +0 instead of -0.
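
For concreteness, here is how that form decodes one bit pattern (a sketch assuming binary32, so the bias b is 127):

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Decode 0xC0000000 by hand using (-1)^s * 1.f * 2^(e-b). */
        uint32_t u = 0xC0000000u;
        int      s = u >> 31;                     /* 1                      */
        int      e = (u >> 23) & 0xFF;            /* 128                    */
        double   f = (u & 0x7FFFFF) / 8388608.0;  /* fraction / 2^23 = 0.0  */
        double   v = (s ? -1.0 : 1.0) * ldexp(1.0 + f, e - 127);
        printf("%g\n", v);                        /* -2                     */

        float g;
        memcpy(&g, &u, sizeof g);                 /* the hardware agrees    */
        printf("%g\n", g);                        /* -2                     */
        return 0;
    }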

I don't see either of these as being meaningful especially considering the common rationale for using a biased exponent.


Solution

  • I found the reference "Radix Tricks" on the Wikipedia article for the IEEE 754 standard. In the section titled "Floating point support", the author describes the steps needed to compare two floating-point numbers (specifically, 32-bit IEEE 754 single-precision numbers) by transforming their bit patterns and comparing the results as unsigned integers.

    In it, the author points out that simply flipping the sign bit is insufficient because the encoded significand of a negative number with larger magnitude, interpreted as an unsigned integer, is greater than that of a negative number with smaller magnitude, even though the larger-magnitude negative number should compare as less. Similarly, a negative number with a larger biased exponent is actually less than one with a smaller biased exponent, so negative numbers with unbiased exponent emax are less than those with unbiased exponent emin.

    In order to correct for this, the sign bit should be flipped for non-negative numbers (sign bit 0), and all bits should be flipped for negative numbers (sign bit 1). The author presents an algorithm along these lines:

    #include <stdint.h>

    /* Returns 1 if the float whose bits are in f1 is less than the float whose
       bits are in f2: flip only the sign bit of non-negative values, flip every
       bit of negative values, then compare the results as unsigned integers. */
    uint32_t cmp(uint32_t f1, uint32_t f2)
    {
        f1 ^= (-(f1 >> 31)) | 0x80000000u;
        f2 ^= (-(f2 >> 31)) | 0x80000000u;
        return f1 < f2;
    }
    
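    As a quick check of the point about negative numbers above (assuming the cmp function above is in scope and float is binary32), the raw bit patterns of -100.0 and -1.0 order the wrong way as unsigned integers, while the transformed values order correctly:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint32_t bits_of(float f)
    {
        uint32_t u;
        memcpy(&u, &f, sizeof u);
        return u;
    }

    int main(void)
    {
        printf("%d\n", bits_of(-100.0f) > bits_of(-1.0f));     /* 1: raw order is wrong    */
        printf("%u\n", cmp(bits_of(-100.0f), bits_of(-1.0f))); /* 1: -100 < -1, as desired */
        printf("%u\n", cmp(bits_of(-1.0f), bits_of(2.0f)));    /* 1: -1 < 2                */
        return 0;
    }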

    The point of walking through this is to clarify that inverting the sign-bit convention does not, by itself, make it possible to compare finite floating-point numbers directly as unsigned integers. On the contrary, using sign-and-magnitude hardware (which must interpret the sign bit as a sign bit, not as part of an unsigned integer) requires no additional bitwise operations and should therefore result in the simplest, smallest, and most efficient design.

    It is possible to create a floating-point encoding that uses 2's complement, and such formats have been studied, as detailed in this paper. However, that is far beyond the scope of the question and involves many additional complexities and problems to be solved. Perhaps there is a better way, but the IEEE 754 design has the advantage of being well established and has proven satisfactory for the vast majority of use cases.