
Floating-point overflows to negative


We know that a signed integer can overflow so that, say, the sign bit flips from 0 to 1, causing a positive integer to become negative.

Can the same happen with a floating-point number? Experimentally, when the number gets too big, it just becomes Inf. But wouldn't it be possible to overflow the mantissa or the exponent, causing a similar problem?


Solution

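  • IEEE 754 floating-point types (32-bit float, 64-bit double, and the 80-bit x87 extended format often used for long double) are stored as sign and magnitude rather than two's complement. The sign has its own bit, so overflowing the exponent or the mantissa cannot flip it: under the default rounding mode, a result too large to represent becomes infinity with the same sign. The exponent field does not use its full range for ordinary numbers either: all-zero bits encode zero and subnormals, and all-one bits encode infinity and NaN. Wikipedia article for double: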

    https://en.wikipedia.org/wiki/Double-precision_floating-point_format
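
One way to see that the sign bit is separate from the magnitude is to overflow a double in both directions. The C sketch below assumes IEEE 754 doubles and the default round-to-nearest mode; the function names are illustrative and not part of the original answer.

```c
#include <float.h>
#include <math.h>
#include <stdbool.h>

/* Doubling the largest finite double overflows the exponent range.
   Assumption: IEEE 754 binary64 and the default rounding mode. */
static double overflow_positive(void) {
    volatile double big = DBL_MAX;      /* volatile discourages constant folding */
    volatile double result = big * 2.0; /* stored as a double, so it rounds to +inf */
    return result;
}

/* Same overflow starting from the most negative finite double. */
static double overflow_negative(void) {
    volatile double big = -DBL_MAX;
    volatile double result = big * 2.0; /* rounds to -inf */
    return result;
}

/* True when x is an infinity whose sign matches `negative`. */
static bool is_inf_with_sign(double x, bool negative) {
    return isinf(x) && ((signbit(x) != 0) == negative);
}
```

Calling is_inf_with_sign(overflow_positive(), false) and is_inf_with_sign(overflow_negative(), true) should both return true on such a platform: the overflow saturates to infinity and the sign is left alone.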

    If doing something like a radix sort on an array of floating-point values that does not include the special cases (infinity, NaN, ...), the sign-and-magnitude bit pattern is normally converted to a "two's complement"-like form whose unsigned integer order matches the numeric order. The example C macros below convert a 64-bit sign-and-magnitude bit pattern, held in an unsigned long long (64-bit unsigned integer), to such a key and back. Note that the converted value for negative zero is less than the one for positive zero.

    // convert the bits of a double to an unsigned long long key for radix sort or similar
    // x must be an unsigned long long holding the raw bits of the double
    // note -0 converted to 0x7fffffffffffffff, +0 converted to 0x8000000000000000
    // -0 can arise (for example from 0.0 * -1.0); it then sorts just before +0

    #define SM2ULL(x) ((x)^(((~(x) >> 63)-1) | 0x8000000000000000ull))
    #define ULL2SM(x) ((x)^((( (x) >> 63)-1) | 0x8000000000000000ull))
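
The macros operate on the raw bits, so a double first has to be reinterpreted as a 64-bit unsigned integer. A minimal sketch of that step, assuming double is IEEE 754 binary64 and using memcpy for the reinterpretation; the function names are hypothetical, and the two bit-twiddling functions simply restate the macros above:

```c
#include <stdint.h>
#include <string.h>

/* Same expression as SM2ULL: flip only the sign bit of non-negative values,
   flip every bit of negative values. */
static uint64_t sm_to_key(uint64_t x) {
    return x ^ (((~x >> 63) - 1) | 0x8000000000000000ull);
}

/* Same expression as ULL2SM: the inverse of sm_to_key. */
static uint64_t key_to_sm(uint64_t k) {
    return k ^ (((k >> 63) - 1) | 0x8000000000000000ull);
}

/* Reinterpret the double's bits (assumed 64-bit IEEE 754) and build the key. */
static uint64_t double_to_key(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return sm_to_key(bits);
}

/* Recover the original double from a key. */
static double key_to_double(uint64_t k) {
    uint64_t bits = key_to_sm(k);
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

With these, comparing double_to_key(a) and double_to_key(b) as unsigned integers gives the same order as comparing a and b for ordinary finite values, and key_to_double(double_to_key(a)) returns a unchanged.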