Search code examples
c++floating-pointroundinginfinityrounding-error

Floating point multiplication results in infinity even though rounding to nearest


Consider the following C++ code:

#include <fenv.h>
#include <iostream>
using namespace std;

int main(){
    fesetround(FE_TONEAREST);
    double a = 0x1.efc7f0001p+376;
    double b = -0x1.0fdfdp+961;
    double c = a*b;
    cout << a << " "<< b << " " << c << endl;
}

The output that I see is

2.98077e+113 -2.06992e+289 -inf

I do not understand why c is infinity. My understanding is that whatever the smallest non-infinity floating point value is, it should be closer to the actual value of a*b than -inf as the minimum non-infinity floating point number is finite and any finite number is closer to any other finite number than negative infinity. Why is infinity outputted here?

This was run on 64bit x86 and the assembly uses SSE instructions. It was compiled with -O0 and happens both with clang and gcc.

The result is the minimum finite floating point if the round towards zero mode is used for floating points. I conclude that the issue is rounding related.


Solution

  • The behaviour you note (when using the FE_TONEAREST rounding mode) conforms to the IEEE-754 (or the equivalent ISO/IEC/IEEE 60559:2011) Standard. (Your examples imply that your platform uses IEEE-754 representation, as many – if not most – platforms do, these days.)

    From this Wikipedia page, footnote #18, which cites IEEE-754 §4.3.1, dealing with the "rounding to nearest" modes:

    In the following two rounding-direction attributes, an infinitely precise result with magnitude at least bemax(b-½b(1-p)) shall round to (infinity) with no change in sign.

    The 'infinitely precise' result of your a * b calculation does, indeed, have a magnitude greater than the specified value, so the rounding is Standard-conformant.


    I can't find a similar IEEE-754 citation for the "round-towards-zero" mode, but this GNU libc manual has this to say:

    Round toward zero.
    All results are rounded to the largest representable value whose magnitude is less than that of the result. In other words, if the result is negative it is rounded up; if it is positive, it is rounded down.

    So, again, when using that mode, rounding to -DBL_MAX is appropriate/conformant.