
Floating point overflow and inexactness


I found some inconsistencies in how floating-point errors are handled on Intel hardware, and I'm wondering whether this is an Intel hardware error or just the way floating-point arithmetic works in general. Scenarios:

1) 10000 + maxfloat = 3.40282e+38 error produced: FE_INEXACT

2) maxfloat + maxfloat = inf errors: FE_OVERFLOW, FE_INEXACT

3) 1.1 * maxfloat = inf errors: FE_OVERFLOW, FE_INEXACT

Scenario 1 is inconsistent with the other two: I exceeded the float range but did not get an overflow as in cases 2 and 3.

I cannot understand why, in the first case, there is no overflow and the number simply saturates, while in the second and third cases the number does not saturate and I get an overflow.

#include <cfenv>
#include <iostream>
#include <limits>

// Print every floating-point exception flag that is currently raised.
void print_error() {
    const int err = std::fetestexcept(FE_ALL_EXCEPT);
    if (err & FE_INVALID) std::cout << "FE_INVALID " << std::endl;
    if (err & FE_DIVBYZERO) std::cout << "FE_DIVBYZERO " << std::endl;
    if (err & FE_OVERFLOW) std::cout << "FE_OVERFLOW " << std::endl;
    if (err & FE_UNDERFLOW) std::cout << "FE_UNDERFLOW " << std::endl;
    if (err & FE_INEXACT) std::cout << "FE_INEXACT " << std::endl;
    std::cout << std::endl;
}

int main() {
    // Scenario 1: a small addend on top of maxfloat
    std::feclearexcept(FE_ALL_EXCEPT);
    std::cout << std::numeric_limits<float>::max() + 100000.0f << std::endl;
    print_error();

    // Scenario 2: maxfloat + maxfloat
    std::feclearexcept(FE_ALL_EXCEPT);
    std::cout << std::numeric_limits<float>::max() + std::numeric_limits<float>::max() << std::endl;
    print_error();

    // Scenario 3: 1.1 * maxfloat
    std::feclearexcept(FE_ALL_EXCEPT);
    std::cout << 1.1f * std::numeric_limits<float>::max() << std::endl;
    print_error();
}

Solution

  • Scenario 1 is inconsistent with other two because I was exceeding float range but I did not get overflow like in case 2 and 3.

    The sum 10000 + maxfloat is not exactly representable as a float, so it is rounded and FE_INEXACT is raised. The two rounding candidates are the largest finite number, maxfloat, and the next larger number that could be represented "as if" float had additional exponent range. With round-to-nearest, the sum rounds to maxfloat because that is closer.

    In cases 2 and 3, the exact result rounds to this "as if" number or to something above it. Since the rounded result meets or exceeds that number, which float cannot represent, FE_OVERFLOW is raised and infinity is returned.


    Below is a number line showing the last 3 finite float values, including FLT_MAX.
    If float had more exponent range, the next 2 values after FLT_MAX would be the 2 on the right: 'FLT_MAX next "as if"' and an unnamed one.
    "Half-way" lies midway between FLT_MAX and that next "as if" value.

    When the sum is more than FLT_MAX but less than "Half-way", round-to-nearest yields FLT_MAX (case 1). When the sum is at or above "Half-way", the result is infinity (cases 2 and 3).

    [Image: number line showing the last 3 finite float values up to FLT_MAX, the "Half-way" point, and the next 2 "as if" values]