Search code examples
c++floating-pointfloating-accuracysubtractionrounding-error

c++ Floating point subtraction error and absolute values


The way I understand it is: when subtracting two double numbers with double precision in c++ they are first transformed to a significand starting with one times 2 to the power of the exponent. Then one can get an error if the subtracted numbers have the same exponent and many of the same digits in the significand, leading to loss of precision. To test this for my code I wrote the following safe addition function:

double Sadd(double d1, double d2, int& report, double prec) {
    int exp1, exp2;
    double man1=frexp(d1, &exp1), man2=frexp(d2, &exp2);
    if(d1*d2<0) {
        if(exp1==exp2) {
            if(abs(man1+man2)<prec) {
                cout << "Floating point error" << endl;
                report=0;
            }
        }
    }
    return d1+d2;
}

However, testing this I notice something strange: it seems that the actual error (not whether the function reports error but the actual one resulting from the computation) seems to depend on the absolute values of the subtracted numbers and not just the number of equal digits in the significand...

For examples, using 1e-11 as the precision prec and subtracting the following numbers:

1) 9.8989898989898-9.8989898989897: The function reports error and I get the highly incorrect value 9.9475983006414e-14

2) 98989898989898-98989898989897: The function reports error but I get the correct value 1

Obviously I have misunderstood something. Any ideas?


Solution

  • If you subtract two floating-point values that are nearly equal, the result will mostly reflect noise in the low bits. Nearly equal here is more than just same exponent and almost the same digits. For example, 1.0001 and 1.0000 are nearly equal, and subtracting them could be caught by a test like this. But 1.0000 and 0.9999 differ by exactly the same amount, and would not be caught by a test like this.

    Further, this is not a safe addition function. Rather, it's a post-hoc check for a design/coding error. If you're subtracting two values that are so close together that noise matters you've made a mistake. Fix the mistake. I'm not objecting to using something like this as a debugging aid, but please call it something that implies that that's what it is, rather than suggesting that there's something inherently dangerous about floating-point addition. Further, putting the check inside the addition function seems excessive: an assert that the two values won't cause problems, followed by a plain old floating-point addition, would probably be better. After all, most of the additions in your code won't lead to problems, and you'd better know where the problem spots are; put asserts in the problems spots.