Search code examples
cfloating-pointprecision

Is there one way to alleviate roundoff errors?


Wikipedia about "guard bits" offers one example codes:

#include <stdio.h>
int main(){
   double a;
   int i;

   a = 0.2; 
   a += 0.1; 
   a -= 0.3;

   for (i = 0; a < 1.0; i++) 
       a += a;

   printf("i=%d, a=%f\n", i, a);

   return 0;
}

With my zen2 r7 4800h cpu, I compiled the above source code Guard_digit.c by gcc Guard_digit.c -std=c17 -march=znver2 -pedantic -O0 -o With_Guard_digit.o. Then it outputs same as wikipedia i=54, a=1.000000.

And as this note says, IEEE standard has implemented the guard digit:

The IEEE standard requires the use of 3 extra bits of less significance than the 24 bits (of mantissa) implied in the single precision representation.

mantissa format plus extra bits:

1.XXXXXXXXXXXXXXXXXXXXXXX   0   0   0                                                                                                                                          

^         ^                 ^   ^   ^
|         |                 |   |   |
|         |                 |   |   -  sticky bit (s)
|         |                 |   -  round bit (r)
|         |                 -  guard bit (g)
|         -  23 bit mantissa from a representation
-  hidden bit

Q: Is there one way to solve with this precision and roundoff problem by changing the source codes or others (i.e. the error offset can be alleviated to some degree so that it may output something like i=108, a=1.000000)?

Edit after viewing the answer by Eric Postpischil:

Sorry for unclearly describing the problem. I want to know how to solve the roundoff problem by keeping the original calculations, so directly a = 0; is not taken in account.

I want to solve this specific problem, but not general. This is beyond my current range just as the comment says.


Solution

  • Q: Is there one way to solve with this precision and roundoff problem by changing the source codes or others (i.e. the error offset can be alleviated to some degree so that it may output something like i=108, a=1.000000)?

    In common C implementations, it is impossible to produce an a by adding and/or subtracting values at or above .0625 that will cause the loop shown to terminate after iterating more than 57 times.

    This is because common C implementations use IEEE-754 binary64, also called “double precision,” for double, and binary64 uses 53 significant bits. This means that a value in the binade starting at .0625 is represented with a significand whose high bit has position value 2−4 (.0625) and whose low bit has position value 2−56 (spanning 53 bits, including both endpoints).

    Adding and subtracting can carry bits to high positions, as taught in elementary school arithmetic, but can never generate non-zero bits below the lowest input position. Therefore, any result produced by adding and subtracting values greater than or equal to .0625 cannot have any non-zero bits below 2−56.

    Therefore, when entering the loop after performing such arithmetic, we have one of the following cases:

    • a is negative or zero, and the loop never terminates.
    • a is 2−56 or greater, and iterating 57 times or fewer will making it greater than 1.

    Is there one way to solve with this precision and roundoff problem by changing the source codes…

    Obviously the correct result of 0.2 + 0.1 - 0.3 can be obtained by changing the source code from:

    a = 0.2; 
    a += 0.1; 
    a -= 0.3;
    

    to:

    a = 0;
    

    This is a common problem in computing: You cannot formally describe the general problem you want to solve by asking “How do I get a solution for these values?”, because then there is simple solution that is just the one answer for those values, and it does not help you generally.

    Instead, you must describe the entire class of problems. For example, you could ask: “How can I write code that finds the exact decimal sum of up 30 positive and negative decimal numerals with at most three digits after the decimal point and two digits before the decimal point?”

    Note further you do not wish to go too far in the other direction, making the problem fully general instead of fully specific. If the problem is to add and subtract any decimal numerals with no error, then you must write arbitrary precision arithmetic. If the problem is to add and subtract some modest number of decimal numerals with some modest number of digits, then the problem may be solvable by using double arithmetic with well-chosen rounding. And specific solutions may depend on the parameters you choose. So you need to characterize the problem well.