Tags: c, matlab, precision, long-double

Floating point inexact result with not-too-small decimal fraction - is there a way to overcome?


I have a piece of code that repeatedly decreases a variable that starts at 1. After two iterations, the result is inexact, which came as a surprise.

The effect can be reproduced with the following two lines:

>> (1 - 0.1) - 0.9
0
>> (1 - 0.05 - 0.05) - 0.9
-1.110223024625157e-16

The result in the second case is not zero, in both Matlab and Octave.

Similarly, the following C code shows the problem:

#include <stdio.h>

int main(void) {
  double x1, x2;
  x1 = 1 - 0.05 - 0.05;
  x2 = 1 - 0.1;

  printf("x1 exact:%d, x2 exact:%d\n", x1 == 0.9, x2 == 0.9);
  return 0;
}

Compiled with gcc 4.7.0 on an Intel Xeon(R) CPU E5-2680, the result is

x1 exact:0, x2 exact:1

showing that the first calculation is inexact while the second is exact.

I can overcome this in C by using long double (suffix "L" on all literals, and the variables declared as long double). This leaves me somewhat confused: I assumed the inexact result could be explained by the fact that 0.05 does not have an exact representation in base 2; but using long double does not change this fact... so why is the result different?
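
For reference, the long double variant looks like this (on my system it makes the first comparison exact too, but this behavior is platform-dependent):

#include <stdio.h>

int main(void) {
  /* Same computation, with long double variables and "L" literals. */
  long double x1 = 1.0L - 0.05L - 0.05L;
  long double x2 = 1.0L - 0.1L;

  /* On my machine this prints "x1 exact:1, x2 exact:1";
     other platforms and long double formats may round differently. */
  printf("x1 exact:%d, x2 exact:%d\n", x1 == 0.9L, x2 == 0.9L);
  return 0;
}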

What I really want is a way to overcome it in Matlab. Any ideas?

Interestingly, MS Excel's result for this same calculation is exact in both cases.


Solution

  • … the result is … showing that the first calculation is inexact while the second is exact.

    This inference is incorrect. The fact that 1 - .05 - .05 == .9 is false and 1 - .1 == .9 is true shows that the rounding errors that occur in computing 1 - .05 - .05 do not produce a result that equals .9, while the rounding errors that occur in computing 1 - .1 do produce a result that equals .9. Since the floating-point result of using .9 in source code is itself not .9, the fact that 1 - .1 equals .9 does not mean that 1 - .1 is exact. It just means it incurred the same rounding error as .9.

    In the expression 1 - .1 == .9, there are three rounding errors:

    • When .1 is converted from a decimal numeral in the source code to a binary floating-point value, the result is not .1 but is 0.1000000000000000055511151231257827021181583404541015625.
    • When .9 is converted from a decimal numeral in the source code to a binary floating-point value, the result is 0.90000000000000002220446049250313080847263336181640625.
    • When the former number, 0.1000000000000000055511151231257827021181583404541015625, is subtracted from 1, we can see that, since it is greater than .1, the result ought to be under .9, something like .8999…. But the result is 0.90000000000000002220446049250313080847263336181640625.

    It just so happens that the rounding errors produced the same result on both sides, so the test for equality evaluated to true.
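
    These values can be checked directly. A short C sketch (an illustrative addition, not part of the original answer, assuming an IEEE-754 binary64 double and a printf that prints correctly rounded decimals) prints the exact stored values:

    #include <stdio.h>

    int main(void) {
      /* Enough digits to show the exact decimal expansion of each double. */
      printf(".1     = %.55f\n", 0.1);
      printf(".9     = %.55f\n", 0.9);
      printf("1 - .1 = %.55f\n", 1 - 0.1);
      printf("equal: %d\n", 1 - 0.1 == 0.9);   /* prints 1 */
      return 0;
    }

    Both .9 and 1 - .1 print the same 0.90000000000000002220… value, which is exactly why the equality test succeeds.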

    In computing 1 - .05 - .05, the rounding errors are:

    • .05 in the source code is converted to 0.05000000000000000277555756156289135105907917022705078125.
    • 1 - .05 has the result 0.9499999999999999555910790149937383830547332763671875.
    • 1 - .05 - .05 has the result 0.899999999999999911182158029987476766109466552734375.

    What has happened here is that, in computing 1 - .05, the result rounded to a nearby representable value slightly less than .95. Then, when subtracting .05, whose stored value we saw is slightly greater than .05, we are subtracting something a bit greater than .05 from something a bit less than .95. This combination of errors pushes the mathematical result far enough under .9 that it is closer to the next lower representable value, 0.899999999999999911182158029987476766109466552734375, than to 0.90000000000000002220446049250313080847263336181640625.
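
    The same kind of sketch (again an illustrative addition, assuming IEEE-754 binary64) makes the intermediate rounding for this case visible:

    #include <stdio.h>

    int main(void) {
      /* Each intermediate result is rounded to the nearest double. */
      printf(".05           = %.56f\n", 0.05);
      printf("1 - .05       = %.55f\n", 1 - 0.05);
      printf("1 - .05 - .05 = %.55f\n", 1 - 0.05 - 0.05);
      printf("equal to .9: %d\n", 1 - 0.05 - 0.05 == 0.9);   /* prints 0 */
      return 0;
    }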

    When you do similar calculations with long double, the rounding errors happen to be different. Viewed this way, long double is no better than double in how often a computed result happens to land on the value you would get with ideal mathematics. It happened to work in your example with the numbers you chose, but other numbers would produce effectively random results, somewhat as if each computation had a random error of up or down one step. If those errors happen to have the same net effect in two different computations, the two results are equal. If they happen to have different net effects, the two results are not equal.
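
    As for overcoming this in practice: the usual remedy (an addition here, not part of the original answer) is to stop testing floating-point results with == and instead compare within a small tolerance. Below is a minimal C sketch with a hypothetical nearly_equal helper; in Matlab the same idea is abs(x - 0.9) < tol:

    #include <math.h>
    #include <stdio.h>

    /* Hypothetical helper: treat a and b as equal when they differ by
       no more than an absolute tolerance chosen for the values' scale. */
    static int nearly_equal(double a, double b, double tol) {
      return fabs(a - b) <= tol;
    }

    int main(void) {
      double x1 = 1 - 0.05 - 0.05;
      double x2 = 1 - 0.1;

      printf("x1 ~ .9:%d, x2 ~ .9:%d\n",
             nearly_equal(x1, 0.9, 1e-9),
             nearly_equal(x2, 0.9, 1e-9));   /* prints 1 for both */
      return 0;
    }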