
How does a computer do floating-point arithmetic?


I have seen long articles explaining how floating point numbers are stored and how arithmetic on them is done, but please explain briefly why, when I write

cout << 1.0 / 3.0 << endl;

I see 0.333333, but when I write

cout << 1.0 / 3.0 + 1.0 / 3.0 + 1.0 / 3.0 << endl;

I see 1.

How does the computer do this? Please explain just this simple example; that's enough for me.


Solution

  • The problem is that the floating point format represents fractions in base 2.

    The first fraction bit is ½, the second ¼, and it goes on: the nth bit has a weight of 1/2ⁿ.

    The catch is that not every rational number (a number that can be expressed as the ratio of two integers) has a finite representation in this base-2 format.
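
    For instance, 1/3 in base 2 is 0.010101…, repeating forever, just as 1/3 in base 10 is 0.333…. Here's a quick illustrative sketch in standard C++ that peels the base-2 digits off one at a time; doubling shifts the fraction left one binary place, so the integer part that falls out is the next digit:

      #include <iostream>

      int main() {
          // Peel off the first 16 base-2 fraction digits of 1/3.
          // Digit i has weight 1/2^i; because 1/3 has no finite base-2
          // expansion, the digits repeat forever.
          double x = 1.0 / 3.0;
          for (int i = 1; i <= 16; ++i) {
              x *= 2.0;                       // shift one binary place left
              int bit = static_cast<int>(x);  // the integer part is the next digit
              std::cout << bit;
              x -= bit;
          }
          std::cout << '\n';                  // prints 0101010101010101
      }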

    (This makes the floating point format difficult to use for monetary values. Although these values are always rational numbers (n/100), only .00, .25, .50, and .75 actually have exact representations in any number of digits of a base-2 fraction.)
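
    You can see the approximation directly by printing with more than the default six significant digits (which is also why the question's first line shows exactly 0.333333). A small demo in standard C++:

      #include <iomanip>
      #include <iostream>

      int main() {
          std::cout << std::setprecision(17);  // 17 significant digits distinguish any double
          std::cout << 1.0 / 3.0 << '\n';      // 0.33333333333333331
          std::cout << 0.10 << '\n';           // 0.10000000000000001  (inexact)
          std::cout << 0.25 << '\n';           // 0.25                 (exact: 1/4)
      }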

    Anyway, when you add the thirds back together, the system eventually gets a chance to round the result to a number that it can represent exactly.

    At some point, it finds itself adding the .666... number to the .333... one, like so:

      00111110 1  .o10101010 10101010 10101011
    + 00111111 0  .10101010 10101010 10101011o
    ------------------------------------------
      00111111 1 (1).0000000 00000000 0000000x  # the x isn't in the final result
    

    The leftmost bit is the sign, the next eight are the exponent, and the remaining bits are the fraction. In between the exponent and the fraction is an assumed "1" that is always present, and therefore not actually stored, as the normalized leftmost fraction bit. Zeroes that aren't actually present as individual bits are written as o.

    A lot has happened here: at each step, the FPU has taken rather heroic measures to round the result. Two extra digits of precision (beyond what will fit in the result) are kept, and in many cases the FPU also knows whether any of the remaining rightmost bits were one. If so, that part of the fraction is more than 0.5 (scaled), and so it rounds up. The intermediate rounded values allow the FPU to carry the rightmost bit all the way over to the integer part and finally round to the correct answer.
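
    To see those fields for yourself, you can reinterpret a float's storage as its raw bits. A sketch of that idea, assuming a C++20 compiler for std::bit_cast (the dump helper is just an illustrative name):

      #include <bit>
      #include <bitset>
      #include <cstdint>
      #include <iostream>

      // Print a float as sign | exponent | fraction, matching the layout above.
      void dump(float f) {
          std::uint32_t u = std::bit_cast<std::uint32_t>(f);  // raw IEEE-754 bits
          std::cout << (u >> 31) << ' '                // sign bit
                    << std::bitset<8>(u >> 23) << ' '  // 8 exponent bits
                    << std::bitset<23>(u)              // 23 stored fraction bits
                    << "  = " << f << '\n';
      }

      int main() {
          float third = 1.0f / 3.0f;
          dump(third);                  // 0 01111101 01010101010101010101011
          dump(third + third);          // 0 01111110 01010101010101010101011
          dump(third + third + third);  // 0 01111111 00000000000000000000000 (exactly 1)
      }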

    This didn't happen because anyone added 0.5; the FPU just did the best it could within the limitations of the format. Floating point is not, actually, inaccurate: it's perfectly accurate within its format, but most of the numbers we expect to see in our base-10, rational-number world view are not representable by the base-2 fraction of the format. In fact, very few are.
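
    The question's sum is one of the lucky cases where the rounding lands exactly on a representable number; it's easy to find sums where it doesn't. A last sketch, assuming ordinary IEEE-754 double arithmetic:

      #include <iostream>

      int main() {
          std::cout << std::boolalpha;
          // Both sums are correctly rounded at every step, but only the
          // first happens to round to an exactly representable result.
          std::cout << (1.0 / 3.0 + 1.0 / 3.0 + 1.0 / 3.0 == 1.0) << '\n';  // true
          std::cout << (0.1 + 0.2 == 0.3) << '\n';                          // false
      }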