Search code examples
pythonc++floating-pointieee-754

My Floating-point problem - Trial in C++/Python


In what follows, IEEE-754 Double-precision floating-point format is taken for granted to be used.

Python: "...almost all machines use IEEE 754 binary floating-point arithmetic, and almost all platforms map Python floats to IEEE 754 binary64 'double precision' values."

C++: associates double with IEEE-754 double-precision.

Machine epsilon is s.t. fl(1+epsilon)>1.

Using double precision, formally epsilon = 2^-52, but mostly implemented (because of rounding-to-nearest) as epsilon = 2^-53.

In Python (Spyder), when I do:

i = 1.0 + 2**-53

print(i)

>> 1.0

C++ version:

#include <iostream>
#include <cmath>
#include <iomanip>
    
int main() {
        
    double i = std::pow(2.0, -53);
        
    double j = 1.0 + i;
        
    std::cout << j;
    
    return 0;
        
}

>> 1.0

Furthermore, when I do in Python:

i = 1.0 + 2**-52 + 2**-53

print(i)

>> 1.0000000000000004

C++ version:

#include <iostream>
#include <cmath>
#include <iomanip>
        
int main() {
                
    double i = std::pow(2.0, -52);
                
    double j = 1.0 + i;
                
    double k = j + std::pow(2.0, -53);
                
    std::cout << std::setprecision(17) << k;
        
    return 0;

}
    
>> 1.0000000000000004

where the order of addition is from left to right so that the magic of the non-associativity does not come into play, i.e.

i = 1.0 + 2**-52 + 2**-53 <=> i = (1.0 + 2**-52) + 2**-53.

The weird (to me) thing happens here. Firstly, 1 + 2**-52 is stored (in registers) as exactly the Double Precision format equivalent of 1 + 2*epsilon, i.e.

0 + 01111111111 + 000...01

Moreover, 2**-53 is stored as:

0 + 01111001010 + 000...00

If we write these in the form (mantissa)*(2**exponent), 1 + 2**-52 is (1.0000...01)*(2**0), and 2**-53 is (1.0000...00)*(2**-53), where the mantissa (1.xxxx...xx) length is 53 (including implicit 1 bit). A bitwise addition requires the same exponent, thus I shift the mantissa of the smaller one (i.e. 2**-53) so that it has exponent 0:

(1.0000...00)*(2**-53) -> (0.|52 zeros here|10000...00)*(2**0)

So that the addition is like:

1.0000...010 (51 zero digits after the radix, then a '1' and '0' making a total of 53 digits)
0.0000...001 (52 zero digits after the radix + '1' digit making a total of 53 digits)
+
------------
1.0000...011 (53 digits after the radix)

Now, mantissa should be 53 in total, but it is 54 above, thus it should be rounded.

My question begins here: Is the reason why both of these programming languages give 1.0000000000000004 output when I do 1.0 + 2**-52 + 2**-53 because tie-to-even rule is implemented so that 1.0000...011 is rounded to 1.0000...10, which is esentially 1.0000000000000004 up to 16 digits precision? or is it (completely) something else & I made a mistake in the calculations etc.?

Sorry if it seems as a bit of overkill or overthinking on a simple subject like this, however, it bothers me for days and I could not figure out the reason why or could not verify my thoughts. Any answer and comment is appreciated.


Solution

  • Yes, it's entirely about round-to-nearest/even. Python's float.hex() can show you the bits directly:

    >>> 1.0 + 2**-52 + 2**-53
    1.0000000000000004
    >>> _.hex()
    '0x1.0000000000002p+0'
    

    Although Python and C have little to do with this: it's almost certainly a result of how your CPU/FPU implement float addition (they almost certainly implement 754-style nearest/even rounding in hardware by default). `` Note too that associativity doesn't matter in this specific example. 1.0 + 2**-52 and 2**-52 + 2**-53 are both exactly representable on their own.