c++, floating-point, precision

Convert max integer to floating point without precision loss


I have the following snippet, which I used to test whether precision is lost when I convert the maximum 64-bit unsigned integer to a double:

#include <cstdint>
#include <limits>
#include <iostream>
#include <iomanip>

int main () noexcept
{
    uint64_t ui64{std::numeric_limits<uint64_t>::max()};
    constexpr auto max_precision{std::numeric_limits<long double>::digits10 + 1}; 
    std::cout << "ui64 " << std::setprecision(max_precision) << std::boolalpha << ui64 << "\n\n";

    double f64 = static_cast<double>(ui64);
    uint64_t ui64_cast_back = static_cast<uint64_t>(f64);
    std::cout << "sizeof(f64): " << sizeof(double) << std::endl;
    std::cout << "f64 = " << f64 << std::endl;
    std::cout << "ui64_cast_back matches original value? " << (ui64_cast_back == ui64) << std::endl;
    std::cout << "ui64_cast_back = " << ui64_cast_back << std::endl;
}

When I build this for the custom platform of my project (not available on Compiler Explorer) I get the following output:

ui64 18446744073709551615

sizeof(f64): 8
f64 = 18446744073709551616
ui64_cast_back matches original value? true
ui64_cast_back = 18446744073709551615

The off-by-one value printed for the double seems to suggest that precision was lost. Yet when casting back, the original value is retrieved. Is the off-by-one caused by the IO streams implementation when printing, or can the printed value be considered proof of precision loss?


Solution

  • It looks like double on your platform is binary64.

    There is no 18446744073709551615 in that format. The two values closest are:

    18446744073709549568  // The double below
    18446744073709551615  // Your integer (not exactly representable)
    18446744073709551616  // The double above
    

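    So double f64 = static_cast<double>(ui64); rounds the value to the closest double (the one above). There is no way to prevent this "precision loss", because the value simply cannot be represented. (Consider that there are 2^64 uint64_t values and at most 2^64 double bit patterns, many of which encode negative numbers, fractions, infinities and NaNs, so not every uint64_t value can have a double of its own.)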

    You will get the same results for std::numeric_limits<uint64_t>::max()-1, std::numeric_limits<uint64_t>::max()-2, and so on, because they also round to the same double.

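    When you cast back, you have undefined behaviour, because the double (now 2^64) is too large to fit in a uint64_t. Your machine seems to simply return the largest uint64_t when the double is too large. That is why the round trip appears to recover the original value even though the conversion to double was not exact.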