Tags: c++, msvcrt

MSVC yields incorrect square root for std::uint64_t at and around max. Bug or expected?


Question: While I was writing some unit tests for integer pairing functions, I noticed a "bug" that keeps cropping up. Both std::sqrt and std::sqrtl yield the same incorrect results near the max std::uint64_t values. Is this simply to be expected because of floating-point rounding error, or is it a bug in Microsoft's compiler? If it is to be expected, is there a way to circumvent the issue without resorting to 128-bit data or using an iterative algorithm to increase precision? Maybe a compiler flag I should be aware of?

Example: With these pairing functions, to unpair an input you have to take the floor of the square root of the input, which I cannot compute correctly at and around the max values. I've abstracted the issue into a simple example for demonstration:

    auto max = std::numeric_limits<std::uint64_t>::max(); //18446744073709551615
    auto square_root = std::floorl(std::sqrtl(max));       //4294967296.0000000
    auto max2 = max - 1;
    auto square_root2 = std::floorl(std::sqrtl(max2));     //4294967296.0000000
    auto max3 = max - 2;
    auto square_root3 = std::floorl(std::sqrtl(max3));     //4294967296.0000000
    auto max4 = max - 5;
    auto square_root4 = std::floorl(std::sqrtl(max4));     //4294967296.0000000
    auto max5 = max / 10;
    auto square_root5 = std::floorl(std::sqrtl(max5));     //1358187913.0000000

According to WolframAlpha, the correct values for the first four square roots by themselves are as follows:

//(respectively)
4294967295.99999999988358467817306518554529727818955797
4294967295.99999999976716935634613037108743911275823190
4294967295.99999999965075403451919555662642550370602178
4294967295.99999999930150806903839111322445201482408715

//To clarify, the floor of 4294967295.99999999930 should be 
//4294967295.0, where the code produces
//4294967296.0
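
For reference, the expected floor can also be confirmed without any floating point; here is a minimal integer-only sketch (not part of my actual unit tests, just an illustration):

    #include <cstdint>
    #include <iostream>
    #include <limits>

    int main()
    {
        auto max = std::numeric_limits<std::uint64_t>::max(); //18446744073709551615
        std::uint64_t r = 4294967295ULL;                      //candidate floor(sqrt(max))

        //r*r == 18446744065119617025, which fits in 64 bits and is <= max,
        //while (r+1)^2 == 2^64 would exceed max (checked via division to avoid overflow)
        std::cout << std::boolalpha
            << (r * r <= max) << "\n"              //true: 4294967295 is not too large
            << ((r + 1) > max / (r + 1)) << "\n";  //true: 4294967296 is too large
    }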

Notes:

  • I should say that I don't anticipate this causing a problem for pairing; this is more of a conceptual question, and I'll file a bug if this is not expected behavior.
  • I'm using Visual Studio 2017 and see this behavior in both x86 and x64 builds

  • More information on pairing functions at: http://szudzik.com/ElegantPairing.pdf (though I'm not looking for a different solution)

Extra information

Thank you to @Phil1970 for the concise answer and explanation. In addition to his answer, I'd like to share an interesting excerpt I found in the article shared by @HongOoi, as it relates to the question I posted:

Brown [1981] has proposed axioms for floating-point that include most of the existing floating-point hardware. However, proofs in this system cannot verify the algorithms of sections Cancellation and Exactly Rounded Operations, which require features not present on all hardware. Furthermore, Brown's axioms are more complex than simply defining operations to be performed exactly and then rounded. Thus proving theorems from Brown's axioms is usually more difficult than proving them assuming operations are exactly rounded.

There is not complete agreement on what operations a floating-point standard should cover. In addition to the basic operations +, -, × and /, the IEEE standard also specifies that square root, remainder, and conversion between integer and floating-point be correctly rounded. It also requires that conversion between internal formats and decimal be correctly rounded (except for very large numbers).

while "correctly rounded (except for very large numbers)" sounds a bit dubious to me, implying that correct rounding is not required by the floating-point specification for large numbers, I believe the behavior experienced in the example is indeed expected. The take away here is that floating-point error and rounding is not limited to subtle nuances in the mantessa working with small numbers past the decimal. It is also important to be aware of the (in)accuracy of large numbers with implicit conversion to and from integer data types.


Solution

  • A std::uint64_t is 64 bits

    In the case of Visual Studio, a double or a long double is also 64 bits.

    A double in IEEE 754 format has 53 bits for the mantissa: https://en.wikipedia.org/wiki/Double-precision_floating-point_format.

    64 - 53 bits = 11 bits. 11 bits all set to 1 give 2047. Since numbers are rounded to the nearest double, we get 18446744073709551616 for any number between max - 1023 and max inclusive, and we get 18446744073709549568 for numbers between max - 3070 and max - 1024 inclusive. The difference between these two numbers is 2048 (2^11), so a double with a value around 2^64 is only precise to ±2048.

    MSVC uses a 64-bit long double. Other compilers might use 80 bits or even 128: https://en.wikipedia.org/wiki/Long_double.

    I used the following code to test a few offsets from max:

    #include <cstdint>
    #include <iomanip>
    #include <iostream>
    #include <limits>

    void test(std::uint64_t offset)
    {
        auto max = std::numeric_limits<std::uint64_t>::max(); //18446744073709551615

        //Print max - offset after conversion to double, with enough digits to show the rounding.
        std::cout << "max - " << std::setw(10) << offset << " : "
            << std::setprecision(20) << static_cast<double>(max - offset) << "\n";
    }
    

    Here is the output for some numbers:

    max -          0 : 18446744073709551616
    max -       1023 : 18446744073709551616
    max -       1024 : 18446744073709549568
    max -       2047 : 18446744073709549568
    max -       2048 : 18446744073709549568
    max -       3070 : 18446744073709549568
    max -       3071 : 18446744073709547520
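
    As a further illustration of the ±2048 granularity, a minimal sketch (assuming IEEE 754 doubles) can query the gap between adjacent doubles just below 2^64 with std::nextafter:

    #include <cmath>
    #include <iomanip>
    #include <iostream>

    int main()
    {
        double top = 18446744073709551616.0;      //2^64, the double nearest to uint64_t max
        double below = std::nextafter(top, 0.0);  //largest double smaller than 2^64

        std::cout << std::setprecision(20) << below << "\n";         //18446744073709549568
        std::cout << std::setprecision(20) << (top - below) << "\n"; //2048
    }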
    

    Added information

    Yes, it is expected that converting an integer value will lose some precision if it requires more bits than what the mantissa of the floating-point number allows. Usually, with double this is not a problem for typical applications, as one needs a number bigger than 2^53 (9.007e15) to start losing precision when converting to an IEEE 754 double.

    I don't know if conversion from an integer to a double needs to be rounded or if it is implementation-defined. In the opposite direction, it is always truncated. When converting text to a double (either at compile time or run time), I believe it is always rounded to the nearest representable value.

    Having said that, the result before rounding for a square root like 4294967295.99999999930150806903839111322445201482408715 cannot be represented in a double with enough precision. It would require 20 significant digits to represent 4294967295.9999999993, and a 64-bit double has about 15 significant digits.
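
    These digit counts can be checked directly from std::numeric_limits; a minimal sketch, assuming an IEEE 754 double:

    #include <iostream>
    #include <limits>

    int main()
    {
        //Decimal digits a double can hold reliably, and digits needed to
        //round-trip any double exactly (15 and 17 for IEEE 754 doubles).
        std::cout << std::numeric_limits<double>::digits10 << "\n";     //15
        std::cout << std::numeric_limits<double>::max_digits10 << "\n"; //17
    }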

    So what can you do?

    • Ignore the problem if you don't need to handle that much precision. Use a number around 2^50 for your max in unit testing.
    • Use another compiler like Intel or gcc that supports larger floating-point types.
    • Use a library specifically made for large precision floating point computation.
    • Estimate the square root using some formula and possibly refine it with a few iterations.
    • Convert the number back to an integer, check that it is less than 2^32, and if so square it; if the square is above the original number, subtract one. (A sketch combining these last two ideas follows below.)
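
    For the last two options, a minimal sketch of the idea (my own variant, assuming only 64-bit integers and IEEE 754 doubles) is to take the floating-point estimate, clamp it to 2^32 - 1 so squaring cannot overflow, and then adjust it with exact integer comparisons:

    #include <cmath>
    #include <cstdint>

    //Floor of the integer square root of n, correcting the floating-point
    //estimate with exact integer arithmetic.
    std::uint64_t floor_sqrt(std::uint64_t n)
    {
        //Estimate; may be off by one near 2^64 because of double rounding.
        auto r = static_cast<std::uint64_t>(std::sqrt(static_cast<double>(n)));

        //floor(sqrt(2^64 - 1)) == 2^32 - 1, so clamp to keep r * r in range.
        if (r > 0xFFFFFFFFULL)
            r = 0xFFFFFFFFULL;

        //Adjust downwards if the estimate was too large...
        while (r > 0 && r * r > n)
            --r;
        //...or upwards if it was too small ((r + 1)^2 cannot overflow here).
        while (r < 0xFFFFFFFFULL && (r + 1) * (r + 1) <= n)
            ++r;

        return r;
    }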