Question: While I was writing some unit tests for integer pairing functions, I noticed a "bug" that keeps cropping up. Both std::sqrt and std::sqrtl yield the same incorrect results near the max std::uint64_t values. Is this is simply to be expected because of floating point rounding error or a bug in microsoft's compiler? If it is to be expected, then is there a way to circumvent this issue without resorting to 128 bit data or using an iterative algorithm to increase precision? Maybe a compiler flag I should be aware of?
Example: With these pairing functions, to unpair an input you have to get the floor of the square root of the input, which I cannot get around the max values. I've abstracted the issue into a simple example for demonstration:
auto max = std::numeric_limits<std::uint64_t>::max(); //18446744073709551615
auto square_root = std::floorl(std::sqrtl(max)); //4294967296.0000000
auto max2 = max - 1;
auto square_root2 = std::floorl(std::sqrtl(max2)); //4294967296.0000000
auto max3 = max - 2;
auto square_root3 = std::floorl(std::sqrtl(max3)); //4294967296.0000000
auto max4 = max - 5;
auto square_root4 = std::floorl(std::sqrtl(max4)); //4294967296.0000000
auto max5 = max / 10;
auto square_root5 = std::floorl(std::sqrtl(max5)); //1358187913.0000000
According to wolframalpha, the correct values for the first 4 square roots by themselves is as follows
//(respectively)
4294967295.99999999988358467817306518554529727818955797
4294967295.99999999976716935634613037108743911275823190
4294967295.99999999965075403451919555662642550370602178
4294967295.99999999930150806903839111322445201482408715
//To clarify, the floor of 4294967295.99999999930 should be
//4294967295.0, where the code produces
//4294967296.0
Notes:
I'm using visual studio 2017 and see this behavior in both x86 and x64 builds
More information on pairing functions at: http://szudzik.com/ElegantPairing.pdf (though I'm not looking for a different solution)
Extra information
Thank you to @Phil1970 for the concise answer and explanation- In addition to his answer I'd like to share an interesting excerpt I found from the article shared by @HongOoi as it relates to the question I posted
Brown [1981] has proposed axioms for floating-point that include most of the existing floating-point hardware. However, proofs in this system cannot verify the algorithms of sections Cancellation and Exactly Rounded Operations, which require features not present on all hardware. Furthermore, Brown's axioms are more complex than simply defining operations to be performed exactly and then rounded. Thus proving theorems from Brown's axioms is usually more difficult than proving them assuming operations are exactly rounded.
There is not complete agreement on what operations a floating-point standard should cover. In addition to the basic operations +, -, × and /, the IEEE standard also specifies that square root, remainder, and conversion between integer and floating-point be correctly rounded. It also requires that conversion between internal formats and decimal be correctly rounded (except for very large numbers).
while "correctly rounded (except for very large numbers)" sounds a bit dubious to me, implying that correct rounding is not required by the floating-point specification for large numbers, I believe the behavior experienced in the example is indeed expected. The take away here is that floating-point error and rounding is not limited to subtle nuances in the mantessa working with small numbers past the decimal. It is also important to be aware of the (in)accuracy of large numbers with implicit conversion to and from integer data types.
A std::uint64_t
is 64 bits
In the case of Visual Studio a double
or a long double
is also 64 bit.
A double in IEE 754 format has 53 bit for the mantissa: https://en.wikipedia.org/wiki/Double-precision_floating-point_format.
64 - 53 bit = 11 bit. 11 bit consisting of 1 is 2047. Since number are rounded to nearest double, we get 18446744073709551616
for any number between max - 1023
to max
inclusively. And we get 18446744073709549568
for numbers between max - 3070
and max - 1024
inclusively. The difference between these 2 numbers is 2048 (2^11). So a double with a value around 2^64 is precise to ±2048.
MSVC use 64 bit long double. Other compiler might use 80 bit or even 128: https://en.wikipedia.org/wiki/Long_double.
I used the following code to test a few offset from max
void test(std::uint64_t offset)
{
auto max = std::numeric_limits<std::uint64_t>::max(); //18446744073709551615
std::cout << "max - " << std::setw(10) << offset << " : "
<< std::setprecision(20) << static_cast<double>(max - offset) << "\n";
}
Here is the output for some numbers:
max - 0 : 18446744073709551616
max - 1023 : 18446744073709551616
max - 1024 : 18446744073709549568
max - 2047 : 18446744073709549568
max - 2048 : 18446744073709549568
max - 3070 : 18446744073709549568
max - 3071 : 18446744073709547520
Added information
Yes it is expected that converting an integer value will loose some precision if it required more bits that what is allowed by the mantissa of the floating point number. Usually, with double
it is not a problem for usual application as one nee a number bigger than 2^53 (9.007e15) to start losing precision when converting to an IEEE 754 double.
I don't know if conversion from an integer to a double need to be rounded or if it is implementation defined. In the opposite direction, it is always truncated. Where converting text to an double (either at compile-time or run-time), I believe it is always rounded to the nearest representable value.
Having said that, the result before rounding for the square root like 4294967295.99999999930150806903839111322445201482408715
cannot be represented in a double with enough precision. It would require 20 significant digits to represent 4294967295.9999999993
and a 64 bit double has about 15 significatif digits.
So what can you do?