Tags: c++, floating-point, sse, glibc, hypotenuse

Libc hypot function seems to return incorrect results for double type... why?


#include <cmath>
#include <iostream>
#include <limits>

int main() {

        #define NUM1 -0.031679909079365576
        #define NUM2 -0.11491794452567111

        std::cout << "double precision :" << std::endl;
        typedef std::numeric_limits<double> dbl;
        std::cout.precision(dbl::max_digits10);
        std::cout << std::hypot((double)NUM1, (double)NUM2);
        std::cout << " VS sqrt :" << std::sqrt((double)NUM1 * (double)NUM1
                                  + (double)NUM2 * (double)NUM2) << std::endl;

        std::cout << "long double precision :" << std::endl;
        typedef std::numeric_limits<long double> ldbl;
        std::cout.precision(ldbl::max_digits10);
        std::cout << std::hypot((long double)NUM1, (long double)NUM2);
        std::cout << " VS sqrt :" << std::sqrt((long double)NUM1 * (long double)NUM1
                                  + (long double)NUM2 * (long double)NUM2) << std::endl;
}

Returns under Linux (Ubuntu 18.04, clang or gcc, any optimisation level, glibc 2.25):

double precision : 0.1192046585217293 VS sqrt :0.11920465852172932

long double precision : 0.119204658521729311251 VS sqrt :0.119204658521729311251

According to cppreference:

    • Implementations usually guarantee precision of less than 1 ulp (units in the last place): GNU, BSD, Open64
    • std::hypot(x, y) is equivalent to std::abs(std::complex<double>(x, y))
    • POSIX specifies that underflow may only occur when both arguments are subnormal and the correct result is also subnormal (this forbids naive implementations)

So hypot((double)NUM1, (double)NUM2) should return 0.11920465852172932, I suppose (as the naive sqrt implementation does). On Windows, using 64-bit MSVC, this is the case.

Why do we see this difference with glibc? How can this inconsistency be resolved?


Solution

    • 0.11920465852172932 is represented by 0x1.e84324de1b576p-4 (as a double)
    • 0.11920465852172930 is represented by 0x1.e84324de1b575p-4 (as a double)
    • 0.119204658521729311251 is the long-double result, which we can assume is correct to a couple more decimal places; i.e. the exact result is closer to the rounded-up value, 0.11920465852172932.

    Those FP bit-patterns differ only in the low bit of the mantissa (aka significand), and the exact result is between them. So they each have less than 1 ulp of rounding error, achieving what typical implementations (including glibc) aim for.

    Unlike IEEE-754 "basic" operations (add/sub/mul/div/sqrt), hypot is not required to be "correctly rounded", i.e. to have at most 0.5 ulp of error. Achieving that would be much slower for operations the HW doesn't provide directly (e.g. doing an extended-precision calculation with at least a couple of extra definitely-correct bits, so you can round to the nearest double, like the hardware does for basic operations).

    It happens that in this case, the naive calculation method produced the correctly-rounded result while glibc's "safe" implementation of std::hypot (that has to avoid underflow when squaring small numbers before adding) produced a result with >0.5 but <1 ulp of error.


    You didn't specify whether you were using MSVC in 32-bit mode.

    Presumably 32-bit mode would use x87 for FP math, giving extra temporary precision. However, some MSVC versions' CRT code sets the x87 FPU's internal precision to round to a 53-bit mantissa after every operation, so it behaves like SSE2 using actual double, except with a wider exponent range. See Bruce Dawson's blog post.

    So I don't know if there's any reason beyond luck that MSVC's std::hypot got the correctly-rounded result for this.

    Note that long double in MSVC is the same type as 64-bit double; that C++ implementation doesn't expose x86 / x86-64's 80-bit hardware extended-precision type (64-bit significand).