Search code examples
c++visual-c++floating-pointdoubleunsigned-long-long-int

Bizarre floating-point behavior with vs. without extra variables, why?


When I run the following code in VC++ 2013 (32-bit, no optimizations):

#include <cmath>
#include <iostream>
#include <limits>

double mulpow10(double const value, int const pow10)
{
    static double const table[] =
    {
        1E+000, 1E+001, 1E+002, 1E+003, 1E+004, 1E+005, 1E+006, 1E+007,
        1E+008, 1E+009, 1E+010, 1E+011, 1E+012, 1E+013, 1E+014, 1E+015,
        1E+016, 1E+017, 1E+018, 1E+019,
    };
    return pow10 < 0 ? value / table[-pow10] : value * table[+pow10];
}

int main(void)
{
    double d = 9710908999.008999;
    int j_max = std::numeric_limits<double>::max_digits10;
    while (j_max > 0 && (
        static_cast<double>(
            static_cast<unsigned long long>(
                mulpow10(d, j_max))) != mulpow10(d, j_max)))
    {
        --j_max;
    }
    double x = std::floor(d * 1.0E9);
    unsigned long long y1 = x;
    unsigned long long y2 = std::floor(d * 1.0E9);
    std::cout
        << "x == " << x << std::endl
        << "y1 == " << y1 << std::endl
        << "y2 == " << y2 << std::endl;
}

I get

x  == 9.7109089990089994e+018
y1 == 9710908999008999424
y2 == 9223372036854775808

in the debugger.

I'm mindblown. Can someone please explain to me how the heck y1 and y2 have different values?


Update:

This only seems to happen under /Arch:SSE2 or /Arch:AVX, not /Arch:IA32 or /Arch:SSE.


Solution

  • You are converting out-of-range double values to unsigned long long. This is not allowed in standard C++, and Visual C++ appears to treat it really badly in SSE2 mode: it leaves a number on the FPU stack, eventually overflowing it and making later code that uses the FPU fail in really interesting ways.

    A reduced sample is

    double d = 1E20;
    unsigned long long ull[] = { d, d, d, d, d, d, d, d };
    if (floor(d) != floor(d)) abort();
    

    This aborts if ull has eight or more elements, but passes if it has up to seven.

    The solution is not to convert floating point values to an integer type unless you know that the value is in range.

    4.9 Floating-integral conversions [conv.fpint]

    A prvalue of a floating point type can be converted to a prvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type. [ Note: If the destination type is bool, see 4.12. -- end note ]

    The rule that out-of-range values wrap when converted to an unsigned type only applies if the value as already of some integer type.

    For whatever it's worth, though, this doesn't seem like it's intentional, so even though the standard permits this behaviour, it may still be worth reporting this as a bug.