Tags: c++, visual-c++, c++11, type-conversion, numeric-limits

How do I properly check whether a `double` or `long double` fits into a `long long`?


It seems quite simple at first thought, but I couldn't find a good description covering this case.

I have a method that returns a 64-bit value. The value is calculated internally using long double values. At the end of the method, I would like to check whether the long double is within the range of long long and, if it is not, return the maximum long long value instead.

I use the following code, which only checks the upper bound because the result is never negative:

#include <cmath>   // std::floor
#include <limits>  // std::numeric_limits

long long calculateSomething()
{
    long double calculatedValue = ...;

    long long result;
    if (calculatedValue > static_cast<long double>(std::numeric_limits<long long>::max())) {
        result = std::numeric_limits<long long>::max();
    } else {
        result = static_cast<long long>(std::floor(calculatedValue));
    }

    return result;
}

Now I wonder: long double can have the same representation as double (as it does in Visual C++, for example). Will the conversion static_cast<long double>(std::numeric_limits<long long>::max()) always work correctly?

Or is there a better way to check the range?


Solution

  • Conversions between floating-point and integral types are specified in §4.9 [conv.fpint] of the C++ standard:

    1 A prvalue of a floating point type can be converted to a prvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type. [ Note: If the destination type is bool, see 4.12. —end note ]

    2 A prvalue of an integer type or of an unscoped enumeration type can be converted to a prvalue of a floating point type. The result is exact if possible. If the value being converted is in the range of values that can be represented but the value cannot be represented exactly, it is an implementation-defined choice of either the next lower or higher representable value. [ Note: Loss of precision occurs if the integral value cannot be represented exactly as a value of the floating type. —end note ] If the value being converted is outside the range of values that can be represented, the behavior is undefined. If the source type is bool, the value false is converted to zero and the value true is converted to one.

    A typical long long is 64 bits, so std::numeric_limits<long long>::max() is 2^63 - 1. This is much smaller than the smallest value the standard permits for LDBL_MAX, which is 1E+37, so we are safely within the representable range of a long double. However, if long double is only 64 bits wide (the same as double, with 53 bits of precision in the usual IEEE 754 format), 2^63 - 1 cannot be represented exactly, and you run into trouble because the standard says that the result is "an implementation-defined choice of either the next lower or higher representable value". In other words, it can go either way, and there you have a problem.

    If the compiler picked the next lower representable value for the conversion, then all is well. Even if calculatedValue == static_cast<long double>(std::numeric_limits<long long>::max()), it's still in the representable range of long long and the conversion is well-defined.

    If the compiler picked the next higher representable value for the conversion (and round-to-nearest, the typical rounding used, will likely go this way since 2^63 is exactly representable), and calculatedValue == static_cast<long double>(std::numeric_limits<long long>::max()), then calculatedValue is actually outside the representable range of a long long, but in your code you still try to cast it to a long long. So, per the first paragraph above, you have undefined behavior. Ouch.

    The simplest fix is to test for calculatedValue >= static_cast<long double>(std::numeric_limits<long long>::max()) instead of calculatedValue > static_cast<long double>(std::numeric_limits<long long>::max()). In the unlikely event that the compiler rounds down, you'll clamp one value that would have converted fine, which is harmless because that value is already the largest one that fits. Another possible fix is to test for calculatedValue >= static_cast<long double>(std::numeric_limits<long long>::max() + 1ULL). Here max() + 1ULL is computed as an unsigned long long, so it does not overflow, and the test takes advantage of the fact that 2^n is exactly representable in floating point for any reasonable n.
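
As a rough sketch, the second fix could be wrapped in a small helper like the one below. The name clampToLongLong is made up for illustration and is not from the question. The sketch assumes that long long is two's complement, so that max() + 1 is a power of two, and that the floating-point format is binary, so that this power of two converts exactly. The lower-bound and NaN checks are extras the question did not ask for; returning 0 for NaN is an arbitrary choice.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper (not from the original post): converts a long double to
// long long, saturating at the limits so the cast is never undefined behavior.
long long clampToLongLong(long double value)
{
    using Limits = std::numeric_limits<long long>;

    // max() + 1ULL is evaluated as unsigned long long, so it cannot overflow.
    // Assumption: the result is a power of two (2^63 for a 64-bit long long)
    // and therefore converts to long double exactly.
    const long double upper = static_cast<long double>(Limits::max() + 1ULL);

    // min() is -2^63 on two's complement targets, also exactly representable.
    const long double lower = static_cast<long double>(Limits::min());

    if (std::isnan(value)) {
        return 0;               // arbitrary choice for NaN input
    }
    if (value >= upper) {
        return Limits::max();   // too large (includes +infinity)
    }
    if (value <= lower) {
        return Limits::min();   // too small (includes -infinity)
    }

    // Here lower < value < upper, so floor(value) fits into long long.
    return static_cast<long long>(std::floor(value));
}
```

In calculateSomething() the whole if/else would then shrink to return clampToLongLong(calculatedValue);. The same comparison works when long double is only as wide as double, because the bound it compares against is 2^63 rather than 2^63 - 1.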