Tags: c, floating-point, integer, overflow, underflow

Dealing with overflow and underflow in integer and floating-point multiplication


I'm dealing with a problem where I have a huge collection of 16-bit signed integers. I must multiply each by a constant factor of type double and convert the result back to the integer type. I need to check for both overflow and underflow. When either occurs, the result should be set to the corresponding integer limit, and a warning should be printed informing the user about the problem.

I came up with a solution using C but didn't like it too much. I'm also questioning the efficiency of the check. Basically, I'm looking for a "best practice" approach.

int16_t *sample = malloc(2);

for (unsigned long i = 1; fread(sample, 2, 1, inputf); i++)
{
    // Check overflow/underflow
    if (fabs(INT16_MAX / (double) *sample) < fabs(factor))
    {
        if (copysign(1.0, factor) * *sample > 0)
        {
            printf("Overflow in sample #%lu\n", i);
            *sample = INT16_MAX;
        }
        else
        {
            printf("Underflow in sample #%lu\n", i);
            *sample = INT16_MIN;
        }
    }
    else
    {
        *sample = (int16_t) (*sample * factor);
    }
    /* Do stuff with sample */
}

free(sample);

I have already considered saturation/wrapping of integers, but according to this question, relying on saturation or wrapping is considered undefined behavior for signed integers. I also came across this question, but gosh.


Solution

    16-Bit Integer

    The range for int16_t is [−32,768, +32,767], since it is a 16-bit two’s complement type (these limits are specified in C 2018 7.20.2.1 1). When a floating-point value is converted to an integer type, the fractional part is discarded, that is, the value is truncated toward zero (6.3.1.4 1). So all values in the open interval (−32,769, +32,768) have defined results (they do not overflow), while values outside that interval overflow, and the C standard does not define the behavior.
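
    As a small illustration of that interval, the check can be written as a predicate. This is only a sketch: fits_int16 is a hypothetical name, not something from the question, and it uses double for the comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if converting x to int16_t has a defined result,
   that is, if x lies in the open interval (-32769, +32768).
   NaN fails both comparisons, so it is reported as not convertible. */
static bool fits_int16(double x)
{
    return x > INT16_MIN - 1.0 && x < INT16_MAX + 1.0;
}
```

    For example, fits_int16(32767.9) is true and the conversion yields 32767, while fits_int16(32768.0) is false.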

    The C standard’s requirements for float are such that all integers from −32,769 to +32,768 can be represented. Per 5.2.4.2.2 14, the spacing between representable numbers in a neighborhood of 1 is at most 10⁻⁵, so the spacing in the neighborhood of 32,769 is at most 32,769 × 10⁻⁵ = 0.32769. Further, 5.2.4.2.2 13 tells us this is within the range of the float format (at least 10³⁷). So integers up to 32,769 in magnitude are representable. The float values are a subset of the double values per 6.2.5 10.
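
    The claim can also be checked empirically on a given implementation. The sketch below is not a proof, and float_covers_int16_bounds is a hypothetical name; it simply round-trips every integer in the relevant range through float.

```c
#include <float.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: true if every integer from -32769 to +32768
   converts to float and back without changing value. */
static bool float_covers_int16_bounds(void)
{
    for (int32_t n = -32769; n <= 32768; n++) {
        float f = (float) n;      /* int to float conversion */
        if ((int32_t) f != n)     /* value is in range of int32_t, so this is defined */
            return false;
    }
    return true;
}
```

    On a conforming implementation this should return true, and FLT_EPSILON from <float.h> should be no greater than 1E-5.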

    Therefore, we can perform the desired scaling, test, and conversion with:

    float t = *sample * factor;
    if (t <= INT16_MIN - 1.f)
    {
        fprintf(stderr, "Warning, underflow in sample #%lu.\n", i);
        *sample = INT16_MIN;
    }
    else if (INT16_MAX + 1.f <= t)
    {
        fprintf(stderr, "Warning, overflow in sample #%lu.\n", i);
        *sample = INT16_MAX;
    }
    else
        *sample = t;
    

    The above uses float since it suffices for range, but double may be used if more precision is desired in the representation of factor or the product.
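
    For reference, here is how the double variant might look when wrapped in a function. It is a sketch under assumptions: scale_sample and its return convention are hypothetical, and factor is assumed to be finite (a NaN product would pass both range tests and reach the conversion, which is not defined).

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: scales *sample by factor, saturating at the int16_t limits.
   Returns -1 on underflow, +1 on overflow, 0 otherwise.
   Assumes factor is finite, so the product is never NaN. */
static int scale_sample(int16_t *sample, double factor, unsigned long i)
{
    double t = *sample * factor;

    if (t <= INT16_MIN - 1.0) {            /* t <= -32769: below the convertible range */
        fprintf(stderr, "Warning, underflow in sample #%lu.\n", i);
        *sample = INT16_MIN;
        return -1;
    }
    if (INT16_MAX + 1.0 <= t) {            /* t >= +32768: above the convertible range */
        fprintf(stderr, "Warning, overflow in sample #%lu.\n", i);
        *sample = INT16_MAX;
        return 1;
    }
    *sample = (int16_t) t;                 /* defined: truncates toward zero */
    return 0;
}
```

    Inside the question's loop this would be called as scale_sample(sample, factor, i), with the return value available if the caller wants to count the clipped samples.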

    Wider Integer Types

    The IEEE-754 binary64 format commonly used for double is sufficient for a 32-bit integer type, as its 53-bit significand ensures all integers up to 2⁵³ in magnitude can be represented.
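
    Under that assumption, the same pattern carries over to int32_t. The following is a sketch only: scale_to_int32 and the status parameter are hypothetical, the static assertion makes the binary64-like precision requirement explicit, and factor is again assumed to be finite.

```c
#include <float.h>
#include <stdint.h>

/* Refuse to compile unless double is binary with at least a 53-bit significand. */
_Static_assert(FLT_RADIX == 2 && DBL_MANT_DIG >= 53,
               "double cannot exactly represent the bounds used below");

/* Hypothetical helper: returns value * factor converted to int32_t, saturating
   at the limits. *status is -1 on underflow, +1 on overflow, 0 otherwise.
   Assumes factor is finite, so the product is never NaN. */
static int32_t scale_to_int32(int32_t value, double factor, int *status)
{
    double t = value * factor;

    if (t <= INT32_MIN - 1.0) {    /* -2147483649.0, exact with 53 bits */
        *status = -1;
        return INT32_MIN;
    }
    if (INT32_MAX + 1.0 <= t) {    /* 2147483648.0, exact with 53 bits */
        *status = 1;
        return INT32_MAX;
    }
    *status = 0;
    return (int32_t) t;            /* defined: truncates toward zero */
}
```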

    However, the C standard alone does not guarantee this. It only guarantees a spacing of at most 10⁻⁹ in the neighborhood of 1 for double, and a 32-bit int may have values up to 2,147,483,647, where that spacing could be as large as about 2.147. This is insufficient to guarantee that the test may be performed in the same way as above. For general portable conversion of floating-point values to integer types, this answer provides safe code.
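
    One commonly used way to avoid relying on bounds such as INT32_MAX + 1.0 being exact is to compare only against powers of two. The sketch below illustrates that idea and is not necessarily what the linked answer does; clamp_to_int32 is a hypothetical name, and a binary floating-point format is assumed.

```c
#include <stdint.h>

/* Hypothetical helper: converts x to int32_t, saturating at the limits.
   *status is -1 on underflow, +1 on overflow (or NaN), 0 otherwise.
   Assumes a binary floating-point format, where powers of two are exact. */
static int32_t clamp_to_int32(double x, int *status)
{
    /* INT32_MAX / 2 + 1 is 2^30 in integer arithmetic; doubling gives 2^31 exactly. */
    const double max_plus_1 = 2.0 * (INT32_MAX / 2 + 1);

    if (!(x < max_plus_1)) {           /* x >= 2^31; NaN also lands here */
        *status = 1;
        return INT32_MAX;
    }
    /* INT32_MIN is -2^31, also exact. Subtracting it avoids forming -2^31 - 1. */
    if (!(x - INT32_MIN > -1.0)) {     /* x <= -2^31 - 1 */
        *status = -1;
        return INT32_MIN;
    }
    *status = 0;
    return (int32_t) x;                /* defined: x is in (-2^31 - 1, 2^31) */
}
```

    Note that in this sketch a NaN input fails the first comparison and is therefore reported as overflow; a caller that needs to distinguish it would have to test for NaN separately.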