floating-point rounding msp430

What's an efficient way to round a signed single precision float to the nearest integer?


float input = whatever;
long output = (long)(0.5f + input);

This is inefficient for my application on an MSP430, using the compiler-supplied floating point addition support library.

I suspect there is a clever trick for this particular kind of 'nearest integer' rounding that avoids plain floating-point addition, perhaps by bit-twiddling the floating-point representation directly, but I have yet to find one. Can anyone suggest such a trick for rounding IEEE 754 32-bit floats?


Solution

  • Conversion by bit operations is straightforward, and is demonstrated by the C code below. Based on the comment about the data types on the MSP430, the code assumes that int comprises 16 bits, and long 32 bits.

    We need a means of transferring the bit pattern of the float to an unsigned long as efficiently as possible. This implementation uses a union for this; your platform may offer a more efficient machine-specific way, e.g. an intrinsic. In the worst case, use memcpy() to copy the bytes.
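
As a minimal sketch of the memcpy() fallback mentioned above: the helper name float_bits is hypothetical, and uint32_t from <stdint.h> is used instead of unsigned long so the snippet also behaves on hosts where long is wider than 32 bits. It assumes float is a 32-bit IEEE 754 single.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of the answer's code): return the raw
   bit pattern of a float. Assumes sizeof(float) == sizeof(uint32_t). */
uint32_t float_bits (float a)
{
    uint32_t i;
    memcpy (&i, &a, sizeof i);   /* byte-wise copy of the representation */
    return i;
}
```

For example, float_bits (1.0f) yields 0x3f800000 under these assumptions. Many compilers reduce a fixed-size memcpy() like this to a plain move, but it is worth checking the generated code for the toolchain in use.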

    There are just a few cases to distinguish. We can examine the exponent field of the float input to tease them apart. If the magnitude of the argument is too large, or the argument is a NaN, the conversion fails. One convention is to return the most negative integer (0x80000000) in this case. If the magnitude of the input is less than 0.5, the result is zero. After eliminating these special cases, we are left with those inputs that require a small amount of computation to convert.
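
The case split described above can be sketched on its own. The enum and function names here are hypothetical, introduced only for illustration; the thresholds 157 and 126 are the biased-exponent limits used in the answer's code, and the input is the raw bit pattern of the float.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */
enum float_class { TOO_LARGE_OR_NAN, ROUNDS_TO_ZERO, NEEDS_CONVERSION };

/* Classify a float, given as its raw IEEE 754 bit pattern ia,
   by looking only at the 8-bit biased exponent field. */
enum float_class classify_bits (uint32_t ia)
{
    uint32_t expo = (ia >> 23) & 0xff;   /* drop mantissa, mask off sign */

    if (expo > 157) {
        return TOO_LARGE_OR_NAN;         /* |x| >= 2**31, infinity, or NaN */
    }
    if (expo < 126) {
        return ROUNDS_TO_ZERO;           /* |x| < 0.5, includes denormals */
    }
    return NEEDS_CONVERSION;             /* 0.5 <= |x| < 2**31 */
}
```

Because the sign bit is masked off, negative inputs fall into the same classes as their positive counterparts.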

    For sufficiently large arguments (magnitude of at least 2**23), a float is always an integer; in those cases we just need to shift the mantissa pattern to the correct bit position. If the input is too small to be guaranteed an integer, we convert to a 32.32 fixed-point format. The rounding is then based on the most significant fraction bit, and in the case of a tie, on the least significant integer bit as well, since ties must be rounded to even.
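
To make the fixed-point step concrete, here is a sketch that isolates it. The function name is hypothetical, and it handles only the magnitude of inputs whose biased exponent lies in 126..149 (0.5 <= |x| < 2**23); sign handling and the other cases are left out.

```c
#include <stdint.h>

/* Hypothetical helper: round the magnitude of a float, given as its raw
   bit pattern ia, to the nearest integer with ties to even.
   Precondition: biased exponent in 126..149. */
uint32_t round_magnitude_ties_even (uint32_t ia)
{
    int shift = (int)((ia >> 23) & 0xff) - 150;          /* -24 .. -1 */
    uint32_t t = (ia & 0x007fffffUL) | 0x00800000UL;     /* add implicit 1 */
    uint32_t r = t >> (-shift);            /* integer part */
    uint32_t frac = t << (32 + shift);     /* fraction bits, left-aligned */

    if (frac > 0x80000000UL) {
        r = r + 1;                         /* above the halfway point */
    } else if (frac == 0x80000000UL) {
        r = r + (r & 1);                   /* tie: round up only if r is odd */
    }
    return r;
}
```

For 2.5f (bit pattern 0x40200000) the integer part is 2 and the fraction is exactly 0x80000000, a tie, so the even value 2 is kept. For 3.5f (0x40600000) the integer part 3 is odd, so the tie rounds up to 4.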

    If tie cases are supposed to always round away from zero, the rounding logic in the code simplifies to

    r = r + (t >= 0x80000000UL);
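
For reference, a complete variant with that simplified tie rule might look like the sketch below. The function name is hypothetical, and it uses int32_t/uint32_t and memcpy() rather than long and a union, which is an assumption made so the snippet also runs on hosts with a 64-bit long.

```c
#include <stdint.h>
#include <string.h>

/* Sketch: round to nearest, ties away from zero.
   Returns INT32_MIN if |a| >= 2**31 or a is a NaN. */
int32_t float_to_int32_round_half_away (float a)
{
    uint32_t ia, r, t, expo;

    memcpy (&ia, &a, sizeof ia);          /* raw IEEE 754 bit pattern */
    expo = (ia >> 23) & 0xff;
    if (expo > 157) {                     /* too large, infinity, or NaN */
        return INT32_MIN;
    }
    if (expo < 126) {                     /* magnitude < 0.5 */
        return 0;
    }
    t = (ia & 0x007fffffUL) | 0x00800000UL;
    if (expo >= 150) {                    /* already an integer */
        r = t << (expo - 150);
    } else {
        int shift = 150 - (int)expo;      /* 1 .. 24 */
        r = t >> shift;                   /* integer part */
        t = t << (32 - shift);            /* fraction bits, left-aligned */
        r = r + (t >= 0x80000000UL);      /* half or more: round up */
    }
    return (ia & 0x80000000UL) ? -(int32_t)r : (int32_t)r;
}
```

With this rule 2.5f becomes 3 and -2.5f becomes -3, whereas round-to-nearest-even gives 2 and -2.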
    

    Below is the function float_to_long_round_nearest(), which implements the approach discussed above, along with a test framework that tests it exhaustively.

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    
    long float_to_long_round_nearest (float a)
    {
        volatile union {
            float f;
            unsigned long i;
        } cvt;
        unsigned long r, ia, t, expo;
    
        cvt.f = a;
        ia = cvt.i;
        expo = (ia >> 23) & 0xff;
        if (expo > 157) {        /* magnitude too large (>= 2**31) or NaN */
            r = 0x80000000UL;
        } else if (expo < 126) { /* magnitude too small ( < 0.5) */
            r = 0x00000000UL;
        } else {
            int shift = (int)expo - 150;  /* convert first to avoid unsigned wrap-around */
            t = (ia & 0x007fffffUL) | 0x00800000UL;
            if (expo >= 150) {   /* argument is an integer, shift left */
                r = t << shift;
            } else {
                r = t >> (-shift);
                t = t << (32 + shift);
                /* round to nearest or even */
                r = r + ((t > 0x80000000UL) | ((t == 0x80000000UL) & (r & 1)));
            }
            if ((long)ia < 0) {  /* negate result if argument negative */
                r = -(long)r;
            }
        }
        return (long)r;
    }
    
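    /* rintf() rounds according to the current rounding mode, which is
       round-to-nearest-even by default. Results for NaNs and out-of-range
       inputs depend on the platform's float-to-long conversion. */
    long reference (float a)
    {
        return (long)rintf (a);
    }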
    
    int main (void)
    {
        volatile union {
            float f;
            unsigned long i;
        } arg;
        long res, ref;
    
        /* walk through all 2**32 bit patterns; relies on 32-bit unsigned long */
        arg.i = 0x00000000UL;
        do {
            res = float_to_long_round_nearest (arg.f);
            ref = reference (arg.f);
            if (res != ref) {
                printf ("arg=%08lx % 15.8e  res=%08lx  ref=%08lx\n",
                        arg.i, arg.f, (unsigned long)res, (unsigned long)ref);
                return EXIT_FAILURE;
            }
            arg.i++;
        } while (arg.i);
        return EXIT_SUCCESS;
    }