c signal-processing fixed-point

Mapping [-1,+1] floats to Q31 fixed-point


I need to convert a float to Q31 fixed-point, where Q31 means 1 sign bit, 0 bits for the integer part, and 31 bits for the fractional part. This means that Q31 can only represent numbers in the range [-1, 1 - 2^-31], roughly [-1, 0.9999999995].

By definition, converting from float to fixed-point means multiplying by 2^N, where N is the number of fractional bits, in this case 31.

However, I am confused by this code. It doesn't look right, but it works:

#define q31_float_to_int(x) ( (int) ( (float)(x)*(float)0x7FFFFFFF ) )

For example:

int a = q31_float_to_int(0.5f); 

gives 0x40000000 in hex, which is correct.

Why is the multiplication here done with 2^31 - 1, and not just 2^31?


Solution

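  • The code above is not a good way to convert from float to fixed point. I am guessing whoever wrote it chose the scale factor 0x7FFFFFFF to avoid an overflow when the input is 1.0. The correct scaling factor is 2^31, not 2^31 - 1. In fact, 0x7FFFFFFF is not exactly representable as a float, so on a typical IEEE 754 platform (float)0x7FFFFFFF rounds to 2^31; the macro therefore multiplies by 2^31 anyway, which is why 0.5f gives exactly 0x40000000, and it still overflows for an input of 1.0. Note that there are also precision issues when converting a float (with 24 bits of precision) to a Q1.31 value (with 31 bits of precision). Consider saturating the input data before the multiplication: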

    const float Q31_MAX_F =  0x0.FFFFFFp0F;
    const float Q31_MIN_F = -1.0F;
    float clamped = fmaxf(fminf(input, Q31_MAX_F), Q31_MIN_F);
    

    The code above clamps input to the range [-1.0, 1.0). The constant Q31_MAX_F is 1 - 2^-24, the largest float below 1.0 given 24 bits of precision, and Q31_MIN_F is -1. You can then multiply clamped by 2^31 or, better, use scalbnf or ldexpf:

    int result = (int) scalbnf(clamped, 31);
    

    And if you want rounding:

    int result = (int) roundf(scalbnf(clamped, 31));