I need to convert a float to Q31 fixed-point, where Q31 means 1 sign bit, 0 integer bits, and 31 fractional bits. This means that Q31 can only represent numbers in the range [-1, 1 - 2^-31], i.e. from -1 up to just under 1.
By definition, converting from float to fixed-point means multiplying by 2^N, where N is the number of fractional bits, in this case 31.
However, this code confuses me. It doesn't look right, but it works:
#define q31_float_to_int(x) ( (int) ( (float)(x)*(float)0x7FFFFFFF ) )
For example:
int a = q31_float_to_int(0.5f);
gives 0x40000000 in hex, which is correct.
Why is the multiplication here done with 2^31 - 1, and not just 2^31?
The code above is not a good way to convert from float to fixed-point. I am guessing whoever wrote it used the scale factor 0x7FFFFFFF to avoid an overflow when the input is 1.0. The correct scaling factor is 2^31, not 2^31 - 1. Note that there are also precision issues when converting a float (with 24 bits of precision) to Q1.31 (with 31 bits of precision). Consider saturating the input data before multiplication:
const float Q31_MAX_F = 0x0.FFFFFFp0F;
const float Q31_MIN_F = -1.0F;
float clamped = fmaxf(fminf(input, Q31_MAX_F), Q31_MIN_F);
The code above clamps input to the range [-1.0, 1.0). The constant Q31_MAX_F is 1 - 2^-24, the largest float below 1.0 given 24 bits of precision, and Q31_MIN_F is -1. Then you can multiply clamped by 2^31 or, even better, use scalbnf or ldexpf:
int result = (int) scalbnf(clamped, 31);
And if you want rounding:
int result = (int) roundf(scalbnf(clamped, 31));