I have a very basic question. In my program, I multiply two fixed-point numbers, as shown below. My inputs are in Q1.31 format, and the output should be in the same format. To do this, I store the result of the multiplication in a temporary 64-bit variable and then perform some operations to get the result into the required format.
#include <stdio.h>

int conversion1(float input, int Q_FORMAT)
{
    return ((int)(input * ((1 << Q_FORMAT)-1)));
}

int mul(int input1, int input2, int format)
{
    __int64 result;
    result = (__int64)input1 * (__int64)input2; //Q2.62 format
    result = result << 1;                       //Q1.63 format
    result = result >> (format + 1);            //33.31 format
    return (int)result;                         //Q1.31 format
}

int main()
{
    int Q_FORMAT = 31;
    float input1 = 0.5, input2 = 0.5;
    int q_input1, q_input2;
    int temp_mul;
    float q_muls;
    q_input1 = conversion1(input1, Q_FORMAT);
    q_input2 = conversion1(input2, Q_FORMAT);
    temp_mul = mul(q_input1, q_input2, Q_FORMAT);
    q_muls = ((float)temp_mul / ((1 << (Q_FORMAT)) - 1));
    printf("result of multiplication using q format = %f\n", q_muls);
    return 0;
}
My question is this: when converting the float input to an integer (and also when converting the int output
back to float), I am using (1<<Q_FORMAT)-1 as the scale factor. But I have seen people use (1<<Q_FORMAT)
directly in their code. The problem I face when using (1<<Q_FORMAT) is that I get the
negative of the desired result.
For example, in my program:
If I use (1<<Q_FORMAT), I get -0.25 as the result.
But if I use (1<<Q_FORMAT)-1, I get 0.25, which is correct.
Where am I going wrong? Do I need to understand any other concepts?
On common platforms, int is a two’s complement 32-bit integer providing 31 digits (plus a 'sign' bit). It's a bit too narrow to represent a Q1.31 number, which requires 32 digits (plus a 'sign' bit).
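In your example, this manifests as arithmetic overflow in the expression 1 << Q_FORMAT. With Q_FORMAT equal to 31, the intended value 2^31 is one more than the largest 32-bit int (2^31 - 1), so the shift overflows. That is formally undefined behaviour in C, and in practice it typically yields a negative number; the negative scale factor is what flips the sign of your result. Subtracting 1 only appears to work because the overflowed value typically wraps around to INT_MAX.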
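
To make the range problem concrete, here is a minimal sketch that computes the scale factor in 64 bits and compares it with INT_MAX. The helper names q_scale and scale_fits_in_int are made up for illustration, and the comments assume a platform where int is 32 bits wide.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper: returns 2^q, computed in 64 bits so that q = 31
   does not overflow the way (1 << 31) does with a 32-bit int. */
int64_t q_scale(int q)
{
    assert(q >= 0 && q < 63);
    return (int64_t)1 << q;
}

/* Hypothetical helper: 1 if 2^q is representable in int, 0 otherwise. */
int scale_fits_in_int(int q)
{
    return q_scale(q) <= (int64_t)INT_MAX;
}
```

With a 32-bit int, scale_fits_in_int(30) is 1 while scale_fits_in_int(31) is 0, which is why a scale of 2^30 is fine in int but 2^31 is not.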
To avoid this, you need to either use a type providing more digits (e.g. long long) or a fixed-point format requiring fewer digits (e.g. Q1.30). You can use unsigned to fix your example but the result will be a 'sign' bit short of Q2.30.
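
As a rough sketch of the first option, the conversions and the multiply can be done with a 64-bit type so that the scale factor 2^31 is representable. The function names to_fixed, to_double and fixed_mul are hypothetical, not taken from the question, and the sketch assumes inputs with magnitude at most 1.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a real value to fixed point with q
   fractional bits, held in int64_t so that 2^31 can be represented. */
int64_t to_fixed(double x, int q)
{
    assert(q >= 0 && q <= 31);
    return (int64_t)(x * (double)((int64_t)1 << q));
}

/* Hypothetical helper: convert a fixed-point value back to a real value. */
double to_double(int64_t v, int q)
{
    return (double)v / (double)((int64_t)1 << q);
}

/* Hypothetical helper: multiply two fixed-point values with q fractional bits.
   For |a|, |b| <= 2^q and q <= 31 the product fits in 64 bits.
   Right-shifting a negative value is implementation-defined in C;
   common compilers perform an arithmetic shift. */
int64_t fixed_mul(int64_t a, int64_t b, int q)
{
    return (a * b) >> q;
}
```

For example, fixed_mul(to_fixed(0.5, 31), to_fixed(0.5, 31), 31) converts back to 0.25 through to_double. For the second option, calling the same helpers with q = 30 keeps every intermediate input value within the range of a 32-bit int.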