First of all we should agree on the definition of the QM.N format. I will follow this resource and its conventions.
For the purposes of this paper the notion of a Q-point for a fixed-point number is introduced. This labeling convention is as follows: Q[QI].[QF] Where QI = # of integer bits & QF = # of fractional bits
For signed integer variable types we will include the sign bit in QI as it does have integer weight albeit negative in sign.
Based on this convention, if I had to represent the number -0.123 in the format Q1.7, I would write it as 1.1110001.
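A minimal sketch of that encoding, assuming the real value is scaled by 2^7 and truncated toward zero (the rounding mode is an assumption; it is the one that reproduces 1.1110001 and the `fixed:` values in the test output further down). The helper names `to_fixed` and `q_bits` are hypothetical.

```python
def to_fixed(value, frac_bits):
    """Scale a real value to a raw integer, truncating toward zero (assumed rounding)."""
    return int(value * (1 << frac_bits))


def q_bits(raw, int_bits, frac_bits):
    """Format a raw integer as a two's-complement 'integer.fraction' bit string."""
    width = int_bits + frac_bits
    # Masking with 2**width - 1 yields the two's-complement pattern for negative values.
    pattern = format(raw & ((1 << width) - 1), "0{}b".format(width))
    return pattern[:int_bits] + "." + pattern[int_bits:]


raw = to_fixed(-0.123, 7)
print(raw, q_bits(raw, 1, 7), raw / 128)
```

The stored value is -15/128 = -0.1171875, so the quantisation error is already visible in the third decimal place.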
The theory says that:
When performing an integer multiplication the product is 2xWL if both the multiplier and multiplicand are WL long. If the integer multiplication is on fixed-point variables, the number of integer and fractional bits in the product is the sum of the corresponding multiplier and multiplicand Q-points as described by the following equations
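Written out from the sentence above (the subscript names are my own labels, not necessarily the ones used in the paper):

```latex
\begin{aligned}
QI_{\text{product}} &= QI_{\text{multiplicand}} + QI_{\text{multiplier}} \\
QF_{\text{product}} &= QF_{\text{multiplicand}} + QF_{\text{multiplier}}
\end{aligned}
```

For two Q1.7 operands this gives a Q2.14 product, 16 bits in total.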
Knowing this is useful because after multiplication we have double precision, and we need to rescale the output to our input precision. Knowing where the integer part is allows us to prevent overflow and to pick the relevant bits, as in the example where the long string is the result of the multiplication:
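A minimal sketch of that rescaling step, assuming raw Q1.7 operands held in plain integers, truncation by an arithmetic right shift, and a saturating clamp in case the result does not fit (the clamp and the name `mul_q17` are assumptions for illustration, not part of the referenced paper):

```python
FRAC_BITS = 7                    # Q1.7 operands
Q17_MIN, Q17_MAX = -128, 127     # raw range of a signed 8-bit Q1.7 value


def mul_q17(a, b):
    """Multiply two raw Q1.7 integers and return a raw Q1.7 result."""
    product = a * b                    # Q2.14: 2 integer bits, 14 fractional bits
    rescaled = product >> FRAC_BITS    # drop 7 fractional bits (rounds toward -infinity)
    # Clamp to the Q1.7 range so an out-of-range result saturates instead of wrapping.
    return max(Q17_MIN, min(Q17_MAX, rescaled))


print(mul_q17(66, 7))        # raw operands from Test 0 below
print(mul_q17(-92, -115))    # raw operands from Test 3 below
```

Dividing the returned raw value by 128 gives the real value of the Q1.7 result.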
However, when multiplying two Q1.7 numbers of the form 0.xyz, I have noticed that the integer part never grows, which would let me keep only one bit of the integer part. I have written a piece of code that keeps only the fractional part after multiplication, and here are the results.
Test 0
Testing +0.5158*+0.0596
A:real_val:+0.5156 fixed: 66 int: 0 frac: 1000010
B:real_val:+0.0547 fixed: 7 int: 0 frac: 0000111
C: real_val:+0.0282 fixed: 462 int: 00 frac: 00000111001110
Floating multiplication: +0.0307
Test 1
Testing +0.4842*-0.9558
A:real_val:+0.4766 fixed: 61 int: 0 frac: 0111101
B:real_val:-0.9531 fixed: -122 int: 1 frac: 0000110
C: real_val:-0.4542 fixed: -7442 int: 11 frac: 10001011101110
Floating multiplication: -0.4628
Test 2
Testing +0.2812*+0.2433
A:real_val:+0.2734 fixed: 35 int: 0 frac: 0100011
B:real_val:+0.2422 fixed: 31 int: 0 frac: 0011111
C: real_val:+0.0662 fixed: 1085 int: 00 frac: 00010000111101
Floating multiplication: +0.0684
Test 3
Testing -0.7235*-0.9037
A:real_val:-0.7188 fixed: -92 int: 1 frac: 0100100
B:real_val:-0.8984 fixed: -115 int: 1 frac: 0001101
C: real_val:+0.6458 fixed: 10580 int: 00 frac: 10100101010100
Floating multiplication: +0.6538
My question is whether I am overlooking anything here, or whether this is normal and expected behaviour for fixed-point numbers. If it is, I will be happy that my numbers never overflow during multiplication.
Basically, what I mean is that after multiplying two Q1.X numbers of the form 0.xyz, the integer part will always be 0 (if the result is positive) or 1111... (if the result is negative).
So my accumulator register will hold only 2*X meaningful bits, and I can take just those, plus the sign.
No, the number of bits in the result is still the sum of the bits in the inputs.
Summary:
Signed Q1.31 times signed Q1.31 equals signed Q2.62.
Unsigned Q1.31 times unsigned Q1.31 equals unsigned Q2.62.
Explanation:
Unsigned Q1.n numbers can represent from zero (inclusive) to two (exclusive). If you multiply two such numbers together the range of results is from zero (inclusive) to 4 (exclusive). Just less than four is three point something, and three fits in the two bits above the point.
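The unsigned case can be checked by brute force. This is a sketch that assumes Q1.7 rather than Q1.31, purely so that every operand pair can be enumerated; the argument is the same for any n.

```python
FRAC_BITS = 7                                   # Q1.7 stands in for Q1.31 (assumption)
raw_values = range(1 << (1 + FRAC_BITS))        # unsigned raw values 0..255

# Largest raw product over every pair of unsigned Q1.7 operands.
largest = max(a * b for a in raw_values for b in raw_values)

real_value = largest / (1 << (2 * FRAC_BITS))   # the product has 14 fractional bits
int_bits_needed = largest.bit_length() - 2 * FRAC_BITS

print(largest, real_value, int_bits_needed)
```

The largest product is just under 4, so it needs both integer bits of Q2.14.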
Signed Q1.n numbers can represent from negative one (inclusive) to one (exclusive). If you multiply two such numbers together the range of results is negative one (exclusive) to one (inclusive). Signed Q1.31 times signed Q1.31 would fit in Q1.62 except for the single case -1.0 times -1.0 equals +1.0, which requires the extra bit above the point.
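The same brute-force check for the signed case, again assuming Q1.7 operands in place of Q1.31 so that the loop stays small. It collects every operand pair whose product does not fit in Q1.14 (one sign bit plus 14 fractional bits).

```python
FRAC_BITS = 7                                        # Q1.7 stands in for Q1.31 (assumption)
raw_values = range(-128, 128)                        # signed raw values of an 8-bit Q1.7

# Raw range of a signed Q1.14 value: 15 bits in total.
Q1_14_MIN = -(1 << (2 * FRAC_BITS))                  # -16384, i.e. -1.0
Q1_14_MAX = (1 << (2 * FRAC_BITS)) - 1               # 16383, just below +1.0

overflowing = [(a, b) for a in raw_values for b in raw_values
               if not Q1_14_MIN <= a * b <= Q1_14_MAX]

print(overflowing)   # [(-128, -128)]
```

Only -1.0 times -1.0 falls outside Q1.14, so if that operand pair can be ruled out or saturated, the product's top bit is just a copy of the sign bit.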
The equations in your question apply equally in both these cases.