
Integer part bit growth for fixed point numbers of the 0.xyz kind


First of all we should agree on the definition of the QM.N format. I will follow this resource and its conventions.

For the purposes of this paper the notion of a Q-point for a fixed-point number is introduced. This labeling convention is as follows: Q[QI].[QF] Where QI = # of integer bits & QF = # of fractional bits

For signed integer variable types we will include the sign bit in QI as it does have integer weight albeit negative in sign.

Based on this convention, if I had to represent the number -0.123 in Q1.7 format, I would write it as 1.1110001 (the raw value -15, i.e. -15/128 ≈ -0.117 after truncating toward zero).
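As a minimal sketch of that encoding (the helper names `to_q1_7` and `q1_7_bits` are made up for illustration, and truncation toward zero is an assumption inferred from the numbers in this question), the conversion to Q1.7 could look like this:

```python
def to_q1_7(x: float) -> int:
    """Convert a real value to a raw Q1.7 integer, truncating toward zero."""
    raw = int(x * 128)               # 128 = 2**7; int() truncates toward zero
    return max(-128, min(127, raw))  # saturate to the signed 8-bit range


def q1_7_bits(raw: int) -> str:
    """Render a raw Q1.7 integer as 'integer.fraction' in two's complement."""
    bits = format(raw & 0xFF, "08b")  # mask to 8 bits to get the two's-complement pattern
    return bits[0] + "." + bits[1:]


raw = to_q1_7(-0.123)
print(raw, q1_7_bits(raw), raw / 128)  # raw value, bit pattern, value actually stored
```

The stored value differs from -0.123 because only 7 fractional bits are available.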

The theory says that:

When performing an integer multiplication the product is 2xWL if both the multiplier and multiplicand are WL long. If the integer multiplication is on fixed-point variables, the number of integer and fractional bits in the product is the sum of the corresponding multiplier and multiplicand Q-points as described by the following equations

QI_product = QI_multiplier + QI_multiplicand
QF_product = QF_multiplier + QF_multiplicand
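The Q-point bookkeeping in the quoted rule can be sketched in a few lines. `QFormat` and `product_format` are hypothetical names used only for this illustration; they simply add the integer and fractional bit counts as the quote describes.

```python
from typing import NamedTuple


class QFormat(NamedTuple):
    """Hypothetical label for a Q[QI].[QF] number; the sign bit counts in QI."""
    qi: int  # integer bits (including the sign bit for signed types)
    qf: int  # fractional bits

    @property
    def word_length(self) -> int:
        return self.qi + self.qf


def product_format(a: QFormat, b: QFormat) -> QFormat:
    """Q-point of a full-precision product: bit counts add component-wise."""
    return QFormat(a.qi + b.qi, a.qf + b.qf)


q1_7 = QFormat(1, 7)
print(product_format(q1_7, q1_7))  # QFormat(qi=2, qf=14)
```

For two Q1.7 inputs this gives a 16-bit product with two integer bits and fourteen fractional bits, which matches the `int` and `frac` widths in the test output further down.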

Knowing this is useful because after multiplication we have double precision, and we need to rescale the output to our input precision. Knowing where the integer part is allows us to prevent overflow and to pick the relevant bits, as in the example where the long string is the result of the multiplication:

[Image: example of picking the relevant bits out of the double-width product]
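One possible way to rescale the double-width product back to the input precision is sketched below. It assumes Q1.7 inputs held as raw integers in -128..127, drops the extra fractional bits with an arithmetic shift, and saturates rather than wrapping; `mul_q1_7` is an illustrative name, and other rounding or overflow policies are equally valid.

```python
def mul_q1_7(a: int, b: int) -> int:
    """Multiply two raw Q1.7 values and return a raw Q1.7 result."""
    product = a * b        # full-precision Q2.14 product
    scaled = product >> 7  # drop 7 fractional bits -> Q2.7 (shift rounds toward -infinity)
    # Q2.7 has one more integer bit than Q1.7, so clamp to the 8-bit signed range.
    return max(-128, min(127, scaled))


print(mul_q1_7(66, 7))       # +0.5156 * +0.0547
print(mul_q1_7(-128, -128))  # -1.0 * -1.0 does not fit in Q1.7 and saturates
```

The clamp only ever takes effect for -1.0 times -1.0; every other pair of Q1.7 inputs already fits after the shift.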

However, when multiplying two Q1.7 numbers of the form 0.xyz, I have noticed that the integer part never grows, so I only need to keep one bit of the integer part. I have written a piece of code that splits the product into its integer and fractional parts, and here are the results.

 Test 0 

Testing +0.5158*+0.0596
A:real_val:+0.5156   fixed: 66   int: 0  frac: 1000010 
B:real_val:+0.0547   fixed: 7    int: 0  frac: 0000111 
C: real_val:+0.0282  fixed: 462  int: 00     frac: 00000111001110 
Floating multiplication: +0.0307 

 Test 1 

Testing +0.4842*-0.9558
A:real_val:+0.4766   fixed: 61   int: 0  frac: 0111101 
B:real_val:-0.9531   fixed: -122     int: 1  frac: 0000110 
C: real_val:-0.4542  fixed: -7442    int: 11     frac: 10001011101110 
Floating multiplication: -0.4628 

 Test 2 

Testing +0.2812*+0.2433
A:real_val:+0.2734   fixed: 35   int: 0  frac: 0100011 
B:real_val:+0.2422   fixed: 31   int: 0  frac: 0011111 
C: real_val:+0.0662  fixed: 1085     int: 00     frac: 00010000111101 
Floating multiplication: +0.0684 

 Test 3 

Testing -0.7235*-0.9037
A:real_val:-0.7188   fixed: -92  int: 1  frac: 0100100 
B:real_val:-0.8984   fixed: -115     int: 1  frac: 0001101 
C: real_val:+0.6458  fixed: 10580    int: 00     frac: 10100101010100 
Floating multiplication: +0.6538 
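The `C:` lines above can be reproduced with a short sketch like the following. This is not the code used to generate the output (which is not shown in the question); `bits` and `describe_product` are hypothetical helpers that split the 16-bit Q2.14 product into its two integer bits and fourteen fractional bits.

```python
def bits(value: int, width: int) -> str:
    """Two's-complement bit string of `value`, `width` bits wide."""
    return format(value & ((1 << width) - 1), f"0{width}b")


def describe_product(a: int, b: int):
    """Return (raw product, integer bits, fractional bits, real value) for Q1.7 inputs."""
    c = a * b            # Q2.14 product of two raw Q1.7 values
    word = bits(c, 16)   # 2 integer bits followed by 14 fractional bits
    return c, word[:2], word[2:], c / 2**14


for a, b in [(66, 7), (61, -122), (35, 31), (-92, -115)]:
    c, int_part, frac_part, real = describe_product(a, b)
    print(f"fixed: {c}  int: {int_part}  frac: {frac_part}  real_val: {real:+.4f}")
```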

My question is whether I am overlooking something here, or whether this is normal, expected behaviour for fixed-point multiplication. If it is, I am happy that my numbers can never overflow during multiplication.

In other words: after multiplying two Q1.X numbers of the form 0.xyz, the integer part is always 00.. if the result is positive, or 11.. if the result is negative.

So my accumulator register holds only 2*X meaningful bits, and I can keep just those plus the sign.


Solution

  • No, the number of bits in the result is still the sum of the bits in the inputs.

    Summary:

    Signed Q1.31 times signed Q1.31 equals signed Q2.62.

    Unsigned Q1.31 times unsigned Q1.31 equals unsigned Q2.62.

    Explanation:

    Unsigned Q1.n numbers can represent from zero (inclusive) to two (exclusive). If you multiply two such numbers together the range of results is from zero (inclusive) to 4 (exclusive). Just less than four is three point something, and three fits in the two bits above the point.

    Signed Q1.n numbers can represent from negative one (inclusive) to one (exclusive). If you multiply two such numbers together the range of results is negative one (exclusive) to one (inclusive). Signed Q1.31 times signed Q1.31 would fit in Q1.62 except for the single case -1.0 times -1.0 equals +1.0, which requires the extra bit above the point.

    The equations in your question apply equally in both these cases.
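For a small word length the explanation above can be checked exhaustively. The sketch below uses Q1.7 instead of Q1.31 so that every pair can be enumerated; `fits_signed` is an illustrative helper, not part of any library.

```python
def fits_signed(value: int, total_bits: int) -> bool:
    """True if `value` fits in a two's-complement word of `total_bits` bits."""
    return -(1 << (total_bits - 1)) <= value <= (1 << (total_bits - 1)) - 1


# Signed Q1.7 raw values are -128..127; a Q1.14 result would be a 15-bit signed word.
signed_misfits = [
    (a, b)
    for a in range(-128, 128)
    for b in range(-128, 128)
    if not fits_signed(a * b, 15)
]
print(signed_misfits)  # only -1.0 * -1.0 needs the second integer bit

# Unsigned Q1.7 raw values are 0..255; the product is unsigned Q2.14.
unsigned_max = max(a * b for a in range(256) for b in range(256))
print(unsigned_max >> 14)  # integer part of the largest product: "three point something"
```

So for signed inputs that never take the value -1.0, as in the tests in the question, the second integer bit is always a copy of the sign bit.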