Search code examples
c++visual-c++fixed-point

How to fix the position of binary point in an unsigned N-bit interger?


I am working on developing a fixed point algorithm in C++. I know that, for a N-bit integer, the fixed point binary integer is represented as U(a,b). For example, for an 8 bit Integer (i.e 256 samples), If we represent it in the form U(6,2), it means that the binary point is to the left of the 2nd bit starting from the right of the form:

                   b5 b4 b3 b2 b1 b0 . b(-1) b(-2)

Thus , it has 6 integer bits and 2 fractional bits. In C++, I know there are some bit shift operators I can use, but they are basically used for shifting the bits of the input stream, my question is, how to define a binary fixed point integer of the form, fix<6,2> or U(6,2). All the major processing operation will be carried out on the fractional part and I am just finding a way to do this fix in C++. Any help regarding this would be appreciated.Thanks!

Example : Suppose I have an input discrete signal with 1024 sample points on x-axis (For now just think this input signal is coming from some sensor). Each of this sample point has a particular amplitude. Say the sample at time 2(x-axis) has an amplitude of 3.67(y-axis). Now I have a variable "int *input;" that takes the sample 2, which in binary is 0000 0100. So basically I want to make this as 00000.100 by performing the U(5,3) on the sample 2 in C++. So that I can perform the interpolation operations on fractions of the input sampling period or time.

PS - I don't want to create a separate class or use external libraries for this. I just want to take each 8 bits from my input signal, perform the U(a,b) fix on it followed by rest of the operations are done on the fractional part.


Solution

  • Short answer: left shift.

    Long answer:

    1. Fixed point numbers are stored as integers, usually int, which is the fastest integer type for a particular platform.

    2. Normal integer without fractional bits are usually called Q0, Q.0 or QX.0 where X is the total number of bits of underlying storage type(usually int).

    3. To convert between different Q.X formats, left or right shift. For example, to convert 5 in Q0 to 5 in Q4, left shift it 4 bits, or multiply it by 16.

    4. Usually it's useful to find or write a small fixed point library that does basic calculations, like a*b>>q and (a<<q)/b. Because you will do Q.X=Q.Y*Q.Z and Q.X=Q.Y/Q.Z a lot and you need to convert formats when doing calculations. As you may have observed, using normal * operator will give you Q.(X+Y)=Q.X*Q.Y, so in order to fit the result into Q.Z format, you need to right shift the result by (X+Y-Z) bits.

    5. Division is similar, you get Q.(X-Y)=Q.X*Q.Y form the standard / operator, and to get the result in Q.Z format you shift the dividend before the division. What's different is that division is an expensive operation, and it's not trivial to write a fast one from scratch.
    6. Be aware of double-word support of your platform, it will make your life a lot easier. With double word arithmetic, result of a*b can be twice the size of a or b, so that you don't lose range by doing a*b>>c. Without double word, you have to limit the input range of a and b so that a*b doesn't overflow. This is not obvious when you first start, but soon you will find you need more fractional bits or rage to get the job done, and you will finally need to dig into the reference manual of your processor's ISA.

    example:

    float a = 0.1;// 0.1
    int aQ16 = a*65536;// 0.1 in Q16 format
    int bQ16 = 4<<16// 4Q16
    int cQ16 = a*b>>16 // result = 0.399963378906250Q16 = 26212, 
                       // not 0.4Q16 = 26214 because of truncating error