
Fixed Point Multiplication of Unsigned numbers


I am trying to solve a multiplication problem with fixed point numbers. The numbers are 32 bit. My architecture is 8 bit. So here goes:

  1. I am using 8.8 notation i.e., 8 for integer, 8 for fraction.

  2. I have 0x0A78, which is 10.468. I take its two's complement, which is 0xFFFFF588, truncate it to 16 bits as 0xF588, and store that. The reason is that I only want to multiply two 2-byte numbers.

  3. Now when I multiply this 0xF588 (negative 10.468, i.e. the negation of 0x0A78) by 0xFF4B, which is the two's complement of 0x00B5 (0.707), the answer should be 0x0766, or something like it.

What I get on the other hand is 66D8.

Now here is where it gets interesting: if I store the negative of 0xB5 as a 32-bit two's complement value, the product is 0xFF5266D8. Shifting it right by 8 bits and then truncating to 16 bits gives 0x5266.

On the other hand, if I instead store my negative 10.468 in 32 bits, the product is 0xF58F66D8, which after shifting right by 8 bits and truncating becomes 0x8F66.

Only if I store both numbers in 32-bit format do I get the correct result after shifting and truncation, which is 0x0766.
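
The four cases above can be reproduced with a minimal C sketch. The helper names `raw_product` and `to_8_8` are hypothetical, chosen only for illustration, and the sketch assumes the multiplication is carried out in unsigned 32-bit arithmetic, as the hex values above suggest.

```c
#include <stdint.h>

/* Multiply two 32-bit operands; unsigned arithmetic wraps modulo 2^32. */
static uint32_t raw_product(uint32_t a, uint32_t b)
{
    return a * b;
}

/* Drop the 8 extra fraction bits of the product, then keep the low
   16 bits as an 8.8 value. */
static uint16_t to_8_8(uint32_t product)
{
    return (uint16_t)(product >> 8);
}
```

With these helpers, `raw_product(0x0000F588u, 0x0000FF4Bu)` is 0xF4DA66D8, whose low 16 bits are the 0x66D8 reported above. `to_8_8(raw_product(0x0000F588u, 0xFFFFFF4Bu))` is 0x5266, `to_8_8(raw_product(0xFFFFF588u, 0x0000FF4Bu))` is 0x8F66, and `to_8_8(raw_product(0xFFFFF588u, 0xFFFFFF4Bu))` is 0x0766.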

Why is this happening? I understand that loss of information is intrinsic when going from 32 to 16 bits, but 0x07 is very different from 0x52. I would be grateful for a response.


Solution

  • Let’s look at just the integer representations. You have two 16-bit integers, x and y, and you form their 16-bit two’s complements. However, you keep these 16-bit complements in 32-bit objects. In 32 bits, what you have is 65536–x and 65536–y. (For example, you started with 0xa78, complemented it to make 0xfffff588, and discarded bits to get 0xf588. That equals 0x10000-0xa78.)

    When you multiply these, the result is 65536•65536 – 65536•x – 65536•y + xy.

    The 65536•65536 is 2^32, so it vanishes because unsigned 32-bit arithmetic is performed modulo 2^32. You are left with – 65536•x – 65536•y + xy.

    Now you can see the problem: xy is the product of two 16-bit values, so it flows into the high 16 bits of the 32 bits. Up there, you still have – 65536•x – 65536•y, which you do not want.

    An easy way to do this multiplication is to keep all 32 bits of the complement. E.g., when you took the two’s complement of 0xa78, you got 0xfffff588. Then you discarded the high bits, keeping only 0xf588. If you do not do that, you will multiply 0xfffff588 by 0xffffff4b, and the product will be 0x766d8, which, when shifted right by 8 bits for the fraction, will be 0x766, the result you want.

    If the high bits are lost because you stored the two’s complement into a 16-bit object, then simply restore them when you reload the object, by extending the sign bit. That is, take bit 15 and repeat it in bits 16 to 31. An easy way to do this is to load the 16-bit object into a 16-bit signed integer, then convert the 16-bit signed integer to an unsigned 32-bit integer.
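
The sign-extension step described above might look like the following C sketch. The function names `sign_extend_16` and `fixed_mul_8_8` are hypothetical, and the sketch assumes the operands are stored as 16-bit two's complement 8.8 values. It copies bit 15 into bits 16 to 31 with an explicit mask; loading through a 16-bit signed integer, as in `(uint32_t)(int16_t)stored`, typically gives the same result but relies on an implementation-defined conversion in older C standards.

```c
#include <stdint.h>

/* Reload a stored 16-bit two's complement value into 32 bits,
   repeating bit 15 in bits 16..31. */
static uint32_t sign_extend_16(uint16_t stored)
{
    uint32_t value = stored;
    if (value & 0x8000u) {
        value |= 0xFFFF0000u;   /* negative: fill the high half with ones */
    }
    return value;
}

/* Multiply two 8.8 values held as 16-bit two's complement patterns. */
static uint16_t fixed_mul_8_8(uint16_t a, uint16_t b)
{
    /* Full 32-bit operands, so no stray 65536*x or 65536*y terms
       remain in the high half of the product. */
    uint32_t product = sign_extend_16(a) * sign_extend_16(b);

    /* The product has 16 fraction bits; shift out 8, keep 16 bits. */
    return (uint16_t)(product >> 8);
}
```

With the values from the question, `fixed_mul_8_8(0xF588u, 0xFF4Bu)` yields 0x0766.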