
8-bit unsigned fixed point implementation with multiplication and clamping


I'd like to represent numbers in the range [0.0, 1.0] (ideally including both endpoints) using 8-bit words.

I'd like to be able to multiply them efficiently, and addition and subtraction should ideally clamp to [0, 1] rather than overflow.

For example, if 0xFF represents 1.0 and 0x00 represents 0.0, then multiplication should yield, for example:

0x3F (0.247) = 0x7F (0.499) * 0x7F (0.499)

I found https://courses.cs.washington.edu/courses/cse467/08au/labs/l5/fp.pdf and I think that what the paper calls U(0,8) corresponds to what I'm looking for, but I don't understand how multiplication, for example, would need to be implemented.

Is there a C++ library that efficiently implements such a data type, or can someone point me to the necessary math?

I don't need division, only multiplication, addition and subtraction.


Solution

  • The fixed-point format you have chosen, U(0,8), does not include the exact endpoint value of 1. An 8-bit value v represents v/256, so the maximum value in this format is 255/256 = 0.99609375. If that's close enough for you, we can talk about doing the math.

    Multiplying two U(0,8) values gives a 16-bit result in U(0,16) format. To convert back to U(0,8) you must shift right by 8 bit positions, which truncates the low-order bits. So, multiplying 0x7F by 0x7F gives 0x3F01, and shifting right by 8 bits gives the U(0,8) result of 0x3F, as desired.

    Two values in U(0,8) format can be added or subtracted using normal integer operations. However, you must either prevent overflow and underflow or detect them in the result. To detect overflow in addition, you could zero-extend both values to 16 bits, perform the addition, and check whether the result is greater than 0xFF. If so, saturate and return 0xFF.

    For subtraction, you could compare the values before subtracting and, if the result would be negative, just return zero.
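
The three operations described above can be sketched in C++ roughly as follows. This is a minimal illustration, not a library: the alias `u0_8` and the function names `mul_u0_8`, `add_sat_u0_8` and `sub_sat_u0_8` are made up for this example, and it assumes the interpretation from the answer, where a stored byte v stands for v/256.

```cpp
#include <cstdint>

// Hypothetical alias for this sketch: a byte v represents the value v / 256.
using u0_8 = std::uint8_t;

// Multiply: widen first so the full 16-bit product is available,
// then shift right by 8 to drop the low 8 fractional bits (truncation).
constexpr u0_8 mul_u0_8(u0_8 a, u0_8 b) {
    const unsigned product = static_cast<unsigned>(a) * b;
    return static_cast<u0_8>(product >> 8);
}

// Saturating add: compute the sum in a wider type and clamp to 0xFF.
constexpr u0_8 add_sat_u0_8(u0_8 a, u0_8 b) {
    const unsigned sum = static_cast<unsigned>(a) + b;
    return static_cast<u0_8>(sum > 0xFFu ? 0xFFu : sum);
}

// Saturating subtract: compare first and return zero instead of wrapping.
constexpr u0_8 sub_sat_u0_8(u0_8 a, u0_8 b) {
    return a > b ? static_cast<u0_8>(a - b) : u0_8{0};
}
```

With these definitions, `mul_u0_8(0x7F, 0x7F)` gives 0x3F, matching the worked example. Because the shift truncates, `mul_u0_8(0xFF, 0xFF)` gives 0xFE rather than 0xFF; adding a rounding term before the shift is one possible refinement if that bias matters.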