Tags: floating-point, ieee-754, fixed-point

Why does a fixed-point scaling factor tend to be a power of two?


Assume we have two floating-point values: 1.23 and 4.56. To represent and add these on a machine without floating-point support, we will have to fall back to a fixed-point representation.

So we pick the number 100 as a scaling factor, simply to get rid of the decimal points:

1 - Multiply them by the scaling factor => 123 and 456

2 - Add them: 123 + 456 = 579

3 - Divide the sum by the same scaling factor => 579 / 100 = 5.79

Which is equal to the floating-point addition 1.23 + 4.56 = 5.79
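Here is a minimal C sketch of the scheme above, assuming a scaling factor of 100 and hard-coded values purely for illustration:

```c
#include <stdio.h>

#define SCALE 100  /* decimal scaling factor: two fractional digits */

int main(void) {
    int a = 123;      /* 1.23 represented as 1.23 * 100 */
    int b = 456;      /* 4.56 represented as 4.56 * 100 */

    int sum = a + b;  /* 579, i.e. 5.79 * 100 */

    /* print integer and fractional parts; no floating point needed */
    printf("%d.%02d\n", sum / SCALE, sum % SCALE);  /* prints 5.79 */
    return 0;
}
```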

Now, why do I keep reading in online articles that the scaling factor tends to be a power of two?

https://en.wikipedia.org/wiki/Scale_factor_(computer_science)

https://www.allaboutcircuits.com/technical-articles/fixed-point-representation-the-q-format-and-addition-examples/

If I choose, say, 2^5 = 32 as my scaling factor, then we have:

-> 1.23 * 32 = 39.36 ~= 39
-> 4.56 * 32 = 145.92 ~= 145
-> 39 + 145 = 184
-> 184 / 32 = 5.75

The result of 5.75 is not even exact. So why do we pick a power of 2? Why don't we just pick a power of 10 as the factor?
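For reference, a small C program that reproduces the arithmetic above, truncating toward zero as in the example:

```c
#include <stdio.h>

int main(void) {
    int scale = 32;               /* 2^5 */

    int a = (int)(1.23 * scale);  /* 39.36  truncated -> 39  */
    int b = (int)(4.56 * scale);  /* 145.92 truncated -> 145 */
    int sum = a + b;              /* 184 */

    printf("%d + %d = %d\n", a, b, sum);
    printf("result: %.2f\n", sum / (double)scale);  /* 5.75, not 5.79 */
    return 0;
}
```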

Edit

I have also seen in posts such as this one: https://spin.atomicobject.com/2012/03/15/simple-fixed-point-math/

That a power of two is chosen because computers can compute with it quickly, i.e. scaling by 2^16 can be done with a bit shift (1 << 16), while scaling by a power of 10 can't be computed as fast.
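That is the speed argument: with a power-of-two scale, converting to and from fixed point, and rescaling after a multiplication, are all single shifts rather than divisions. A sketch in C, using the common Q16.16 layout purely as an illustration (positive values only, to sidestep shift rules for negative integers):

```c
#include <stdint.h>
#include <stdio.h>

#define FRAC_BITS 16  /* scale = 2^16 = 65536 */

typedef int32_t q16_16;

/* convert integer <-> fixed point with shifts, not multiplies/divides */
static q16_16 to_fix(int32_t n) { return n << FRAC_BITS; }
static int32_t to_int(q16_16 x) { return x >> FRAC_BITS; }

/* multiply: widen, multiply, then shift the extra scale factor back out */
static q16_16 mul_fix(q16_16 a, q16_16 b) {
    return (q16_16)(((int64_t)a * b) >> FRAC_BITS);
}

int main(void) {
    q16_16 a    = to_fix(3);             /* 3.0 */
    q16_16 half = 1 << (FRAC_BITS - 1);  /* 0.5 */
    q16_16 p    = mul_fix(a, a + half);  /* 3.0 * 3.5 = 10.5 */

    printf("%d.%04d\n", to_int(p),
           (int)(((int64_t)(p & 0xFFFF) * 10000) >> FRAC_BITS));  /* 10.5000 */
    return 0;
}
```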

So is that it? Do we basically destroy precision to save a bit of latency (if any)?


Solution

  • Which is equal to the floating-point addition 1.23 + 4.56 = 5.79

    Not quite.

    1.23, 4.56, and 5.79 as source code are exact decimal values. Encoded as binary64 floating point, they are not exactly representable. Much like 0.3333 is not exactly one-third, IEEE-754 binary uses nearby values, within 1 part in 2^53. Thus the addition may produce the expected sum, or a very close but different sum may occur.
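    You can see this by printing more digits than the default formatting shows; a quick C check, assuming binary64 doubles:

    ```c
    #include <stdio.h>

    int main(void) {
        double a = 1.23, b = 4.56;

        /* show the values binary64 actually stores, to 20 decimal places */
        printf("a     = %.20f\n", a);      /* slightly off from 1.23 */
        printf("b     = %.20f\n", b);      /* slightly off from 4.56 */
        printf("a + b = %.20f\n", a + b);  /* very close to, but not exactly, 5.79 */
        return 0;
    }
    ```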

    why do I keep reading in online articles that the scaling factor tends to be a power of two?

    With binary floating point, scaling by powers of 2 injects no precision loss. Multiplying by 2^k only changes the exponent; the significand is untouched, so (barring overflow or underflow) the product is exactly as accurate as its pre-scaled value.
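    A sketch of why, assuming binary64 and using rand() only to get varied inputs: scaling by 2^5 = 32 merely bumps the exponent, so the round trip always returns the original value (barring overflow or underflow):

    ```c
    #include <assert.h>
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        for (int i = 0; i < 1000000; i++) {
            double x = rand() / (double)RAND_MAX;  /* arbitrary value in [0, 1] */

            /* multiplying by 2^5 only adjusts the exponent: exact both ways */
            assert(x * 32.0 == ldexp(x, 5));       /* same as an exponent bump */
            assert((x * 32.0) / 32.0 == x);        /* exact round trip */
        }
        puts("power-of-two scaling round-tripped exactly every time");
        return 0;
    }
    ```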

    Why don't we just pick a power of 10 as the factor?

    Scaling by powers of 10 works well on paper (classical math), yet with binary floating point the product is usually not exactly representable, so a rounded value is used instead. Thus the scaling itself injects a little error.
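    A one-line illustration, again assuming binary64: in classical math 1.1 * 100 is exactly 110, but neither 1.1 nor the true product is representable in binary:

    ```c
    #include <stdio.h>

    int main(void) {
        /* the decimal scale factor already rounds at the first multiplication */
        printf("%.17g\n", 1.1 * 100.0);  /* typically prints 110.00000000000001 */
        return 0;
    }
    ```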

    So is that it? Do we basically destroy precision to save a bit of latency (if any)?

    No, there are many more issues. Because there are so many issues and speed is important, manufacturers of floating-point hardware needed an incredibly specific standard: IEEE-754. Even after 40 years, corner cases still come up. A decimal version of IEEE-754 has also existed for over 20 years. That portion of the overall spec is slowly getting realized in hardware instead of the slooooow software decimal floating-point implementations. Until the marketplace drives wider acceptance, binary floating point, with its small differences from classical math (1.23 + 4.56 not summing to exactly 5.79), will continue to dominate over decimal floating point.