floating-point rounding multiplication floating-accuracy fixed-point

How does rounding work in float multiplication?


The exact value of float 0.7f is 0.69999...
so I thought the result of 0.7f * 100f would be something below 70, like 69.99999...
But the result is exactly 70.
Does float multiplication involve such rounding?

If so, is such post-processing applicable in fixed-point as well?
I found that in some fixed-point libraries, FP(100) * FP(0.7) is 69.99999.
When casting this to int, they ruthlessly truncate and I get 69, which is undesirable since FP can express exactly 70.


Solution

  • Finite-precision binary floating point cannot represent every real (or every decimal) number exactly, but it should always represent the closest possible number.

    Finite-precision floating point arithmetic cannot compute every possible result, either, but it is required (at least in ordinary cases) to compute a properly-rounded result.

    The exact (closest representable) single-precision float corresponding to 0.7 is 0b0.101100110011001100110011, which when converted back to decimal is 0.699999988079071044921875. Notice that this number has 24 significant bits, which is part of the definition of IEEE-754 single precision.
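The stored value can be checked with a short sketch. It assumes Python's standard `struct` module is used to round a double to single precision; `to_f32` is a helper name introduced here for illustration, not part of any library.

```python
import struct
from decimal import Decimal


def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


x = to_f32(0.7)

# Decimal(float) converts the stored binary value exactly, digit for digit.
print(Decimal(x))            # 0.699999988079071044921875

# x is exactly significand / 2**24, so scaling by 2**24 recovers the bits.
significand = int(x * 2**24)
print(bin(significand))      # 0b101100110011001100110011
print(significand.bit_length())  # 24
```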

    Multiplying this number by 100 would give us 0b1000101.1111111111111111111011. But that number has 29 significant bits, so it won't fit in the 24 bits of significance available in single precision. So we have to round it off. Now, the 25th bit is 1 and there are nonzero bits after it, so the discarded part is more than half a unit in the last place and we round up. All 17 fraction bits to the left of that 25th bit are 1, so the carry ripples all the way up to 0b1000110.00000000000000000, or exactly 70.0.
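The multiplication and the rounding step can be reproduced as a sketch, under the assumption that computing the product in double precision and then rounding once to single precision matches a true single-precision multiply. That holds here because the 29-bit exact product fits in a double without any rounding. `to_f32` is again an illustrative helper.

```python
import struct
from fractions import Fraction


def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


x = to_f32(0.7)

# Exact product as a rational number: no rounding happens here.
exact = Fraction(x) * 100
numerator = exact.numerator            # odd, so every bit is significant
print(exact.denominator == 2**22)      # True: 22 bits after the binary point
print(numerator.bit_length())          # 29 significant bits, too many for 24
print(bin(numerator))                  # integer and fraction bits run together

# The product is exact in a double, so rounding it once to single
# precision gives the correctly rounded single-precision result.
product = to_f32(x * 100.0)
print(product == 70.0)                 # True
```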

    Note that, although single-precision floating point has 24 bits of significance, a calculation on single-precision values must temporarily carry more than 24 bits. Only then can the bits beyond the 24th be computed accurately, which is what allows the result to be correctly rounded, as required.

    And this is a nice example of how IEEE-754's rules about properly-rounded results work, and work well. It's easy to get the (mistaken) impression that floating-point values are always at least a little bit off, if not downright broken. But, in fact, IEEE-754 floating-point arithmetic is usually quite precise: each basic operation introduces at most half a unit in the last place of error, and, not infrequently, the errors cancel each other out, yielding exact results after all. That's basically what happens here: 0.7 was rounded slightly down when it was stored, and the product was rounded slightly up.

    Or, in other words, the rule is not that floating-point calculations are always imprecise. The real rule is that floating-point calculations are sometimes imprecise — but they're also, sometimes, perfectly precise.

    (In fact, an even better rule is that floating-point calculations are often imprecise as compared to an expected, but decimal, result. If you take two binary floating-point numbers and do an operation on them, you just about always get a result that's really precise — in binary. My point is that most of the apparent inaccuracies occur only when you compare the binary result to one which you computed, some other way, in decimal.)
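A small sketch of that distinction, using Python's built-in floats (which are doubles, not single precision, so this is an analogy rather than the same 0.7f case). It assumes Python 3.9 or later for `math.ulp`.

```python
import math
from fractions import Fraction

a, b = 0.1, 0.2
total = a + b

# Compared with the decimal expectation, the sum looks wrong...
print(total == 0.3)                              # False

# ...but compared with the exact sum of the two binary operands, it is
# within half a unit in the last place, i.e. correctly rounded.
exact = Fraction(a) + Fraction(b)
error = abs(Fraction(total) - exact)
print(error <= Fraction(math.ulp(total)) / 2)    # True

# Operands that are exactly representable in binary give exact results.
print(0.5 + 0.25 == 0.75)                        # True
```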


    If the above explanation doesn't work for you, here's another way of looking at it. As we know, floating point representations can't represent every number. Indeed, one of the numbers they can't represent in binary is 0.7, and the closest representable number in single precision is a binary number that's equal to 0.699999988079071044921875.

    Now, what if we take that number 0.699999988079071044921875 and multiply it by 100? The exact result should be 69.9999988079071044921875. But this is another number that can't be represented exactly. If you take this nonrepresentable number 69.9999988079071044921875, and ask what the nearest number that can be represented exactly in single precision would be, that number is... 70.0! In single precision, the next representable number less than that is 69.99999237060546875, which is farther away (in this case more than 5 times farther away) than 70.0 is.
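The two candidate neighbours can be compared numerically. This is a sketch: `to_f32` and `f32_below` are helper names made up for this example, and `f32_below` only handles positive, normal single-precision values.

```python
import struct
from decimal import Decimal
from fractions import Fraction


def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def f32_below(x: float) -> float:
    """Next single-precision value below a positive, normal float32 x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits - 1))[0]


exact = Fraction(to_f32(0.7)) * 100      # 69.9999988079071044921875
upper = 70.0
lower = f32_below(70.0)
print(Decimal(lower))                    # 69.99999237060546875

gap_up = Fraction(upper) - exact         # distance to 70.0
gap_down = exact - Fraction(lower)       # distance to the value below
print(float(gap_down / gap_up))          # 5.4
```

Since 70.0 is the closer neighbour by a factor of 5.4, round-to-nearest has to pick it.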


    You also asked about fixed-point libraries. There, the answer would obviously depend on the implementation, but more importantly, on the base. A binary fixed-point library can't represent 0.7 exactly, either. But a decimal fixed-point library obviously could.

    An "8.8" binary fixed-point representation of 0.7 would, I think, be 0000000010110011 (base 2), or 00b3 (base 16), or 179 (base 10), which converted back to a fraction is 0.69921875. Multiplying that by 100 gives 100010111101100 (base 2), or 45ec (base 16), or 17900 (base 10), which converts back to 69.921875. So under those assumptions, I don't see a way to do better (that is, a way to get 70.0).
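Here is the same 8.8 arithmetic as a sketch. The names `FRAC_BITS`, `ONE`, `to_fixed` and `to_float` are invented for illustration and do not come from any particular fixed-point library. The last two lines are an added assumption about how a library might convert to an integer: truncation gives 69, while adding half a unit before shifting (round to nearest) gives 70, even though the fixed-point product itself is still 69.921875.

```python
FRAC_BITS = 8                 # "8.8": 8 integer bits, 8 fraction bits
ONE = 1 << FRAC_BITS          # raw value that represents 1.0


def to_fixed(value: float) -> int:
    """Encode a non-negative value as raw 8.8, truncating extra bits."""
    return int(value * ONE)


def to_float(raw: int) -> float:
    """Decode a raw 8.8 value back to a float for display."""
    return raw / ONE


raw = to_fixed(0.7)
print(raw, hex(raw), to_float(raw))               # 179 0xb3 0.69921875

product = raw * 100           # fixed-point value times a plain integer
print(product, hex(product), to_float(product))   # 17900 0x45ec 69.921875

# Converting to an integer (hypothetical library behaviours):
print(product >> FRAC_BITS)                  # 69, truncation
print((product + ONE // 2) >> FRAC_BITS)     # 70, round to nearest
```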

    But if a decimal fixed-point library gave you 69 after multiplying, I'd say it's quite badly implemented. A 16-bit fixed-point decimal representation of 0.7 with a scale factor of 100 would be 70, which when multiplied by 100 gives 7000, which when divided by the scale factor again obviously gives 70.0 exactly.
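The decimal case can be sketched the same way, assuming a scale factor of 100 and parsing from a decimal string so that no binary float is involved. `SCALE` and `to_dec_fixed` are hypothetical names for this example only.

```python
from decimal import Decimal

SCALE = 100                   # two decimal digits after the point


def to_dec_fixed(text: str) -> int:
    """Encode a decimal string as a scaled integer (truncating extra digits)."""
    return int(Decimal(text) * SCALE)


raw = to_dec_fixed("0.7")
print(raw)                    # 70, which stands for exactly 0.70

product = raw * 100           # scaled value times a plain integer
print(product)                # 7000, which stands for exactly 70.00

# Dividing by the scale factor leaves no remainder, so nothing is lost.
print(divmod(product, SCALE))  # (70, 0)
```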