Tags: c, floating-point, rounding, numerical-methods, underflow

How can I detect loss of precision due to rounding in both floating-point addition and multiplication?


From Computer Systems: A Programmer's Perspective:

With single-precision floating point

  • the expression (3.14f+1e10f)-1e10f evaluates to 0.0: the value 3.14 is lost due to rounding.

  • the expression (1e20f*1e20f)*1e-20f evaluates to +∞, while 1e20f*(1e20f*1e-20f) evaluates to 1e20f.

  • How can I detect loss of precision due to rounding in both floating-point addition and multiplication?

  • What is the relation and difference between underflow and the problem that I described? Is underflow only a special case of loss of precision due to rounding, where a result is rounded to zero?


Solution

  • In mathematics, addition and multiplication of real numbers are associative operations, but they are not associative when performed on floating-point types such as float, because those types have limited precision and limited range.

    So the order matters.

    Considering the examples, the number 10000000003.14 can't be represented exactly as a 32-bit float: near 1e10, adjacent float values are 1024 apart. The result of (3.14f + 1e10f) is therefore 1e10f, the closest representable number, and subtracting 1e10f leaves 0. Of course, 3.14f + (1e10f - 1e10f) would yield 3.14f instead.

    Note that I used the f suffix because in C the expression (3.14+1e10)-1e10 involves double literals, so the result would indeed be close to 3.14 (more precisely, something like 3.1399994, since double also rounds the intermediate sum).

    Something similar happens in the second example, where 1e20f * 1e20f is already beyond the range of float (but not of double), so it overflows to +∞ and the subsequent multiplication cannot recover a finite value. In the other expression, (1e20f * 1e-20f) is performed first and has a well-defined result (approximately 1), so the subsequent multiplication yields the correct answer.
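The effect of the evaluation order can be reproduced with a small sketch like the one below. The function names are made up for illustration, and every intermediate result is stored in a volatile variable on the assumption that this keeps the compiler from folding the expression or carrying extra precision between steps.

```c
/* (a + b) - b evaluated in float: a is lost if it is much smaller than b. */
float add_then_sub(float a, float b)
{
    volatile float sum = a + b;      /* rounded to the nearest float */
    volatile float r = sum - b;
    return r;
}

/* a + (b - b) evaluated in float: the cancellation happens first. */
float sub_then_add(float a, float b)
{
    volatile float diff = b - b;     /* exactly 0 for finite b */
    volatile float r = a + diff;
    return r;
}

/* (a * b) * c: overflows as soon as a * b exceeds the float range. */
float mul_left(float a, float b, float c)
{
    volatile float p = a * b;
    volatile float r = p * c;
    return r;
}

/* a * (b * c): the intermediate product may stay in range. */
float mul_right(float a, float b, float c)
{
    volatile float p = b * c;
    volatile float r = a * p;
    return r;
}

/* Same as add_then_sub, but in double, as with unsuffixed literals. */
double add_then_sub_double(double a, double b)
{
    volatile double sum = a + b;
    volatile double r = sum - b;
    return r;
}
```

With the values from the question, add_then_sub(3.14f, 1e10f) should give 0 while sub_then_add(3.14f, 1e10f) gives 3.14f, and mul_left(1e20f, 1e20f, 1e-20f) should give +∞ while mul_right with the same arguments stays close to 1e20f.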

    In practice, there are some precautions you can adopt:

    • Use a wider type. double is a good fit for most applications, unless there are other requirements.
    • Reorder the operations, if possible. For example, if you have to add many terms and you know that some of them are much smaller than others, add the small ones first, then the larger ones. Avoid subtracting numbers that are nearly equal, because the leading digits cancel and only the rounding errors remain. In general, there may be a more accurate way to evaluate an algebraic expression than the naive one (e.g. Horner's method for polynomial evaluation).
    • If you have some sort of knowledge of the problem domain, you may already know which part of the computation may have problematic values and check if those are greater (or lower) than some limits, before performing the calculation.
    • Check the results as soon as possible. There's no point in continuing a calculation when you already have an infinite value or a NaN, or in continuing to iterate when your target value no longer changes at all.
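
As a rough illustration of checking results early, the sketch below tries to detect the loss directly. It is not the only way to do it (C99's <fenv.h> exception flags such as FE_INEXACT, FE_OVERFLOW and FE_UNDERFLOW are another route, not shown here). The names add_with_error, classify_mul and enum loss are hypothetical. The addition check uses Knuth's TwoSum, which assumes round-to-nearest and no overflow. The multiplication check relies on the fact that the product of two floats is exact in double, and assumes finite inputs.

```c
#include <float.h>
#include <math.h>

enum loss { LOSS_NONE, LOSS_ROUNDED, LOSS_OVERFLOW, LOSS_UNDERFLOW };

/* Returns a + b rounded to float and stores in *err the part of the
   exact sum that was rounded away (TwoSum). *err is 0 when the
   addition was exact. Assumes round-to-nearest and no overflow. */
float add_with_error(float a, float b, float *err)
{
    volatile float s  = a + b;    /* rounded sum                        */
    volatile float bv = s - a;    /* portion of b that ended up in s    */
    volatile float av = s - bv;   /* portion of a that ended up in s    */
    volatile float ea = a - av;   /* what was lost from a               */
    volatile float eb = b - bv;   /* what was lost from b               */
    volatile float e  = ea + eb;
    *err = e;
    return s;
}

/* Classifies what happened to a * b in float by comparing it with the
   same product computed in double. Two 24-bit significands multiply
   into at most 48 bits, which fit in the 53 bits of a double, so the
   double product is exact for finite float inputs. */
enum loss classify_mul(float a, float b)
{
    volatile float stored = a * b;            /* force rounding to float */
    float p = stored;
    double exact = (double)a * (double)b;

    if ((double)p == exact)
        return LOSS_NONE;
    if (isinf(p))
        return LOSS_OVERFLOW;                 /* too large for float     */
    if (exact > -FLT_MIN && exact < FLT_MIN)
        return LOSS_UNDERFLOW;                /* tiny and inexact        */
    return LOSS_ROUNDED;                      /* ordinary rounding       */
}
```

For the addition in the question, add_with_error(3.14f, 1e10f, &err) should return 1e10f and set err to 3.14f, which is exactly the value that disappeared. In this classification, underflow is the case where the exact result is nonzero but smaller in magnitude than the smallest normal float and could not be stored exactly. It is a rounding loss that happens at the bottom of the range rather than in the last digits of the significand.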