From Computer Systems: A Programmer's Perspective:
With single-precision floating point, the expression (3.14f+1e10f)-1e10f evaluates to 0.0: the value 3.14 is lost due to rounding. The expression (1e20f*1e20f)*1e-20f evaluates to +∞, while 1e20f*(1e20f*1e-20f) evaluates to 1e20f.
How can I detect loss of precision due to rounding in both floating-point addition and multiplication?
What is the relationship between underflow and the problem I described? Is underflow just a special case of precision loss due to rounding, in which the result is rounded to zero?
While addition and multiplication of real numbers are associative in mathematics, they are not associative when performed on floating-point types such as float, because of their limited precision and range.
So the order of evaluation matters.
Consider the examples. The number 10000000003.14 can't be represented exactly as a 32-bit float, so the result of (3.14f + 1e10f) is equal to 1e10f, which is the closest representable number. Of course, 3.14f + (1e10f - 1e10f) would yield 3.14f instead.
Note that I used the f suffix because, in C, the expression (3.14+1e10)-1e10 involves double literals, so the result would indeed be 3.14 (or, more likely, something like 3.13999).
Something similar happens in the second example, where 1e20f * 1e20f is already beyond the range of float (but not of double), so the result overflows to +∞ and the subsequent multiplication cannot recover it. In the other expression, (1e20f * 1e-20f) is performed first and has a well-defined result (approximately 1), so the subsequent multiplication yields the correct answer.
In practice, there are some precautions you can adopt: double is the best fit for most applications, unless there are other requirements.