Here is the subtraction
First number
Decimal 3.0000002
Hexadecimal 0x40400001
Binary: Sign[0], Exponent[1000_0000], Mantissa[100_0000_0000_0000_0000_0001]
Subtract the second number:
Decimal 3.000000
Hexadecimal 0x40400000
Binary: Sign[0], Exponent[1000_0000], Mantissa[100_0000_0000_0000_0000_0000]
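As a quick sanity check, these bit patterns can be reproduced in code. A minimal sketch, assuming Python and its standard struct module (the helper name f32_bits is just for illustration):

```python
import struct

def f32_bits(x):
    """Return the IEEE 754 single-precision bit pattern of x as an integer."""
    return int.from_bytes(struct.pack('>f', x), 'big')

for v in (3.0000002, 3.0):
    bits = f32_bits(v)
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    print(f"{v}: 0x{bits:08X}  sign={sign}  exponent={exponent:08b}  mantissa={mantissa:023b}")
```

This should print 0x40400001 for 3.0000002 and 0x40400000 for 3.0, with the sign, exponent, and mantissa fields as listed above.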
==========================================
In this situation the exponents are already the same, so we only need to subtract the mantissas. We know that in IEEE 754 there is a hidden bit 1 in front of the mantissa. Therefore, the subtraction of the significands (with the hidden bit written out) is:
Mantissa_1[1100_0000_0000_0000_0000_0001] - Mantissa_2[1100_0000_0000_0000_0000_0000]
which equals
Mantissa_Rst = [0000_0000_0000_0000_0000_0001]
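The same subtraction can be checked with plain integers once the hidden bit is written out explicitly. A minimal sketch (the underscores are just for readability):

```python
# 24-bit significands with the hidden leading 1 made explicit
m1 = 0b1100_0000_0000_0000_0000_0001   # significand of 3.0000002 (0x40400001)
m2 = 0b1100_0000_0000_0000_0000_0000   # significand of 3.0       (0x40400000)

diff = m1 - m2
print(f"{diff:024b}")   # 000000000000000000000001 -> not normalized yet
```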
But this number is not normalized, because the leading (hidden) bit is not 1. So we shift Mantissa_Rst left by 23 bits and decrease the exponent by 23 at the same time.
Then we have the result:
Hexadecimal 0x34800000
Binary: Sign[0], Exponent[0110_1001], Mantissa[000_0000_0000_0000_0000_0000].
32 bits total, no rounding needed.
Notice that there is still an implied hidden 1 in front of the mantissa field.
If my calculations are correct, converting the result to decimal gives 0.00000023841858. Compared with the true result 0.0000002, that still does not look very precise to me.
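To double-check the whole computation, the subtraction can be forced through single precision in code. A minimal sketch, again assuming Python's standard struct module for the float32 round trip:

```python
import struct

def to_f32(x):
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack('>f', struct.pack('>f', x))[0]

a = to_f32(3.0000002)      # stored as 0x40400001
b = to_f32(3.0)            # stored as 0x40400000
diff = to_f32(a - b)       # subtraction result, rounded back to single precision

bits = int.from_bytes(struct.pack('>f', diff), 'big')
print(f"0x{bits:08X}  {diff}")   # expected: 0x34800000  2.384185791015625e-07
```

That matches the normalized result above: exponent 0110_1001 (unbiased -22) and an all-zero mantissa field, i.e. exactly 2^-22 ≈ 0.00000023841858.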
So the question is: are my calculations wrong, or is this a real situation that happens all the time in computers?
The inaccuracy already starts with your input. 3.0000002 is a fraction with a prime factor of five in the denominator, so its "decimal" expansion in base 2 is periodic. No amount of mantissa bits will suffice to represent it exactly. The float you give actually has the value 3.0000002384185791015625 (this is exact). Yes, this happens all the time.
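For the curious, the exact stored value (and the rounding error it introduces) can be printed with Python's decimal module; converting a float to Decimal yields its exact binary value. A minimal sketch:

```python
from decimal import Decimal
import struct

# Round 3.0000002 to the nearest single-precision float, then inspect it exactly.
stored = struct.unpack('>f', struct.pack('>f', 3.0000002))[0]
print(Decimal(stored))                          # 3.0000002384185791015625
print(Decimal(stored) - Decimal('3.0000002'))   # input rounding error, about 3.8E-8
```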
Don't despair, though! Base ten has the same problem (for example 1/3). It isn't a problem. Well, it is for some people, but luckily there are other number types available for their needs. Floating point numbers have many advantages, and slight rounding error is irrelevant for many applications, for example when not even your inputs are perfectly accurate measurements of what you're interested in (a lot of scientific computing and simulation). Also remember that 64-bit floats exist too.

Additionally, the error is bounded: with the best possible rounding, your result will be within 0.5 units in the last place of the infinite-precision result. For a 32-bit float of the magnitude in your example, this is approximately 2^-25, or 3 * 10^-8. This gets worse as you do additional operations that have to round, but with careful numerical analysis and the right algorithms, you can get a lot of mileage out of them.
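As a small illustration of how rounding error grows over many operations, and how a better algorithm helps, one can compare naive summation with math.fsum, which tracks partial sums to avoid the accumulated error (a sketch with arbitrary values, not anything from the question):

```python
import math

values = [0.1] * 1_000_000        # 0.1 is not exactly representable in binary

naive = sum(values)               # rounding error accumulates at every addition
careful = math.fsum(values)       # accurate summation of the same inputs

print(naive)     # slightly off from 100000.0 due to accumulated rounding
print(careful)   # 100000.0
```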