c++ floating-point rounding

Floating point serialization/parsing rounding


I have the following floating point number stored in memory (float, not double)

0b00111101111001000010000111001000

which, according to floating-point-converter converts to

0.111392557621002197265625

I serialize it, rounding up to the 9th digit, and storing it as the following string:

0.111392558

I understand that if I now parse the string using stof, I should expect some precision loss, meaning I am not guaranteed that the parsed value will equal the original value, that is

0.111392557621002197265625

The fact is that I'm doing some unit testing and I consistently find that my expectations are never met: the original and parsed values are always the same.

Are my expectations right?


Solution

  • Supposing you mean nine significant digits, not just nine digits after the decimal point, then the farthest apart two consecutive nine-significant-digit decimal numerals can be is one part in 10⁸. This occurs when the first digit is 1 and the ninth digit changes by 1.

    The closest together two consecutive 24-digit binary numerals can be is one part in 2²⁴. This occurs when all the digits are 1 and we add 1 at the last digit (carrying the number up to the next power of two).

    Consider, in general, rounding a number x to a numerical format using a round-to-nearest method. Suppose x is between two values representable in the format, a and b. If a is nearer to x than b is, we round x to a. If b is nearer, we round to b. In either case, the rounding error is less than half the distance between a and b, since we chose the nearer value. If x is exactly in the middle, the rounding error is half the distance. So, with round-to-nearest, the rounding error is at most half the distance between representable values. Thus, we have a Rounding Lemma: If x rounds to x', then |x'−x| ≤ ½U, where U is the distance between adjacent representable numbers bracketing x.

    Now consider more specifically rounding a 24-digit binary numeral x to a nine-digit decimal numeral. Let D be the distance between nine-digit decimal numerals near x, and let B be the distance between 24-digit binary numerals near x. From above, we know D < B. Rounding x to a nine-digit decimal numeral yields some number x', and we know from the Rounding Lemma that |x'−x| ≤ ½D. Since D < B, |x'−x| < ½B. This may also be expressed as −½B < x−x' < ½B, which implies x'−½B < x < x'+½B.

    Now let x'' be the number we get by rounding x' back to a 24-digit binary numeral. To satisfy the Rounding Lemma, |x'−x''| ≤ ½B, which means −½B ≤ x''−x' ≤ ½B, which implies x'−½B ≤ x'' ≤ x'+½B.

    Now we have both x'−½B < x < x'+½B and x'−½B ≤ x'' ≤ x'+½B, which means both x and x'' are in the same interval from x'−½B to x'+½B, except that x'' can be on an endpoint of the interval and x cannot. The length of that interval is B, and 24-bit binary numerals are spaced B apart, so one of two things is true: Either there is exactly one 24-bit binary numeral inside the interval or there are two 24-bit binary numerals, one at each endpoint of the interval. The latter is impossible since then there would be no 24-bit binary numeral inside the interval to be x. So there is only one 24-bit binary numeral in the interval, and both x and x'' must be that numeral. Therefore x'' = x.

    Thus, rounding a 24-bit binary numeral x to a nine-digit decimal numeral x' and then back to a 24-bit binary numeral must produce x.

    Converting to an eight-digit decimal numeral is not always sufficient to restore the original 24-digit binary numeral. We can see this in several ways:

    • In the interval [10, 11), there are 10⁶ = 1,000,000 eight-digit decimal numerals (with form 10.dddddd, where each d is any decimal digit), but there are 2²⁰ = 1,048,576 24-digit binary numerals (with form 1010.bbbbbbbbbbbbbbbbbbbb₂, where each b is any binary digit). Therefore, converting these decimal numerals to 24-digit binary numerals can produce only 1,000,000 results, so at least 48,576 of the binary numerals cannot be produced.
    • The spacing between eight-digit decimal numerals can reach one part in 10,000,000, which is not fine enough to distinguish 24-bit binary numerals with spacings of one part in 16,777,216.
    • 134,217,704 and 134,217,696 are both representable in the IEEE-754 binary32 format commonly used for float, and, when converted to an eight-significant-digit decimal numeral, they both round to 134,217,700, so converting back to binary can only produce one of those numbers, so one of them does not survive the round-trip.