Tags: c, floating-point, precision, numerical-methods, ieee-754

What is the minimum number of significant decimal digits in a floating point literal to represent the value as correctly as possible?


For example, using IEEE-754 32-bit binary floating point, let's represent the value of 1 / 3. It cannot be done exactly, but 0x3eaaaaab produces the closest value to 1 / 3. You might want to write the value in decimal and let the compiler convert the decimal literal to a binary floating point number.

0.333333f    -> 0x3eaaaa9f (0.333332986)
0.3333333f   -> 0x3eaaaaaa (0.333333313)
0.33333333f  -> 0x3eaaaaab (0.333333343)
0.333333333f -> 0x3eaaaaab (0.333333343)

You can see that 8 (significant) decimal digits are enough to represent the value as correctly as possible (closest to the actual value).

I tested with π and e (the base of the natural log), and both needed 8 decimal digits for the most correct value.

3.14159f    -> 0x40490fd0 (3.14159012)
3.141593f   -> 0x40490fdc (3.14159298)
3.1415927f  -> 0x40490fdb (3.14159274)
3.14159265f -> 0x40490fdb (3.14159274)

2.71828f    -> 0x402df84d (2.71828008)
2.718282f   -> 0x402df855 (2.71828198)
2.7182818f  -> 0x402df854 (2.71828175)
2.71828183f -> 0x402df854 (2.71828175)

However, √2 appears to need 9 digits.

1.41421f     -> 0x3fb504d5 (1.41420996)
1.414214f    -> 0x3fb504f7 (1.41421402)
1.4142136f   -> 0x3fb504f4 (1.41421366)
1.41421356f  -> 0x3fb504f3 (1.41421354)
1.414213562f -> 0x3fb504f3 (1.41421354)

https://godbolt.org/z/W5vEcs695
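
A minimal, self-contained sketch of how such a table can be produced (it assumes IEEE-754 binary32 float; the show helper is just illustrative naming, not taken from the link above):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    /* Print the bit pattern and the stored value of a float. */
    static void show(float f) {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits); /* safe type-punning */
        printf("0x%08x (%.9g)\n", (unsigned)bits, f);
    }

    int main(void) {
        show(0.333333f);   /* 0x3eaaaa9f */
        show(0.3333333f);  /* 0x3eaaaaaa */
        show(0.33333333f); /* 0x3eaaaaab */
    }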

Looking at these results, it's probably right that a decimal floating-point literal with 9 significant digits is sufficient to produce the most correct 32-bit binary floating point value; in practice, something like 12~15 digits would certainly work if the space for storing the extra digits doesn't matter.

But I'm interested in the math behind it. How can one be sure that 9 digits is enough in this case? What about double or even arbitrary precision? Is there a simple formula to derive the number of digits needed?


The current answers and the links in the comments confirm that 9 digits is enough for most cases, but I've found a counterexample where 9 digits is not enough. In fact, unbounded precision in the decimal format is required for a decimal number to always be correctly converted (rounded to the closest) to some binary floating point format (IEEE-754 binary32 for this discussion).

8388609.499 represented with 9 significant decimal digits is 8388609.50, and that number converted to float has the value 8388610. On the other hand, the number represented with 10 or more digits always preserves the original value, and that number converted to float has the value 8388609.

You can see 8388609.499 needs more than 9 digits to be most accurately converted to float. There are infinitely many such numbers, placed very close to the halfway point between two representable values in the binary float format.
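
That behavior can be reproduced directly from the literals (a small sketch; the printed results assume the default round-to-nearest, ties-to-even conversion):

    #include <stdio.h>

    int main(void) {
        /* 9 significant digits: 8388609.50 is exactly halfway,
           and ties-to-even picks 8388610. */
        printf("%.9g\n", 8388609.50f);  /* 8388610 */

        /* 10 significant digits: 8388609.499 is below the halfway
           point, so it rounds to 8388609. */
        printf("%.9g\n", 8388609.499f); /* 8388609 */
    }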


Solution

  • What about double or even arbitrary precision? Is there a simple formula to derive the number of digits needed?

    From C17 § 5.2.4.2.2 11 FLT_DECIMAL_DIG, DBL_DECIMAL_DIG, LDBL_DECIMAL_DIG

    number of decimal digits, n, such that any floating-point number with p radix b digits can be rounded to a floating-point number with n decimal digits and back again without change to the value,

    pmax · log10(b)         if b is a power of 10
    ⌈1 + pmax · log10(b)⌉   otherwise
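
    As a quick check, this formula can be evaluated against the <float.h> macros (a small sketch; FLT_DECIMAL_DIG and DBL_DECIMAL_DIG are standard since C11):

    #include <stdio.h>
    #include <math.h>
    #include <float.h>

    int main(void) {
        /* b == 2 is not a power of 10, so n = ceil(1 + p * log10(2)). */
        printf("float  (p = 24): %d vs FLT_DECIMAL_DIG = %d\n",
               (int)ceil(1 + 24 * log10(2)), FLT_DECIMAL_DIG); /* 9, 9 */
        printf("double (p = 53): %d vs DBL_DECIMAL_DIG = %d\n",
               (int)ceil(1 + 53 * log10(2)), DBL_DECIMAL_DIG); /* 17, 17 */
    }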


    But I'm interested in the math behind it. How can one be sure that 9 digits is enough in this case?

    Each range of binary floating point like [1.0 ... 2.0), [128.0 ... 256.0), [0.125 ... 0.25) contains 2^(p−1) values uniformly distributed. E.g., with float, p = 24.

    Each range of a decade of decimal text with n significant digits in exponential notation like [1.0 ... 9.999...), [100.0 ... 999.999...), [0.001 ... 0.00999...) contains 9 · 10^(n−1) values uniformly distributed.

    Example: common float:
    When p is 24, with 2^24 combinations, n must be at least 8 to form the 16,777,216 combinations needed to distinctly round-trip float to decimal text to float. As the end-points of the decimal ranges above may land well within that set of 2^24, the larger decimal values are spaced further apart. This necessitates a +1 decimal digit.
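
    Worked out with the numbers: 10^7 = 10,000,000 < 2^24 = 16,777,216 ≤ 10^8, so 8 digits cover the raw count, and the misalignment penalty adds one more, matching n = ⌈1 + 24 · log10 2⌉ = ⌈8.22...⌉ = 9.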

    Example:

    Consider the 2 adjacent float values

    10.000009_5367431640625
    10.000010_49041748046875
    

    Both convert to the 8-significant-digit decimal text "10.000010". 8 is not enough.

    9 is always enough, as we never need more than 167,772,160 (10 × 16,777,216) decimal values to distinguish 16,777,216 float values.
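
    A small sketch of that round trip, letting printf's %g do the float-to-decimal conversion:

    #include <stdio.h>

    int main(void) {
        /* The two adjacent float values from the example above. */
        float a = 10.0000095367431640625f;
        float b = 10.00001049041748046875f;

        /* 8 significant digits: both produce the same text. */
        printf("%.8g  %.8g\n", a, b); /* 10.00001  10.00001 */

        /* 9 significant digits (FLT_DECIMAL_DIG): the texts differ. */
        printf("%.9g  %.9g\n", a, b); /* 10.0000095  10.0000105 */
    }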


    OP also asks about 8388609.499. (Let us only consider float for simplicity.)

    That value is nearly half-way between 2 float values.

    8388609.0f  // Nearest lower float value
    8388609.499 // OP's constant as code
    8388610.0f  // Nearest upper float value
    

    OP reports: "You can see 8388609.499 needs more than 9 digits to be most accurately converted to float."

    And let us review the title "What is the minimum number of significant decimal digits in a floating point literal*1 to represent the value as correctly as possible?"

    This new question part emphasizes that the value in question is the value of the source code 8388609.499 and not the floating point constant it becomes in the emitted code: 8388609.0f.

    If we consider the value to be the value of the floating point constant, only up to 9 significant decimal digits are needed to define the floating point constant 8388609.0f; 8388609.49 as source code is sufficient.

    But getting the closest floating point constant for some arbitrary number written as code can indeed take many digits.

    Consider the typical smallest float, FLT_TRUE_MIN, with the exact decimal value of:

    0.00000000000000000000000000000000000000000000140129846432481707092372958328991613128026194187651577175706828388979108268586060148663818836212158203125
    

    Halfway between that and 0.0 is 0.000..(~39 more zeroes)..0007006..(~100 more digits)..15625.

    If that last digit were 6 or 4, the closest float would be FLT_TRUE_MIN or 0.0f respectively. So now we have a case where 109 significant digits are "needed" to select between 2 possible floats.
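
    These long expansions can be printed by leaning on a libc whose printf produces exact decimal expansions (glibc does; that is a quality-of-implementation assumption, not a C guarantee):

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        /* FLT_TRUE_MIN is 2^-149; its exact decimal expansion
           terminates 149 digits after the decimal point. */
        printf("%.149f\n", FLT_TRUE_MIN);

        /* The halfway point 2^-150 is exactly representable as a
           double, so it can be printed exactly too. */
        printf("%.150f\n", FLT_TRUE_MIN / 2.0);
    }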

    To keep us from going over the cliffs of insanity, IEEE-754 has already addressed this.

    The number of significant decimal digits a translator (compiler) must examine to be compliant with that spec (not necessarily the C spec) is far more limited, even if the extra digits could translate to a different FP value.

    IIRC, it is in effect FLT_DECIMAL_DIG + 3. So for a common float, as few as 9 + 3 significant decimal digits may be examined.

    [Edit]

    Correct rounding is only guaranteed for the number of decimal digits required plus 3 for the largest supported binary format.


    *1 C does not define floating point literal, but it does define floating point constant, so that term is used.