binary, floating-point, ieee-754

What determines where to truncate an infinitely repeating binary number (0.1) when representing it in scientific notation?


I'm trying to understand how a number with an infinite binary expansion, for example 0.1, can be represented in scientific notation and eventually as a floating-point value. Many examples on the web explain how a number in scientific notation can be converted to binary floating-point form, but my question is specifically about putting an infinite binary expansion into scientific notation. What I don't understand is this: since 0.1 can't be represented finitely in binary, what determines where truncation should happen?

For example, if we truncate the infinite representation of 0.1 to 62 bits:

0.00011001100110011001100110011001100110011001100110011001100110

the scientific form will be:

1.1001100110011001100110011001100110011001100110011001100110 x 2^-4

From here, if we want to represent the number as a 64-bit double-precision floating-point value, we can calculate the biased exponent as -4 + 1023 = 1019 and represent the number as:

0 01111111011 1001100110011001100110011001100110011001100110011001

* When converting from the scientific form, I truncated the mantissa to 52 bits.

In the example above, I decided to truncate to 62 bits. But I could truncate to fewer or more bits. How is this decided?


Solution

  • First of all, you should be rounding, not truncating. You round (to nearest, with ties going to even) to the number of significand bits in your floating-point format. Double precision has 53 significand bits (52 stored plus an implicit leading 1), so round to 53 bits. For 0.1 you get

    1.100110011001100110011001100110011001100110011001101 * 2^-4

    In IEEE 754 double-precision format, that is

    0 01111111011 1001100110011001100110011001100110011001100110011010

    (Values courtesy of my decimal to floating-point converter.)
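
The rounding step described above can be checked with a short Python sketch. It is only an illustration under stated assumptions, not how any particular converter works: the helper names `round_to_53_bits`, `double_fields`, and `hardware_bits` are made up for this example, the input is assumed to be a positive rational in the normal double range, and exact arithmetic comes from the standard `fractions` module.

```python
import struct
from fractions import Fraction


def round_to_53_bits(value):
    """Round a positive Fraction to 53 significant bits, ties to even.

    Returns (significand, exponent) such that the rounded value equals
    significand * 2**(exponent - 52), with 2**52 <= significand < 2**53.
    Hypothetical helper: ignores zero, negatives, subnormals and overflow.
    """
    # Find the exponent e with 2**e <= value < 2**(e + 1).
    e = 0
    while Fraction(2) ** (e + 1) <= value:
        e += 1
    while Fraction(2) ** e > value:
        e -= 1
    # Scale so the integer part holds exactly 53 bits, then round.
    # round() on a Fraction rounds halfway cases to the even integer.
    scaled = value / Fraction(2) ** (e - 52)
    significand = round(scaled)
    if significand == 2 ** 53:  # rounding carried into a new bit
        significand //= 2
        e += 1
    return significand, e


def double_fields(value):
    """Return (sign, exponent, fraction) bit strings for a positive value."""
    significand, e = round_to_53_bits(value)
    exponent_field = format(e + 1023, "011b")           # biased, 11 bits
    fraction_field = format(significand - 2 ** 52, "052b")  # drop implicit 1
    return "0", exponent_field, fraction_field


def hardware_bits(x):
    """Return the 64 bits Python actually stores for the float x."""
    return format(int.from_bytes(struct.pack(">d", x), "big"), "064b")


if __name__ == "__main__":
    sign, exponent, fraction = double_fields(Fraction(1, 10))
    print(sign, exponent, fraction)
    # Compare against the bits of the literal 0.1 as stored by Python.
    print("".join((sign, exponent, fraction)) == hardware_bits(0.1))
```

For `Fraction(1, 10)` the three fields match the sign, exponent, and fraction shown in the answer, and the comparison with the stored bits of the literal `0.1` prints `True`. Using exact rationals sidesteps the question of how many bits of the infinite expansion to write down first: the rounding decision is made on the exact value, not on a truncated prefix.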