I'm trying to understand how a number with an infinite binary expansion, for example 0.1, can be represented in scientific notation and eventually as a floating-point value. There are many examples on the web that explain how a number in scientific notation is encoded in binary floating-point form, but my question is specifically about putting an infinitely repeating binary number into scientific notation. What I don't understand is this: since 0.1 can't be represented finitely in binary, what determines where the truncation should happen?
For example, if we truncate the infinite representation of 0.1 to 62 fractional bits:
0.00011001100110011001100110011001100110011001100110011001100110
the scientific form will be:
1.1001100110011001100110011001100110011001100110011001100110 x 2^-4
So from here, if we want to represent the number as a 64-bit double-precision floating-point value, we can calculate the biased exponent as -4 + 1023 = 1019
and represent the number as:
0 01111111011 1001100110011001100110011001100110011001100110011001
*When converting from the scientific form, I truncated the mantissa to 52 bits.
In the example above, I decided to truncate to 62 bits, but I could truncate to fewer or more bits - how is that decided?
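For context, here is a minimal sketch of how I'm producing the truncated binary expansion above: repeatedly double the fraction and peel off the integer part as the next bit (the helper name `binary_digits` is just for illustration). Using `Fraction` keeps the arithmetic exact:

```python
from fractions import Fraction

def binary_digits(x, n):
    """Return the first n binary fraction digits of x (truncated, no rounding)."""
    digits = []
    for _ in range(n):
        x *= 2
        digits.append(int(x))  # the integer part is the next bit
        x -= int(x)            # keep only the fractional part
    return ''.join(map(str, digits))

# 62 fractional bits of 0.1, matching the expansion shown above
print(binary_digits(Fraction(1, 10), 62))
```

This makes the repeating pattern (0011 after the initial 000) easy to see, and the truncation point is simply whatever `n` I pass in, which is exactly the arbitrariness I'm asking about.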
First of all, you should be rounding, not truncating. You round (to "nearest even") to the number of significand bits in your floating-point format. Double precision has 53 significant bits (52 stored, plus the implicit leading 1), so round to 53 bits. For 0.1 you get
1.100110011001100110011001100110011001100110011001101 * 2^-4
In IEEE format, that is
0 01111111011 1001100110011001100110011001100110011001100110011010
(Values courtesy of my decimal to floating-point converter.)
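You can also verify this bit pattern directly, assuming a platform where Python's `float` is an IEEE 754 double (true on all common platforms). A short sketch using the standard `struct` module reinterprets 0.1 as its raw 64 bits:

```python
import struct

# Pack 0.1 as a big-endian double, then unpack the same bytes as a 64-bit integer.
bits = struct.unpack('>Q', struct.pack('>d', 0.1))[0]
b = f'{bits:064b}'

sign, exponent, mantissa = b[0], b[1:12], b[12:]
print(sign, exponent, mantissa)

# Removing the bias recovers the exponent from the scientific form:
print(int(exponent, 2) - 1023)  # -4
```

The printed fields match the representation above: sign 0, exponent field 01111111011 (= 1019, i.e. -4 after subtracting the bias of 1023), and a mantissa ending in ...1010, showing that the last bit was rounded up rather than truncated.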