I am told that a double
in C++ has a mantissa that is capable of safely and accurately representing [-(253 − 1), 253 − 1] integers.
How is this possible when the mantissa is only 52 bits? Why is it that an int16
is only capable of having a range of [-32,768, +32,767], or [-215, 215-1], when the same thing could be used for int16
to allow twice as many representable numbers?
The format of the double
(64 bits) is as follows:
1 bit: sign
11 bits: exponent
52 bits: mantissa
We only need to look at the positive integers we can represent with the mantissa, since the sign bit takes care of the negative integers for us.
Naively, with 52 bits, we can store unsigned integers from 0 to 2^52 - 1. With a sign bit, this lets us store from -2^52 - 1
to 2^52 - 1
.
However, we have a little trick we can use. We say that the first digit of our integer is always a 1, which gives us an extra bit to work with.
To see why this works, let's dig a little deeper.
Every positive integer will have at least one 1
in its binary representation. So, we shift the mantissa left or right until we get a 1 at the start, using the exponent. An example might help here:
9, represented as an unsigned int: 000...0001001
(dots representing more 0
s).
Written another way: 1.001 * 2^3
. (1.001
being in binary, not decimal.)
And, we'll agree to never use a 0 as the first bit. So even though we could write 9 as 0.1001 * 2^4
or 0.01001 * 2^5
, we won't. We'll agree that when we write the numbers out in this format, we'll always make sure we use the exponent to "shift" bits over until we start with a 1
.
So, the information we need to store to get 9
is as follows:
e: 3
i: 1.001
But if i
always starts with 1
, why bother writing it out every time? Let's just keep the following instead:
e: 3
i: 001
Using precisely this information, we can reconstruct the number as: 1.i * 2^e == 9
.
When we get up to larger numbers, our "i
" will be bigger, maybe up to 52 bits used, but we're actually storing 53
bits of information because of the leading 1
we always have.
Final Note: This is not quite what is stored in the exponent and mantissa of a double, I've simplified things to help explain, but hopefully this helps people understand where the missing bit is from. Also, this does not cover 0
, which gets a special representation (the trick used above will not work for 0
, since the usual representation of 0
doesn't have any 1
s in it).