Search code examples
c++floating-pointmantissa

Why does the bit-width of the mantissa of a floating point represent twice as many numbers compared to an int?


I am told that a double in C++ has a mantissa that is capable of safely and accurately representing [-(253 − 1), 253 − 1] integers.

How is this possible when the mantissa is only 52 bits? Why is it that an int16 is only capable of having a range of [-32,768, +32,767], or [-215, 215-1], when the same thing could be used for int16 to allow twice as many representable numbers?


Solution

  • The format of the double (64 bits) is as follows:

    1 bit: sign
    11 bits: exponent
    52 bits: mantissa
    

    We only need to look at the positive integers we can represent with the mantissa, since the sign bit takes care of the negative integers for us.

    Naively, with 52 bits, we can store unsigned integers from 0 to 2^52 - 1. With a sign bit, this lets us store from -2^52 - 1 to 2^52 - 1.

    However, we have a little trick we can use. We say that the first digit of our integer is always a 1, which gives us an extra bit to work with.

    To see why this works, let's dig a little deeper.

    Every positive integer will have at least one 1 in its binary representation. So, we shift the mantissa left or right until we get a 1 at the start, using the exponent. An example might help here:

    9, represented as an unsigned int: 000...0001001 (dots representing more 0s).

    Written another way: 1.001 * 2^3. (1.001 being in binary, not decimal.)

    And, we'll agree to never use a 0 as the first bit. So even though we could write 9 as 0.1001 * 2^4 or 0.01001 * 2^5, we won't. We'll agree that when we write the numbers out in this format, we'll always make sure we use the exponent to "shift" bits over until we start with a 1.

    So, the information we need to store to get 9 is as follows:

    e: 3
    i: 1.001
    

    But if i always starts with 1, why bother writing it out every time? Let's just keep the following instead:

    e: 3
    i: 001
    

    Using precisely this information, we can reconstruct the number as: 1.i * 2^e == 9.

    When we get up to larger numbers, our "i" will be bigger, maybe up to 52 bits used, but we're actually storing 53 bits of information because of the leading 1 we always have.

    Final Note: This is not quite what is stored in the exponent and mantissa of a double, I've simplified things to help explain, but hopefully this helps people understand where the missing bit is from. Also, this does not cover 0, which gets a special representation (the trick used above will not work for 0, since the usual representation of 0 doesn't have any 1s in it).