Tags: math, language-agnostic, floating-point

What types of numbers are representable in binary floating-point?


I've read a lot about floats, but it's all unnecessarily involved. I think I've got it pretty much understood, but there's just one thing I'd like to know for sure:

I know that fractions of the form 1/pow(2,n), with n an integer, can be represented exactly as floating-point numbers. This means that if I add 1/32 to itself 32 million times, I would get exactly 1,000,000.

What about something like 1/(32+16)? It's one over the sum of two powers of two. Does this work? Or is it 1/32+1/16 that works? This is where I'm confused, so if anyone could clarify that for me I would appreciate it.


Solution

  • The rule can be summed up as this:

    • A rational number, written in lowest terms, can be represented exactly in binary if and only if the prime factorization of its denominator contains only 2 (i.e., the denominator is a power of two).

    So 1/(32 + 16) = 1/48 is not representable in binary because its denominator has a factor of 3 (48 = 2^4 * 3). But 1/32 + 1/16 = 3/32 is, because 32 = 2^5.
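    The two cases can be checked directly. This is a minimal sketch, assuming Python's `float` is an IEEE 754 double (true on virtually all platforms); `is_exact` is a hypothetical helper name, not a library function. It converts an exact rational to a float and back, and reports whether anything was lost.

```python
from fractions import Fraction

def is_exact(x: Fraction) -> bool:
    """Return True if x survives a round trip through float unchanged."""
    # float(x) rounds to the nearest double; Fraction(float) recovers
    # that double's value exactly, so any rounding shows up as a mismatch.
    return Fraction(float(x)) == x

print(is_exact(Fraction(1, 32) + Fraction(1, 16)))  # 3/32 -> True
print(is_exact(Fraction(1, 32 + 16)))               # 1/48 -> False
```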

    That said, a floating-point type imposes further restrictions. For example, an IEEE double has only 53 bits of mantissa (significand), so 1/2 + 1/2^500 is not representable.

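    So a sum of powers of two is representable as long as the exponents span no more than 53 consecutive powers (the largest and smallest exponents differ by at most 52) and all of them fall within the exponent range of the type.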
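    The 53-bit limit can be observed with a short experiment. This is a sketch under the same assumption that Python's `float` is an IEEE 754 double; `sum_is_exact` is a hypothetical helper that adds powers of two in floating point and compares the result with the exact rational sum.

```python
from fractions import Fraction

def sum_is_exact(exponents):
    """Add 2**e for each exponent as floats and compare with the exact sum."""
    exact = sum(Fraction(2) ** e for e in exponents)
    approx = sum(2.0 ** e for e in exponents)
    return Fraction(approx) == exact

print(sum_is_exact([-1, -53]))   # spans 53 powers -> True
print(sum_is_exact([-1, -54]))   # spans 54 powers -> False
print(sum_is_exact([-1, -500]))  # 1/2 + 1/2^500   -> False
```

    In the two failing cases the small term is rounded away entirely and the result is just 0.5.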


    To generalize this to other bases:

    • A rational number, written in lowest terms, can be exactly represented in base 10 if the prime factorization of its denominator consists of only 2's and 5's.

    • A rational number X can be exactly represented in base N if the prime factorization of the denominator of X contains only primes found in the factorization of N.
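    The general rule can be turned into a small test. This is a sketch only; `terminates_in_base` is a hypothetical name. It ignores any limit on the number of digits and only answers whether a finite expansion exists in the given base, by stripping from the denominator every prime factor it shares with the base.

```python
from fractions import Fraction
from math import gcd

def terminates_in_base(x: Fraction, base: int) -> bool:
    """True if x has a finite expansion in the given base (base >= 2)."""
    d = x.denominator  # Fraction keeps x in lowest terms
    g = gcd(d, base)
    while g > 1:
        # Remove every factor the denominator shares with the base.
        while d % g == 0:
            d //= g
        g = gcd(d, base)
    # Anything left over is a prime that the base does not contain.
    return d == 1

print(terminates_in_base(Fraction(3, 32), 2))   # True
print(terminates_in_base(Fraction(1, 48), 2))   # False: factor of 3
print(terminates_in_base(Fraction(1, 10), 10))  # True
print(terminates_in_base(Fraction(1, 10), 2))   # False: factor of 5
```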