javascript floating-point ieee-754

What is the minimum number I need to add to get Infinity in a 1-byte floating-point format?


I'm trying to understand the minimum number I need to add to get Infinity through overflow. I've read this answer already, so let me just clarify my understanding here. To simplify, I'll work with a 1-byte floating-point format with 1 sign bit, 4 bits for the exponent, and 3 bits for the mantissa:

0 0000 000

The maximum positive number I can store in it is this:

0 1110 111

which, converted to scientific notation, is:

   1.111 x 2^{7} = 11110000
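
As a sanity check on this encoding, here is a minimal JavaScript sketch that decodes a bit pattern of the 1-byte format. It assumes IEEE-754-style conventions that are not spelled out above: a bias of 7, exponent field 1111 reserved for Infinity/NaN, and exponent field 0000 used for subnormals. The name `decodeMiniFloat` is hypothetical.

```javascript
// Decode an 8-bit pattern laid out as 1 sign bit, 4 exponent bits, 3 mantissa bits.
// Assumptions (IEEE-754-style): bias 7, exponent field 1111 means Infinity/NaN,
// exponent field 0000 means zero or a subnormal.
function decodeMiniFloat(byte) {
  const BIAS = 7;
  const sign = ((byte >> 7) & 1) === 1 ? -1 : 1;
  const expField = (byte >> 3) & 0b1111;
  const mantissa = byte & 0b111;

  if (expField === 0b1111) {
    return mantissa === 0 ? sign * Infinity : NaN;
  }
  if (expField === 0) {
    // Subnormal: no implicit leading 1, exponent fixed at 1 - BIAS.
    return sign * (mantissa / 8) * 2 ** (1 - BIAS);
  }
  // Normal number: implicit leading 1 in front of the 3 mantissa bits.
  return sign * (1 + mantissa / 8) * 2 ** (expField - BIAS);
}

// 0 1110 111 is the largest finite pattern: 1.111 (binary) x 2^7.
console.log(decodeMiniFloat(0b01110111)); // 240, i.e. 11110000 in binary
```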

Is my understanding correct that the minimum number I should add to get Infinity is 00010000:

       11110000
+      00010000
       --------
     1 00000000

As I understand it, anything less than 00010000 will not cause overflow, and the result will be rounded to 11110000. But 00010000 is 0 0000 001 in floating-point format, which is the number 1. So is adding just 1 enough to cause overflow?


Solution

  • The answer is given in the other answer to the question you link to. The smallest value which will round to infinity is:

    c = 2^{7} × ( 2 − ½ × 2^{1−4} ) = 1.9375 × 2^{7} = 1.1111_2 × 2^{7}

    So the smallest value that you can add to get infinity is

    c − fmax = 1.1111_2 × 2^{7} − 1.111_2 × 2^{7} = 0.0001_2 × 2^{7} = 2^{3}

    which, if I understand correctly, would have bit pattern 0 1010 000 in your proposed format.
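
The cutoff can be checked with a small simulation. This is only a sketch, assuming round-to-nearest with ties to even (the IEEE 754 default) and a bias of 7. `roundToMiniFloat` is a hypothetical helper that rounds an exact sum to the nearest value of the 1-byte format.

```javascript
// Round a non-negative number to the nearest value representable in the
// 1-4-3 format, using round-half-to-even and overflowing to Infinity.
// Hypothetical helper for illustration; assumes bias 7 and IEEE-754-style rules.
function roundToMiniFloat(x) {
  const MANTISSA_BITS = 3;
  const MIN_EXP = -6; // exponent of the smallest normal number (1 - bias)
  const MAX_EXP = 7;  // exponent of the largest finite number
  const F_MAX = (2 - 2 ** -MANTISSA_BITS) * 2 ** MAX_EXP; // 240

  if (x === 0 || !Number.isFinite(x)) return x;

  // Find the binade of x; values below the normal range share MIN_EXP's spacing.
  let exp = MIN_EXP;
  while (2 ** (exp + 1) <= x) exp += 1;

  const ulp = 2 ** (exp - MANTISSA_BITS); // spacing between neighbours
  const scaled = x / ulp;
  let n = Math.floor(scaled);
  const rest = scaled - n;
  // Round half to even: ties go to the neighbour with an even last bit.
  if (rest > 0.5 || (rest === 0.5 && n % 2 === 1)) n += 1;

  const result = n * ulp;
  // Anything that rounds past the largest finite value becomes Infinity.
  return result > F_MAX ? Infinity : result;
}

const miniMax = 240; // 0 1110 111
console.log(roundToMiniFloat(miniMax + 7)); // 240: below the halfway point
console.log(roundToMiniFloat(miniMax + 8)); // Infinity: exactly at the cutoff c = 248
```

Under these assumptions, adding 2^{3} = 8 is the first point at which the sum rounds to Infinity; any smaller addend rounds back down to 240.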

    UPDATE: so why is it this particular cutoff?

    Suppose there were another binade above this one. Then the next floating-point number would be

    x = 1.000_2 × 2^{8}

    Note that c is the value that is exactly halfway between x and fmax. In other words, the values which would round up to x are instead rounded to infinity, but the values which would round down to fmax still round to the same value.
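
The same halfway rule can be observed with JavaScript's own numbers, which are IEEE 754 doubles (52 stored mantissa bits, largest exponent 1023). A short sketch, assuming the default round-to-nearest, ties-to-even mode: the spacing between doubles in the top binade is 2^{971}, so half of it, 2^{970}, plays the role that 2^{3} plays in the 1-byte format.

```javascript
// Largest finite double: (2 - 2^-52) * 2^1023.
const doubleMax = Number.MAX_VALUE;

// Half the spacing between adjacent doubles in the top binade (spacing is 2^971).
const halfUlp = 2 ** 970;

// Exactly halfway to the (non-existent) next binade: rounds to Infinity.
console.log(doubleMax + halfUlp === Infinity);    // true

// Below the halfway point: rounds back down to the largest finite double.
console.log(doubleMax + 2 ** 969 === doubleMax);  // true
```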