I'm trying to understand what minimum number I need to add to get Infinity because of overflow. I've read this answer already, so let me just clarify my understanding here. To simplify, I'll be working with a 1-byte floating-point format with 4 bits for the exponent and 3 bits for the mantissa:
0 0000 000
The maximum positive number I can store in it is this:
0 1110 111
which, when converted to scientific notation, is:
1.111 x 2^{7} = 11110000
Is my understanding correct that the minimum number I should add to get Infinity is 00010000:
11110000
+ 00010000
--------
1 00000000
As I understand it, anything less than 00010000 will not cause overflow and the result will be rounded to 11110000. But 00010000 is 0 0000 001 in floating point format, and it's the number 1. So is adding just 1 enough to cause overflow?
The answer is given in the other answer to the question you link to. The smallest value which will round to infinity is:
c = 2^{7} × (2 − ½ × 2^{1−4}) = 1.9375 × 2^{7} = 1.1111_2 × 2^{7}
So the smallest value that you can add to get infinity is
c − f_max = 1.1111_2 × 2^{7} − 1.111_2 × 2^{7} = 0.0001_2 × 2^{7} = 2^{3}
which, if I understand correctly, would have bit pattern 0 1010 000 in your proposed format.
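To double-check bit patterns like these, here is a minimal Python sketch of a decoder for the 1-4-3 format. It assumes IEEE-754-style conventions that the question implies but does not spell out: an exponent bias of 7, an all-ones exponent reserved for infinity/NaN, and an all-zeros exponent for subnormals. The name `decode` is just for illustration.

```python
EXP_BITS, MAN_BITS = 4, 3
BIAS = 2 ** (EXP_BITS - 1) - 1  # assumed IEEE-style bias: 7


def decode(pattern: str) -> float:
    """Decode an 'S EEEE MMM' bit pattern of the toy 8-bit format."""
    bits = pattern.replace(" ", "")
    sign = -1.0 if bits[0] == "1" else 1.0
    exp = int(bits[1:1 + EXP_BITS], 2)
    man = int(bits[1 + EXP_BITS:], 2)
    if exp == 2 ** EXP_BITS - 1:
        # All-ones exponent: infinity (zero mantissa) or NaN.
        return sign * float("inf") if man == 0 else float("nan")
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed exponent 1 - BIAS.
        return sign * man / 2 ** MAN_BITS * 2.0 ** (1 - BIAS)
    # Normal: implicit leading 1.
    return sign * (1 + man / 2 ** MAN_BITS) * 2.0 ** (exp - BIAS)


print(decode("0 1110 111"))  # 240.0, i.e. f_max
print(decode("0 1010 000"))  # 8.0, i.e. 2^3
print(decode("0 0000 001"))  # 0.001953125, a subnormal rather than 1
```

Under these assumptions, the number 1 would be `0 0111 000`, while `0 0000 001` is the smallest subnormal.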
UPDATE: so why is it this particular cutoff?
Suppose there were another binade above this one; then the next floating-point number would be
x = 1.000_2 × 2^{8}
Note that c is the value that is exactly halfway between x and f_max. In other words, the values which would round up to x are instead rounded to infinity, but the values which would round down to f_max still round to the same value.
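The cutoff can also be checked numerically. The following is a rough sketch, not a full implementation of the format: it assumes round-to-nearest, ties-to-even, handles only positive values in the normal range, and rounds as if the exponent were unbounded before mapping anything above f_max to infinity. The helper name `round_toy` is made up for this example.

```python
from fractions import Fraction
import math

P = 4                 # significand precision: 3 stored bits + 1 implicit bit
FMAX = Fraction(240)  # 1.111_2 x 2^7


def round_toy(value: Fraction) -> float:
    """Round a positive normal-range value to the toy format (ties to even)."""
    # Find e such that 2^e <= value < 2^(e+1).
    e = value.numerator.bit_length() - value.denominator.bit_length()
    if Fraction(2) ** e > value:
        e -= 1
    ulp = Fraction(2) ** (e - (P - 1))  # spacing of floats in this binade
    # round() on a Fraction rounds halfway cases to the even integer.
    rounded = round(value / ulp) * ulp
    # Anything that would land above f_max overflows to infinity.
    return math.inf if rounded > FMAX else float(rounded)


print(round_toy(FMAX + 8))                   # inf   (exactly c, tie goes to x)
print(round_toy(FMAX + 7))                   # 240.0 (below c, rounds to f_max)
print(round_toy(FMAX + Fraction(799, 100)))  # 240.0 (still just below c)
```

The tie at c itself goes up because the hypothetical x has an even (all-zero) mantissa, whereas f_max has an odd one.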