Out of curiosity and wanting to learn more about floating point, I ran the following C code:
#include <stdio.h>
int main() {
    float a = 1.0 + ((float) (1 << 22));
    float b = 1.0 + ((float) (1 << 23));
    float c = 1.0 + ((float) (1 << 24));
    printf("a = %.6f\n", a);
    printf("b = %.6f\n", b);
    printf("c = %.6f", c);
}
The results were:
a = 4194305.000000
b = 8388609.000000
c = 16777216.000000
I'm confused about why I got these results. Can anyone explain how the bit layout of a, b, and c causes each value to be what it is? I'm new to bit shifting and floats, and a clear explanation would be greatly appreciated. Thank you.
(1 << 22)
is an integer value equal to
2^22 = 4194304
Converting it to float with (float) (1 << 22)
gives the same value,
4194304.0
and adding 1.0 then gives the result 4194305.0
The same applies to the second case, 1 << 23: both 8388608.0 and 8388609.0 are exactly representable as floats.
So this is not about "layout of floats" - it's rather about layout of integers and conversion from integer to float.
However, the last case where you use 1 << 24
is a bit interesting (and relates to float format).
(1 << 24) is 16777216
and can be converted to the same float value, i.e.
16777216.0
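But when you do
1.0 + 16777216.0
and store the result in a float, you still get
16777216.0
The reason is the limited precision of floats (i.e. not all numbers can be represented in the float format). On typical IEEE 754 systems a float has a 24-bit significand, so above 2^24 only every second integer is representable. The value 16777217.0 can't be represented in the float format, so it is rounded to 16777216.0. (Strictly speaking, 1.0 is a double constant, so the addition itself is carried out in double and does produce 16777217.0; the rounding happens when that value is assigned to the float variable c.)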
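BTW: There are several rounding modes (see e.g. https://en.wikipedia.org/wiki/Floating-point_arithmetic#Rounding_modes), so when an exact result can't be represented in the float format, you need to know your system's rounding mode to figure out which value will be used instead of the exact result. The usual default is round-to-nearest with ties going to the value whose last significand bit is even, which is why 16777217.0 (exactly halfway between 16777216.0 and 16777218.0) becomes 16777216.0 here.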