With numpy, I'm trying to understand the maximum value that can be downcast from float64 to float32 with a loss of accuracy less than or equal to 0.001.
Since I could not find a simple explanation online, I quickly came up with this piece of code to test:
import numpy as np

result = {}
for j in range(1, 1000):
    for i in range(1, 1_000_000):
        num = i + j / 1000
        x = np.array([num], dtype=np.float32)
        y = np.array([num], dtype=np.float64)
        # record the first integer part at which the error exceeds 0.001
        if abs(x[0] - y[0]) > 0.001:
            result[j] = i
            break
Based on the results, it seems any positive value < 32768 can be safely downcast from float64 to float32 with an acceptable loss of accuracy (given the criterion of <= 0.001).
Is this correct? Could someone explain the math behind it?
Thanks a lot
Assuming IEEE 754 representation, float32 has a 24-bit significand precision, while float64 has a 53-bit significand precision (except for “denormal” numbers).
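For what it's worth, you can read these figures off numpy itself; nmant counts the stored fraction bits, so the full significand is one bit more (the implicit leading 1):

import numpy as np

print(np.finfo(np.float32).nmant + 1)  # 24 significand bits
print(np.finfo(np.float64).nmant + 1)  # 53 significand bits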
In order to represent a number with an absolute error of at most 0.001, you need at least 9 bits to the right of the binary point, which means the numbers are rounded off to the nearest multiple of 1/512, thus having a maximum representation error of 1/1024 = 0.0009765625 < 0.001.
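You can see this granularity directly with np.spacing, which returns the gap between a value and the next representable one (a quick sketch):

import numpy as np

# In [2**14, 2**15) consecutive float32 values are 1/512 apart,
# so round-to-nearest is off by at most half that, i.e. 1/1024.
print(np.spacing(np.float32(32767.0)))  # 0.001953125 == 1/512

# At 2**15 the gap doubles to 1/256, so the worst-case
# rounding error becomes 1/512 > 0.001.
print(np.spacing(np.float32(32768.0)))  # 0.00390625 == 1/256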
With 24 significant bits in total, and 9 to the right of the binary point, that leaves 15 bits to the left of the binary point, which can represent all integers less than 2^15 = 32768, as you have experimentally determined.
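A quick sanity check at the boundary (the probe values 32767.999 and 32768.002 are just illustrative picks on either side of 2^15):

import numpy as np

# Just below 2**15 the round trip stays within 1/1024:
x = np.float64(32767.999)
print(abs(x - np.float32(x)))  # ~0.00095 <= 0.001

# Just above 2**15 the spacing doubles and the error can exceed the tolerance:
y = np.float64(32768.002)
print(abs(y - np.float32(y)))  # ~0.00191 > 0.001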
However, there are some numbers higher than this threshold that still have an error less than 0.001. As Eric Postpischil pointed out in his comment, all float64 values between 32768.0 and 32768.001 (the largest being exactly 32768 + 137438953/2^37), which the float32 conversion rounds down to exactly 32768.0, meet your accuracy requirement. And of course, any number that happens to be exactly representable in a float32 will have no representation error.
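Both exceptions are easy to demonstrate (again, the specific values below are only illustrative):

import numpy as np

# A value slightly above 32768 that float32 rounds back down to 32768.0:
x = np.float64(32768.0005)
print(abs(x - np.float32(x)))  # ~0.0005 < 0.001

# An exactly representable value (40000 = 625 * 2**6 fits in 24 bits):
y = np.float64(40000.0)
print(abs(y - np.float32(y)))  # 0.0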