python numpy floating-accuracy

For a given precision, what is the maximum value for which a float32 will give the same result as a float64?


With NumPy, I'm trying to understand what the maximum value is that can be downcast from float64 to float32 with a loss of accuracy less than or equal to 0.001.

Since I could not find a simple explanation online, I quickly came up with this piece of code to test:

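import numpy as np

# For each fractional part j/1000, find the first integer part i at which
# the float32 value differs from the float64 value by more than 0.001.
result = {}
for j in range(1, 1000):
    for i in range(1, 1_000_000):
        num = i + j / 1000
        x = np.array([num], dtype=np.float32)
        y = np.array([num], dtype=np.float64)
        if abs(x[0] - y[0]) > 0.001:
            result[j] = i
            break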

Based on the results, it seems that any positive value below 32768 can be safely downcast from float64 to float32 with an acceptable loss of accuracy (given the criterion of <= 0.001).

Is this correct? Could someone explain the math behind it?

Thanks a lot


Solution

  • Assuming IEEE 754 representation, float32 has a 24-bit significand precision, while float64 has a 53-bit significand precision (except for “denormal” numbers).

    In order to represent a number with an absolute error of at most 0.001, you need at least 9 bits to the right of the binary point, which means the numbers are rounded off to the nearest multiple of 1/512, thus having a maximum representation error of 1/1024 = 0.0009765625 < 0.001.

    With 24 significant bits in total, and 9 to the right of the binary point, that leaves 15 bits to the left of the binary point, which can represent all integers less than 2^15 = 32768, as you have experimentally determined.

    However, there are some numbers higher than this threshold that still have an error less than 0.001. As Eric Postpischil pointed out in his comment, all float64 values between 32768.0 and 32768.001 (the largest being exactly 32768 + 137438953/2^37), which the float32 conversion rounds down to exactly 32768.0, meet your accuracy requirement. And of course, any number that happens to be exactly representable in a float32 will have no representation error.
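The reasoning above can be checked numerically. The following is a minimal sketch, assuming NumPy's `np.spacing` and `np.nextafter` are used to measure the gap between adjacent float32 values on either side of 2^15; the helper `downcast_error` and the sample values are illustrative and not part of the original answer.

```python
import numpy as np

# Largest float32 strictly below 2**15 (hypothetical variable names throughout).
below = np.nextafter(np.float32(32768.0), np.float32(0.0))

# Gap between adjacent float32 values (one ULP) on each side of 2**15.
ulp_below = float(np.spacing(below))               # 9 fractional bits: 1/512
ulp_at = float(np.spacing(np.float32(32768.0)))    # 8 fractional bits: 1/256

# With round-to-nearest, the conversion error is at most half an ULP.
max_err_below = ulp_below / 2   # 1/1024, within the 0.001 tolerance
max_err_at = ulp_at / 2         # 1/512, exceeds the 0.001 tolerance


def downcast_error(value):
    """Absolute error from converting a Python float (float64) to float32."""
    return abs(float(np.float32(value)) - value)


print(max_err_below, max_err_at)
print(downcast_error(32768.0005))  # rounds down to 32768.0, still within 0.001
print(downcast_error(32768.0015))  # also rounds down to 32768.0, error above 0.001
```

The half-ULP bound just below 32768 is 1/1024, which is within the tolerance, while at 32768 it doubles to 1/512. That is why 32768 is the point beyond which the guarantee no longer holds for every value, even though individual values above it can still convert with a small error.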