Tags: python, pandas, numpy, comparison, floating-accuracy

How to fix numpy floating-point operations producing inexact results?


I need to rebuild absolute frequencies from relative frequencies, given the sample size.

This should be easy, but the absolute frequencies and the sample size are numpy.int64, while the relative frequencies are numpy.float64.

I know that most decimal values have no exact binary floating-point representation, so some precision can be lost. That seems to be what happens here: the floating-point operation produces unexpected results, and I can't trust the rebuilt absolute frequencies.

Sample code to replicate the error:

import pandas as pd
import numpy as np

absolutes = np.arange(100000, dtype=np.int64)  # numpy.int64
sample_size = absolutes.sum()                  # numpy.int64
relatives = absolutes / sample_size            # numpy.float64

# Rebuilding absolutes from relatives
rebuilt_float = relatives * sample_size        # numpy.float64
rebuilt_int = rebuilt_float.astype(np.int64)   # numpy.int64

df = pd.DataFrame({'absolutes': absolutes,
                   'relatives': relatives,
                   'rebuilt_float': rebuilt_float,
                   'rebuilt_int': rebuilt_int})

df['check_float'] = df['absolutes'] == df['rebuilt_float']
df['check_int'] = df['absolutes'] == df['rebuilt_int']

print('Failed FLOATS: ', (~df['check_float']).sum())
print('Failed INTS:', (~df['check_int']).sum())
print('Sum of FLOATS:', df['rebuilt_float'].sum())
print('Sum of INTS:', df['rebuilt_int'].sum())

Is it possible to solve the problem using numpy without casting every number to a decimal?


Solution

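  • If you round the rebuilt values before converting to integers, you get zero failed ints. The cast astype(np.int64) truncates toward zero, so a rebuilt value that lands just below a whole number loses a full unit unless it is rounded first. That is, use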

    rebuilt_int = np.round(rebuilt_float).astype(np.int64)
    

    The output is then

    Failed FLOATS:  11062
    Failed INTS: 0
    Sum of FLOATS: 4999950000.0
    Sum of INTS: 4999950000
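
To see why rounding helps, here is a minimal sketch that reuses the setup from the question and compares truncation with rounding. The variable names truncated, rounded and close_enough are only illustrative, and np.rint is used in place of np.round (for this purpose the two should behave the same). It also shows np.isclose as one possible way to check the float column within a tolerance instead of with ==, since exact equality on the floats still fails.

```python
import numpy as np

# Same setup as in the question: integer counts and their relative frequencies.
absolutes = np.arange(100000, dtype=np.int64)
sample_size = absolutes.sum()
relatives = absolutes / sample_size          # float64

rebuilt_float = relatives * sample_size      # float64, may be off by a tiny amount

# astype(np.int64) truncates toward zero, so a value just below a whole
# number loses a full unit.
truncated = rebuilt_float.astype(np.int64)

# np.rint rounds to the nearest integer first, which absorbs the tiny error.
rounded = np.rint(rebuilt_float).astype(np.int64)

# Exact equality on floats is fragile; np.isclose compares within a tolerance
# (default rtol and atol are used here as an assumption, tune them as needed).
close_enough = np.isclose(rebuilt_float, absolutes)

print('Mismatches after truncation:', int((truncated != absolutes).sum()))
print('Mismatches after rounding:  ', int((rounded != absolutes).sum()))
print('Floats outside tolerance:   ', int((~close_enough).sum()))
```

Rounding is safe here because the floating-point error is many orders of magnitude smaller than 0.5. That assumption would no longer hold for counts near or above 2**53, where float64 cannot represent every integer.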