I need to rebuild absolute frequencies from relative frequencies knowing the sample size.
This should be easy, but the absolute frequencies and the sample size are numpy.int64, while the relative frequencies are numpy.float64.
I know floating-point decimal values generally do not have an exact binary representation, so some loss of precision is expected. That seems to be what is happening here: the floating-point operations produce unexpected results, and I can't trust the rebuilt absolute frequencies.
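A minimal, standalone illustration of that representation issue (nothing to do with my data, just the classic float64 example):
# 0.1 and 0.2 have no exact binary (float64) representation,
# so their sum is not exactly 0.3.
print(0.1 + 0.2 == 0.3)  # False
print(0.1 + 0.2)         # 0.30000000000000004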
Sample code to replicate the error:
import pandas as pd
import numpy as np

absolutes = np.arange(100000, dtype=np.int64)  # absolute frequencies, numpy.int64
sample_size = absolutes.sum()                  # numpy.int64
relatives = absolutes / sample_size            # relative frequencies, numpy.float64

# Rebuilding absolutes from relatives
rebuilt_float = relatives * sample_size        # numpy.float64
rebuilt_int = rebuilt_float.astype(np.int64)   # cast truncates toward zero

df = pd.DataFrame({'absolutes': absolutes,
                   'relatives': relatives,
                   'rebuilt_float': rebuilt_float,
                   'rebuilt_int': rebuilt_int})
df['check_float'] = df['absolutes'] == df['rebuilt_float']
df['check_int'] = df['absolutes'] == df['rebuilt_int']

print('Failed FLOATS:', len(df[df['check_float'] == False]))
print('Failed INTS:', len(df[df['check_int'] == False]))
print('Sum of FLOATS:', df['rebuilt_float'].sum())
print('Sum of INTS:', df['rebuilt_int'].sum())
Is it possible to solve the problem using numpy without casting every number to a decimal?
If you round the rebuilt values before converting to integers, you get zero failed ints. That is, use
rebuilt_int = np.round(rebuilt_float).astype(np.int64)
The output is then
Failed FLOATS: 11062
Failed INTS: 0
Sum of FLOATS: 4999950000.0
Sum of INTS: 4999950000
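Rounding works here because the rebuilt floats only drift from the exact integers by a tiny amount, far below the 0.5 boundary at which rounding to the nearest integer would pick the wrong value. A quick check of the maximum drift (a sketch reusing the arrays from the question) makes that concrete:
# Hedged sketch: measure how far the rebuilt floats drift from the
# exact integers, using the question's example arrays.
import numpy as np

absolutes = np.arange(100000, dtype=np.int64)
sample_size = absolutes.sum()
rebuilt_float = (absolutes / sample_size) * sample_size

max_drift = np.max(np.abs(rebuilt_float - absolutes))
print('Maximum drift from the true integer:', max_drift)  # far below 0.5

# Since every drift is much smaller than 0.5, rounding to the nearest
# integer reconstructs the original absolute frequencies exactly.
rebuilt_int = np.round(rebuilt_float).astype(np.int64)
print('All values recovered:', np.array_equal(rebuilt_int, absolutes))  # True
Casting with .astype(np.int64), by contrast, truncates toward zero, so any rebuilt value sitting just below its true integer drops to the next lower integer. That is where the failed INTS in the original code come from.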