Why do Pandas sum() and Python's sum() on a list of floating-point numbers yield slightly different results, which leads to a difference when rounding the result?
>>> import pandas as pd
>>> from decimal import Decimal
>>> numbers = [0.495,1.495,2.495,3.495,4.495,5.495,6.495, 7.495,8.495, 9.495, 10.495]
>>> Decimal(sum(numbers))
Decimal('60.44500000000000028421709430404007434844970703125')
>>> round(Decimal(sum(numbers)),2)
Decimal('60.45')
>>> Decimal(float(pd.DataFrame(numbers).sum()))
Decimal('60.44499999999999317878973670303821563720703125')
>>> round(Decimal(float(pd.DataFrame(numbers).sum())),2)
Decimal('60.44')
So despite using the same round() function, the slight difference in the sum() of the numbers between Pandas and Python is enough to yield a different result.
I also noticed that Pandas yields a different result if the order of the numbers is reversed, unlike the standard sum() in Python:
>>> Decimal(sum(reversed(numbers)))
Decimal('60.44500000000000028421709430404007434844970703125')  # the same as unreversed
>>> Decimal(float(pd.DataFrame(reversed(numbers)).sum()))
Decimal('60.44499999999998607336237910203635692596435546875')  # different from unreversed
The difference between the sums of the reversed and the unreversed list is tiny. But I thought up to now that floating-point addition should be commutative. That doesn't seem to be the case for Pandas.
So why does Pandas sum() yield different results than Python's sum() for floating-point numbers? Why does it yield a different result when the numbers are simply reversed? Is that a bug, or is it a feature of floating-point addition with Pandas? (Or is it related to my underlying hardware? I'm using Python 3.12 with Pandas 2.2.2 on macOS 14.1.1 with an Apple M3 Pro chip.)
My guess:
Pandas internally uses functions and methods from the NumPy library, so its sum still relies on numpy.sum.
The documentation of the Pandas sum methods always carries this little remark:
This is equivalent to the method numpy.sum
In the numpy.sum documentation you can find a little note on floating-point numbers, which hopefully clarifies the discrepancy between the results of the two sum methods:
For floating point numbers the numerical precision of sum (and np.add.reduce) is in general limited by directly adding each number individually to the result causing rounding errors in every step. However, often numpy will use a numerically better approach (partial pairwise summation) leading to improved precision in many use-cases. This improved precision is always provided when no axis is given. When axis is given, it will depend on which axis is summed. Technically, to provide the best speed possible, the improved precision is only used when the summation is along the fast axis in memory. Note that the exact precision may vary depending on other parameters. In contrast to NumPy, Python’s math.fsum function uses a slower but more precise approach to summation. Especially when summing a large number of lower precision floating point numbers, such as float32, numerical errors can become significant. In such cases it can be advisable to use dtype="float64" to use a higher precision for the output.
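To make the note above a bit more concrete, here is a minimal pure-Python sketch of the two strategies it mentions: adding each number one by one, and pairwise summation. The names sequential_sum and pairwise_sum are made up for illustration. NumPy's real implementation is written in C and differs in detail, so this will not necessarily reproduce the exact digits Pandas returns. The loop is written by hand because, as far as I can tell from the documentation, the built-in sum() switched to a more accurate algorithm for floats in Python 3.12, which may be why it gives the same result for the reversed list.

```python
import math


def sequential_sum(values):
    """Add the values one by one, left to right, rounding after every step."""
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_sum(values):
    """Split the list in halves, sum each half recursively, then add the halves."""
    n = len(values)
    if n == 0:
        return 0.0
    if n == 1:
        return float(values[0])
    mid = n // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])


numbers = [0.495, 1.495, 2.495, 3.495, 4.495, 5.495,
           6.495, 7.495, 8.495, 9.495, 10.495]

# Floating-point addition is commutative (a + b == b + a) but NOT associative:
# grouping the same numbers differently can change the rounded result.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

# Compare both strategies on the original and the reversed list,
# with math.fsum as a reference.
for name, func in [("sequential", sequential_sum),
                   ("pairwise", pairwise_sum),
                   ("math.fsum", math.fsum)]:
    print(f"{name:10s} {func(numbers)!r}  reversed: {func(numbers[::-1])!r}")
```

Reversing the list changes which partial sums get rounded along the way, so any strategy that rounds after each addition can give a slightly different last digit for the reversed input.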
Let us know whether Python's math.fsum gives a more precise result than Pandas.
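One way to check this, as a rough sketch: compute the exact sum of the stored binary values with fractions.Fraction and measure how far each candidate is from it. The two Pandas values below are copied from the output posted in the question, not recomputed here; the names exact, candidates and abs_error are only illustrative.

```python
import math
from fractions import Fraction

numbers = [0.495, 1.495, 2.495, 3.495, 4.495, 5.495,
           6.495, 7.495, 8.495, 9.495, 10.495]

# Exact rational sum of the doubles actually stored in the list
# (Fraction(x) converts a float without any rounding).
exact = sum(Fraction(x) for x in numbers)

candidates = {
    "math.fsum": math.fsum(numbers),
    "built-in sum": sum(numbers),
    # Values reported in the question (Pandas 2.2.2), pasted as literals.
    "Pandas (from question)": 60.44499999999999317878973670303821563720703125,
    "Pandas reversed (from question)": 60.44499999999998607336237910203635692596435546875,
}


def abs_error(value):
    """Exact distance between a float result and the exact sum."""
    return abs(Fraction(value) - exact)


for name, value in candidates.items():
    print(f"{name:32s} {value!r:22s} error = {float(abs_error(value)):.3e}")
```

On your own machine you can replace the pasted Pandas values with float(pd.DataFrame(numbers).sum().iloc[0]) to compare against what your installation actually returns.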