Let's say I have the following dataframe:
Code | Price |
---|---|
AA1 | 10 |
AA1 | 20 |
BB2 | 30 |
And I want to perform the following operation on it:
df.groupby("Code").aggregate({
    "Price": "sum"
})
I have been experimenting with the new pyarrow dtypes introduced in pandas 2.0. I created 3 copies of the dataframe, each with a different combination of column dtypes, and for each copy I measured the execution time (average of 5 executions) of the operation above.
Code column dtype | Price column dtype | Execution time |
---|---|---|
Object | float64 | 2.94 s |
string[pyarrow] | double[pyarrow] | 49.5 s |
string[pyarrow] | float64 | 1.11 s |
Can anyone explain why applying an aggregate function to a column with the double[pyarrow] dtype is so slow compared to the standard numpy float64 dtype?
https://github.com/pandas-dev/pandas/issues/52070
Looks like groupby for Arrow dtypes isn't implemented natively yet, so there's likely an Arrow -> NumPy conversion happening internally, which would explain the loss of performance.