I have a DataFrame with key and value columns. value is sometimes NA:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key': np.random.randint(0, 1_000_000, 100_000_000),
    'value': np.random.randint(0, 1_000, 100_000_000).astype(float),
})
df.loc[df.value == 0, 'value'] = np.nan
I want to group by key and sum over the value column. If any value is NA for a key, I want the sum for that key to be NA.
The code in this answer takes 35.7 seconds on my machine:
df.groupby('key')['value'].apply(np.array).apply(np.sum)
This is a lot slower than what is theoretically possible. The built-in Pandas SeriesGroupBy.sum takes 6.31 seconds on my machine:
df.groupby('key')['value'].sum()
but it always skips NaNs rather than propagating them (see this GitHub issue).
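For illustration, a small toy frame (not the 100M-row one above) shows the built-in sum silently dropping the NaN instead of propagating it:

import numpy as np
import pandas as pd

toy = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                    'value': [1.0, 2.0, 3.0, np.nan]})

# key 'b' should be NaN under the desired semantics, but the built-in sum returns 3.0
print(toy.groupby('key')['value'].sum())
# key
# a    3.0
# b    3.0
# Name: value, dtype: float64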
What code can I write to get comparable performance to the built-in operator while still handling NaNs?
One workaround could be to replace the NaNs with Inf: any group containing an Inf then sums to Inf, which can be mapped back to NaN afterwards:
df.fillna({'value': np.inf}).groupby('key')['value'].sum().replace(np.inf, np.nan)
A faster alternative is to fill only the value Series, which avoids copying the whole DataFrame:
df['value'].fillna(np.inf).groupby(df['key']).sum().replace(np.inf, np.nan)
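As a quick sanity check on toy data (assumed here purely for illustration, not the original frame), the workaround matches the NA-propagating result of the apply-based version from the question:

import numpy as np
import pandas as pd

toy = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                    'value': [1.0, 2.0, 3.0, np.nan]})

# NA-propagating reference (the slow apply-based version from the question)
expected = toy.groupby('key')['value'].apply(np.array).apply(np.sum)

# Inf workaround
result = toy['value'].fillna(np.inf).groupby(toy['key']).sum().replace(np.inf, np.nan)

print(expected)  # a -> 3.0, b -> NaN
print(result)    # a -> 3.0, b -> NaN

One caveat: the trick assumes the data contains no genuine np.inf values, since those would also end up as NaN in the output.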
Example output:
key
0 45208.0
1 NaN
2 62754.0
3 50001.0
4 51073.0
...
99995 55102.0
99996 43048.0
99997 49497.0
99998 43301.0
99999 NaN
Name: value, Length: 100000, dtype: float64
Timing (on 10M rows):
# original sum
743 ms ± 81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Inf workaround
918 ms ± 70.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Inf workaround (alternative)
773 ms ± 60.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# custom apply with numpy
5.99 s ± 263 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
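These numbers look like IPython %timeit output; a rough sketch of how they could be reproduced follows (the 10M-row setup, the 100,000 distinct keys, and the %timeit calls are assumptions, scaled down from the question):

import numpy as np
import pandas as pd

# Assumed 10M-row sample, analogous to the frame in the question
n = 10_000_000
df = pd.DataFrame({
    'key': np.random.randint(0, 100_000, n),
    'value': np.random.randint(0, 1_000, n).astype(float),
})
df.loc[df.value == 0, 'value'] = np.nan

# In IPython/Jupyter, each candidate can then be timed, e.g.:
# %timeit df.groupby('key')['value'].sum()                                                    # original sum
# %timeit df.fillna({'value': np.inf}).groupby('key')['value'].sum().replace(np.inf, np.nan)  # Inf workaround
# %timeit df['value'].fillna(np.inf).groupby(df['key']).sum().replace(np.inf, np.nan)         # alternative
# %timeit df.groupby('key')['value'].apply(np.array).apply(np.sum)                            # custom apply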