Tags: python, pandas, performance, group-by

How can I quickly sum over a Pandas groupby object while handling NaNs?


I have a DataFrame with key and value columns. value is sometimes NA:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key': np.random.randint(0, 1_000_000, 100_000_000),
    'value': np.random.randint(0, 1_000, 100_000_000).astype(float),
})

df.loc[df.value == 0, 'value'] = np.nan

I want to group by key and sum over the value column. If any value is NA for a key, I want the sum to be NA.

The code in this answer takes 35.7 seconds on my machine:

# np.sum over a plain NumPy array propagates NaN, unlike the pandas sum below
df.groupby('key')['value'].apply(np.array).apply(np.sum)

This is a lot slower than what is theoretically possible. The built-in Pandas SeriesGroupBy.sum takes 6.31 seconds on my machine:

df.groupby('key')['value'].sum()

but it skips NaNs instead of propagating them (see this GitHub issue).
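
For instance, on a small toy frame (illustrative, not from the original question), the built-in sum silently drops the NaN:

toy = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                    'value': [1.0, 2.0, 3.0, np.nan]})

toy.groupby('key')['value'].sum()
# key
# a    3.0
# b    3.0    <- the NaN in group 'b' was skipped; I want NaN here
# Name: value, dtype: float64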

What code can I write to get comparable performance to the built-in operator while still handling NaNs?


Solution

  • One workaround is to replace the NaNs with Inf: any group containing an Inf sums to Inf, which can then be converted back to NaN:

    df.fillna({'value': np.inf}).groupby('key')['value'].sum().replace(np.inf, np.nan)
    

    A faster alternative, which avoids copying the whole DataFrame (fillna is applied to the value column only, which is then grouped by the key column):

    df['value'].fillna(np.inf).groupby(df['key']).sum().replace(np.inf, np.nan)
    

    Example output:

    key
    0        45208.0
    1            NaN
    2        62754.0
    3        50001.0
    4        51073.0
              ...   
    99995    55102.0
    99996    43048.0
    99997    49497.0
    99998    43301.0
    99999        NaN
    Name: value, Length: 100000, dtype: float64
    

    Timings (on 10M rows):

    # original sum
    743 ms ± 81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # Inf workaround
    918 ms ± 70.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # Inf workaround (alternative)
    773 ms ± 60.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # custom apply with numpy
    5.99 s ± 263 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
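
    Another possible workaround (a sketch, not part of the original answer and not benchmarked here) keeps the fast built-in sum and then blanks out every group that contains a NaN; unlike the Inf trick, it also behaves correctly if value can legitimately contain ±Inf:

    # Fast built-in sum; it skips NaNs, so fix those groups up afterwards.
    sums = df.groupby('key')['value'].sum()
    # Boolean per key: does the group contain at least one NaN?
    has_nan = df['value'].isna().groupby(df['key']).any()
    # Overwrite the sums of those groups with NaN.
    sums.mask(has_nan)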