Search code examples
pandasnumpyaggregation

pandas: fast custom aggregation


I have a time-idexed data that must be resampled:

interval = pd.Timedelta(1/8, "s")
resampled_df = df[["T", "N"]].resample(interval).max()

it works very fast, but I need custom aggregation function (extreme) instead of max

def extreme_agg(array_like):
    # return max or min - which absolute value is greater
    return max(array_like.max(), array_like.min(), key=abs)

interval = pd.Timedelta(1/8, "s")
resampled_df = df[["T", "N"]].resample(interval).apply(extreme_agg)

I tried also

resampled_df = df[["T", "N"]].resample(interval).agg(extreme_agg)

But both ways are terribly slow. Do you have any idea how to make it faster?

Or is there a fast equivalent of my extreme_agg?


Solution

  • You can use change this function working with selected minimal and maximal values by DataFrame.xs, also first are aggregate values by min and max:

    np.random.seed(2021)
    
    N = 10000
    df = pd.DataFrame({'T':np.random.randint(100, size=N),
                       'N':np.random.randint(100, size=N)},
                      index=pd.timedelta_range(0, freq='100ms', periods=N)).sub(50)
    # print (df)
        
    
    def npwhere(df):
        interval = pd.Timedelta(1/8, "s")
        resampled_df = df[["T", "N"]].resample(interval).agg(['max','min'])
        amax = resampled_df.xs('max', axis=1, level=1)
        amin = resampled_df.xs('min', axis=1, level=1)
        return pd.DataFrame(np.where(-amin > amax, amin, amax), 
                                    index=resampled_df.index,
                                    columns=['T','N'])
    
    resampled_df = npwhere(df)
    print (resampled_df.head(10))
    
    def extreme_agg(array_like):
        # return max or min - which absolute value is greater
        return max(array_like.max(), array_like.min(), key=abs)
    
    interval = pd.Timedelta(1/8, "s")
    resampled_df1 = df[["T", "N"]].resample(interval).agg(extreme_agg)
    print (resampled_df1.head(10))
    
    print (resampled_df.equals(resampled_df1))
    True
    

    In [206]: %timeit npwhere(df)
    12.4 ms ± 46.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In [207]: %timeit df[["T", "N"]].resample(interval).agg(lambda x: max(x, key = abs))
    306 ms ± 4.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [208]: %timeit df[["T", "N"]].resample(interval).agg(extreme_agg)
    2.29 s ± 14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)