I'm trying to improve the performance of a pandas.groupby.aggregate operation that uses a custom aggregating function. I noticed that - correct me if I'm wrong - pandas calls the aggregating function on each group in sequence (I suspect it is a simple for-loop).
Since pandas is heavily based on numpy, is there a way to speed up the calculation using numpy's vectorization features?
In my code I need to aggregate wind data by averaging samples together. While averaging wind speeds is trivial, averaging wind directions requires more ad-hoc code (e.g. the average of 1 deg and 359 deg is 0 deg, not 180 deg).
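A quick demonstration of the problem (just the arithmetic, not part of my actual code):

import numpy as np

angles = np.array([1.0, 359.0])            # two directions just either side of north

naive = angles.mean()                      # 180.0 -- due south, clearly wrong

# circular mean: sum the unit vectors, then take the angle of the resultant
s = np.sin(np.deg2rad(angles)).sum()
c = np.cos(np.deg2rad(angles)).sum()
circular = np.rad2deg(np.arctan2(s, c)) % 360

print(naive, circular)                     # 180.0 and ~0 (or ~360), i.e. north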
The aggregating function I use is:
import numpy as np

def meandir(x):
    '''
    Parameters
    ----------
    x : pandas.Series
        pandas Series to be averaged

    Returns
    -------
    float
        averaged wind direction
    '''
    # Remove the NaNs from the recording
    x = x.dropna()
    # If the record is empty, return NaN
    if len(x) == 0:
        return np.nan
    # If the record contains variable samples (990) return variable (990)
    elif np.any(x == 990):
        return 990
    # Otherwise sum the unit vectors and return the angle of the resultant
    else:
        angle = np.rad2deg(
            np.arctan2(
                np.sum(np.sin(np.deg2rad(x))),
                np.sum(np.cos(np.deg2rad(x)))
            )
        )
        # Wrap angles from (-180, 180) to (0, 360)
        return (angle + 360) % 360
You can test it with:
from timeit import repeat
import pandas as pd
import numpy as np
N_samples = int(1e4)
N_nan = N_var = int(0.02 * N_samples)
# Generate random data
data = np.random.rand(N_samples,2) * [30, 360]
data[np.random.choice(N_samples, N_nan), 1] = np.nan
data[np.random.choice(N_samples, N_var), 1] = 990
# Create dataset
df = pd.DataFrame(data, columns=['WindSpeed', 'WindDir'])
df.index = pd.date_range(start='2000-01-01 00:00', periods=N_samples, freq='10min')
# Run groupby + aggregate
grouped = df.groupby(pd.Grouper(freq='H'))  # Samples are binned by hour (e.g. 15:00 to 15:59 fall into the 15:00 bin)
aggfuns1 = {'WindSpeed': np.mean, 'WindDir': meandir}
aggfuns2 = {'WindSpeed': np.mean, 'WindDir': np.mean}
res = repeat(stmt='grouped.agg(aggfuns1)', globals=globals(), number=1, repeat=10)
print(f'With custom aggregating function {min(res)*1000:.2f} ms')
res = repeat(stmt='grouped.agg(aggfuns2)', globals=globals(), number=1, repeat=10)
print(f'Without custom aggregating function {min(res)*1000:.2f} ms')
which on my PC for N_samples=1e4 outputs:
With custom aggregating function 1500.79 ms
Without custom aggregating function 2.08 ms
with the custom aggregating function being roughly 750 times slower, and with N_samples=1e6 outputs:
With custom aggregating function 142967.17 ms
Without custom aggregating function 21.92 ms
with the custom aggregating function being 6500 times slower!
Is there a way to speed up this line of code?
The key is to try to vectorize everything you can on the whole df, and let groupby use only built-in methods.
Here is a way to do that. The trick is to convert the angles to complex numbers, which numpy will happily sum (and groupby too, but groupby will refuse to mean() them). So we convert the angles to complex, sum, then convert back to angles. This is the same "funny mean" of angles used in your code and described on the Wikipedia page you reference.
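For example, on the two angles from your question (a tiny illustration of the trick, not part of the final code):

import numpy as np

angles = np.array([1.0, 359.0])
z = np.exp(np.deg2rad(angles) * 1j)        # angles as unit vectors in the complex plane
mean_dir = np.angle(z.sum(), deg=True) % 360
print(mean_dir)                            # effectively 0/360, i.e. north -- not the naive 180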
The handling of the special value (990) can be vectorized too: comparing s.groupby(...).count() with s.replace(val, np.nan).groupby(...).count() finds all the groups that contain at least one such value.
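For instance, a small standalone illustration of that mask, with made-up sample data:

import numpy as np
import pandas as pd

s = pd.Series([10.0, 990.0, 30.0, 40.0, 50.0],
              index=pd.to_datetime(['2000-01-01 00:00', '2000-01-01 00:10',
                                    '2000-01-01 01:00', '2000-01-01 01:10',
                                    '2000-01-01 02:00']))
grouper = pd.Grouper(freq='H')

# count() ignores NaN, so replacing 990 with NaN changes the count
# exactly in the groups that contain at least one 990
mask = s.groupby(grouper).count() != s.replace(990, np.nan).groupby(grouper).count()
print(mask)
# 2000-01-01 00:00:00     True
# 2000-01-01 01:00:00    False
# 2000-01-01 02:00:00    False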
Anyway, here goes:
import numpy as np
import pandas as pd

def to_complex(s):
    # angle in degrees -> unit vector in the complex plane
    return np.exp(np.deg2rad(s) * 1j)

def to_angle(s):
    # complex number -> angle in degrees, wrapped to [0, 360)
    return np.angle(s, deg=True) % 360

def mask_val(s, grouper, val=990):
    # True for the groups that contain at least one `val`
    return s.groupby(grouper).count() != s.replace(val, np.nan).groupby(grouper).count()

def myagg(df, grouper, val=990, winddir='WindDir'):
    s = df[winddir]
    mask = mask_val(s, grouper, val)
    gb = to_complex(s).groupby(grouper)
    s = gb.sum()
    cnt = gb.count()
    s = to_angle(s) * (cnt / cnt)  # cnt/cnt is 1, or NaN where the group has no valid samples
    s[mask] = val
    # other columns
    agg = df.groupby(grouper).mean()
    agg[winddir] = s
    return agg
Application:
For convenience, I put your example generation into a function gen_example(N_samples).
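It can be something along these lines (just your generation code wrapped in a function; this exact definition is an assumption):

def gen_example(N_samples):
    # Same data generation as in the question, wrapped in a function
    N_samples = int(N_samples)
    N_nan = N_var = int(0.02 * N_samples)
    data = np.random.rand(N_samples, 2) * [30, 360]
    data[np.random.choice(N_samples, N_nan), 1] = np.nan
    data[np.random.choice(N_samples, N_var), 1] = 990
    df = pd.DataFrame(data, columns=['WindSpeed', 'WindDir'])
    df.index = pd.date_range(start='2000-01-01 00:00', periods=N_samples, freq='10min')
    return df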
df = gen_example(50)
myagg(df, pd.Grouper(freq='H'))
Out[ ]:
WindSpeed WindDir
2000-01-01 00:00:00 12.991717 354.120464
2000-01-01 01:00:00 15.743056 60.813629
2000-01-01 02:00:00 14.593927 245.487383
2000-01-01 03:00:00 17.836368 131.493675
2000-01-01 04:00:00 18.987296 27.150359
2000-01-01 05:00:00 16.415725 194.923399
2000-01-01 06:00:00 20.881816 990.000000
2000-01-01 07:00:00 15.033480 44.626018
2000-01-01 08:00:00 16.276834 29.252459
Speed:
df = gen_example(10_000)
%timeit myagg(df, pd.Grouper(freq='H'))
Out[ ]:
6.76 ms ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df = gen_example(1e6)
%timeit myagg(df, pd.Grouper(freq='H'))
Out[ ]:
189 ms ± 425 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Testing:
idx = [0] * 4
grouper = pd.Grouper(level=0)
myagg(pd.DataFrame({'WindDir': [170, 170, 178, 182]}, index=idx), grouper)
WindDir
0 174.998473
myagg(pd.DataFrame({'WindDir': [330, 359, 1, 40]}, index=idx), grouper)
WindDir
0 2.251499
myagg(pd.DataFrame({'WindDir': [330, 359, 1, np.nan]}, index=idx), grouper)
WindDir
0 350.102878
myagg(pd.DataFrame({'WindDir': [np.nan, np.nan, np.nan, np.nan]}, index=idx), grouper)
WindDir
0 NaN
myagg(pd.DataFrame({'WindDir': [330, 990, 1, np.nan]}, index=idx), grouper)
WindDir
0 990.0
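As a final sanity check (a rough sketch, assuming meandir from the question and gen_example as above), the vectorized aggregation can be compared against the original one:

df = gen_example(10_000)
grouper = pd.Grouper(freq='H')

slow = df.groupby(grouper).agg({'WindSpeed': 'mean', 'WindDir': meandir})
fast = myagg(df, grouper)

# Wind speeds should match exactly (both are plain means)
assert np.allclose(fast['WindSpeed'], slow['WindSpeed'])

# Compare directions on the circle, so that e.g. 359.999 and 0.001 count as equal;
# groups flagged 990 give a difference of 0, all-NaN groups give NaN in both
diff = (fast['WindDir'] - slow['WindDir'] + 180) % 360 - 180
assert fast['WindDir'].isna().equals(slow['WindDir'].isna())
assert np.allclose(diff.fillna(0), 0, atol=1e-6)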