I'm doing some work with a fairly large amount of (horse racing!) data for a project, calculating rolling sums of values for various different combinations of data - thus I need to streamline it as much as possible.
Essentially I am:
The rolling window calculation below works fine - but takes 8.2s [this is about 1/8 of the total dataset - hence would take 1m 5s]. I am looking for ideas of how to streamline this calculation as I'm looking to do it for a number of different combinations of data, and thus speed is of the essence. Thanks.
import pandas as pd
import time
url = 'https://raw.githubusercontent.com/richsdixon/testdata/main/testdata.csv'
df = pd.read_csv(url, parse_dates=True)
df['RaceDate'] = pd.to_datetime(df['RaceDate'], format='mixed')
df.sort_values(by='RaceDate', inplace=True)
df['HorseRaceCount90d'] = (df.groupby(['Horse','Trainer'], group_keys=False)
.apply(lambda x: x.rolling(window='180D', on='RaceDate', min_periods=1)['Points'].mean()))
Here is a little faster way:
df.merge(df.set_index('RaceDate')
.groupby(['Horse', 'Trainer'])['Points']
.rolling('180D')
.mean()
.rename('HorseRaceCount90d_1'),
right_index=True,
left_on=['Horse', 'Trainer', 'RaceDate'])
Your way:
18.5 s ± 727 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This way:
1.18 s ± 36.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Quick Comparison:
RaceID RaceDate Horse Jockey Trainer Points HorseRaceCount90d HorseRaceCount90d_1
0 678365 2017-01-08 STRADIVARIUS Andrea Atzeni John Gosden 100.00 100.000000 100.000000
289 680610 2017-01-08 CLASSIC PALACE Brian Hughes Dianne Sayer 1.76 1.760000 1.760000
288 680610 2017-01-08 ROMAN NUMERAL Joe Colliver David Thompson 0.00 0.000000 0.000000
287 680610 2017-01-08 COOPER'S FRIEND Conor O'Farrell R Mike Smith 0.00 0.000000 0.000000
286 680610 2017-01-08 GLEANN NA NDOCHAIS Craig Nichol Alistair Whillans 0.00 0.000000 0.000000
... ... ... ... ... ... ... ... ...
96817 702712 2018-12-06 URBAN ICON Tom Marquand Richard Hannon 12.50 12.500000 12.500000
96816 702712 2018-12-06 EVEN KEEL Rob Hornby Jonathan Portman 11.07 11.070000 11.070000
96815 702712 2018-12-06 MOLLY BLAKE Hector Crouch Clive Cox 9.73 9.730000 9.730000
96885 702719 2018-12-06 CELTIC ARTISAN Cam Hardie Rebecca Menzies 2.37 1.046667 1.046667
97076 704968 2018-12-06 REVENGE David Allan Tim Easterby 2.47 1.346667 1.346667
[100008 rows x 8 columns]