Search code examples
pythonpandasrolling-computation

Speeding up a rolling sum calculation?


I'm doing some work with a fairly large amount of (horse racing!) data for a project, calculating rolling sums of values for various different combinations of data - thus I need to streamline it as much as possible.

Essentially I am:

  • calculating the rolling calculation of a points field over time
  • calculating this for various grouped combinations of data [in this case the combination of horse and trainer]
  • looking at the average of the value by group for the last 180 days of data through time

The rolling window calculation below works fine - but takes 8.2s [this is about 1/8 of the total dataset - hence would take 1m 5s]. I am looking for ideas of how to streamline this calculation as I'm looking to do it for a number of different combinations of data, and thus speed is of the essence. Thanks.

import pandas as pd
import time

url = 'https://raw.githubusercontent.com/richsdixon/testdata/main/testdata.csv'
df = pd.read_csv(url, parse_dates=True)

df['RaceDate'] = pd.to_datetime(df['RaceDate'], format='mixed')
df.sort_values(by='RaceDate', inplace=True)
df['HorseRaceCount90d'] = (df.groupby(['Horse','Trainer'], group_keys=False)
                                 .apply(lambda x: x.rolling(window='180D', on='RaceDate', min_periods=1)['Points'].mean()))

Solution

  • Here is a little faster way:

    df.merge(df.set_index('RaceDate')
               .groupby(['Horse', 'Trainer'])['Points']
               .rolling('180D')
               .mean()
               .rename('HorseRaceCount90d_1'), 
             right_index=True, 
             left_on=['Horse', 'Trainer', 'RaceDate'])
    

    Your way:

    18.5 s ± 727 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    This way:

    1.18 s ± 36.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    Quick Comparison:

           RaceID   RaceDate               Horse           Jockey            Trainer  Points  HorseRaceCount90d  HorseRaceCount90d_1
    0      678365 2017-01-08        STRADIVARIUS    Andrea Atzeni        John Gosden  100.00         100.000000           100.000000
    289    680610 2017-01-08      CLASSIC PALACE     Brian Hughes       Dianne Sayer    1.76           1.760000             1.760000
    288    680610 2017-01-08       ROMAN NUMERAL     Joe Colliver     David Thompson    0.00           0.000000             0.000000
    287    680610 2017-01-08     COOPER'S FRIEND  Conor O'Farrell       R Mike Smith    0.00           0.000000             0.000000
    286    680610 2017-01-08  GLEANN NA NDOCHAIS     Craig Nichol  Alistair Whillans    0.00           0.000000             0.000000
    ...       ...        ...                 ...              ...                ...     ...                ...                  ...
    96817  702712 2018-12-06          URBAN ICON     Tom Marquand     Richard Hannon   12.50          12.500000            12.500000
    96816  702712 2018-12-06           EVEN KEEL       Rob Hornby   Jonathan Portman   11.07          11.070000            11.070000
    96815  702712 2018-12-06         MOLLY BLAKE    Hector Crouch          Clive Cox    9.73           9.730000             9.730000
    96885  702719 2018-12-06      CELTIC ARTISAN       Cam Hardie    Rebecca Menzies    2.37           1.046667             1.046667
    97076  704968 2018-12-06             REVENGE      David Allan       Tim Easterby    2.47           1.346667             1.346667
    
    [100008 rows x 8 columns]