Search code examples
pythonpandasperformancerolling-computation

How to calculate mulitple rolling windows on one column in pandas quickly


I am currently trying to calculate the rolling average of one column in my pandas dataframe over many rolling periods. My dataframe has one column of interest where I wish to calculate a rolling average from 2-40 periods and have the same dataframe and indexes know these values. This has proven a bit to slow as my dataframe has ~6,000,000 rows these windows are calculated over.

I've provided some sample data at the bottom that can be copy/pasted into the pd.DataFrame method. That is what my variable "df" stores

Current solution

df = pd.DataFrame(*dictionary at thebottom*)
for i in range(2, 41):
    df[f'roll_{i}'] = df['col1'].rolling(i).mean()

Other methods:

I've tried giving .mean the engine='pyarrow' parameter, but that doesn't seem to do much. Can someone help point me to speeding this calculation up?

The Data

{Timestamp('2022-10-18 16:10:00'): 18.1065,
  Timestamp('2022-10-18 16:11:00'): 18.120449999999998,
 Timestamp('2022-10-18 16:12:00'): 18.1293,
 Timestamp('2022-10-18 16:13:00'): 18.13035,
 Timestamp('2022-10-18 16:14:00'): 18.1245,
 Timestamp('2022-10-18 16:15:00'): 18.1049,
 Timestamp('2022-10-18 16:16:00'): 18.1014,
 Timestamp('2022-10-18 16:17:00'): 18.103499999999997,
 Timestamp('2022-10-18 16:18:00'): 18.09375,
 Timestamp('2022-10-18 16:19:00'): 18.0906,
 Timestamp('2022-10-18 16:20:00'): 18.092699999999997,
 Timestamp('2022-10-18 16:21:00'): 18.0855,
 Timestamp('2022-10-18 16:22:00'): 18.055349999999997,
 Timestamp('2022-10-18 16:23:00'): 18.05745,
 Timestamp('2022-10-18 16:24:00'): 18.06645,
 Timestamp('2022-10-18 16:25:00'): 18.06945,
 Timestamp('2022-10-18 16:26:00'): 18.06465,
 Timestamp('2022-10-18 16:27:00'): 18.062549999999998,
 Timestamp('2022-10-18 16:28:00'): 18.06645,
 Timestamp('2022-10-18 16:29:00'): 18.060449999999996,
 Timestamp('2022-10-18 16:30:00'): 18.042675,
 Timestamp('2022-10-18 16:31:00'): 18.046349999999997,
 Timestamp('2022-10-18 16:32:00'): 18.0456,
 Timestamp('2022-10-18 16:33:00'): 18.0444,
 Timestamp('2022-10-18 16:34:00'): 18.039150000000003,
 Timestamp('2022-10-18 16:35:00'): 18.040200000000002,
 Timestamp('2022-10-18 16:36:00'): 18.039675000000003,
 Timestamp('2022-10-18 16:37:00'): 18.0423,
 Timestamp('2022-10-18 16:38:00'): 18.044249999999998,
 Timestamp('2022-10-18 16:39:00'): 18.044249999999998,
 Timestamp('2022-10-18 16:40:00'): 18.04035,
 Timestamp('2022-10-18 16:41:00'): 18.0414,
 Timestamp('2022-10-18 16:42:00'): 18.040499999999998,
 Timestamp('2022-10-18 16:43:00'): 18.037349999999996,
 Timestamp('2022-10-18 16:44:00'): 18.0213,
 Timestamp('2022-10-18 16:45:00'): 18.01455,
 Timestamp('2022-10-18 16:46:00'): 18.031200000000002,
 Timestamp('2022-10-18 16:47:00'): 18.03225,
 Timestamp('2022-10-18 16:48:00'): 18.02055,
 Timestamp('2022-10-18 16:49:00'): 18.001875000000002,
 Timestamp('2022-10-18 16:50:00'): 18.01735,
 Timestamp('2022-10-18 16:51:00'): 18.02295,
 Timestamp('2022-10-18 16:52:00'): 18.024,
 Timestamp('2022-10-18 16:53:00'): 18.028200000000002,
 Timestamp('2022-10-18 16:54:00'): 18.02295,
 Timestamp('2022-10-18 16:55:00'): 18.02505,
 Timestamp('2022-10-18 16:56:00'): 18.0219,
 Timestamp('2022-10-18 16:57:00'): 18.0177,
 Timestamp('2022-10-18 16:58:00'): 18.03225,
 Timestamp('2022-10-18 16:59:00'): 18.0375}
 Timestamp('2022-10-18 16:57:00'): 18.0177,
 Timestamp('2022-10-18 16:58:00'): 18.03225,
 Timestamp('2022-10-18 16:59:00'): 18.0375}

Solution

  • I got 4-5x speed improvement using bottleneck package

    import bottleneck
    
    @profile
    def main():
        df = pd.DataFrame(pd.Series(data), columns=["col1"])
        df = df.loc[df.index.repeat(10000)]
        df2 = df.copy(deep=True)
        fasterMethod(df2)
        standardMethod(df)
    
    
    def standardMethod(df):
        for i in range(2, 41):
            df[f'roll_{i}'] = df['col1'].rolling(i).mean()
    
    
    def fasterMethod(df):
        values = df['col1'].values
        for i in range(2, 41):
            df[f'roll_{i}'] = bottleneck.move_mean(values, i)
    

    enter image description here

    There are differences between results at numerical precision level (10^-14 or so).

    In case someone wanted to improve on my answer, here is full code:

    import pandas as pd
    from line_profiler_pycharm import profile
    import bottleneck
    
    
    data = {
        pd.Timestamp('2022-10-18 16:10:00'): 18.1065,
        pd.Timestamp('2022-10-18 16:11:00'): 18.120449999999998,
        pd.Timestamp('2022-10-18 16:12:00'): 18.1293,
        pd.Timestamp('2022-10-18 16:13:00'): 18.13035,
        pd.Timestamp('2022-10-18 16:14:00'): 18.1245,
        pd.Timestamp('2022-10-18 16:15:00'): 18.1049,
        pd.Timestamp('2022-10-18 16:16:00'): 18.1014,
        pd.Timestamp('2022-10-18 16:17:00'): 18.103499999999997,
        pd.Timestamp('2022-10-18 16:18:00'): 18.09375,
        pd.Timestamp('2022-10-18 16:19:00'): 18.0906,
        pd.Timestamp('2022-10-18 16:20:00'): 18.092699999999997,
        pd.Timestamp('2022-10-18 16:21:00'): 18.0855,
        pd.Timestamp('2022-10-18 16:22:00'): 18.055349999999997,
        pd.Timestamp('2022-10-18 16:23:00'): 18.05745,
        pd.Timestamp('2022-10-18 16:24:00'): 18.06645,
        pd.Timestamp('2022-10-18 16:25:00'): 18.06945,
        pd.Timestamp('2022-10-18 16:26:00'): 18.06465,
        pd.Timestamp('2022-10-18 16:27:00'): 18.062549999999998,
        pd.Timestamp('2022-10-18 16:28:00'): 18.06645,
        pd.Timestamp('2022-10-18 16:29:00'): 18.060449999999996,
        pd.Timestamp('2022-10-18 16:30:00'): 18.042675,
        pd.Timestamp('2022-10-18 16:31:00'): 18.046349999999997,
        pd.Timestamp('2022-10-18 16:32:00'): 18.0456,
        pd.Timestamp('2022-10-18 16:33:00'): 18.0444,
        pd.Timestamp('2022-10-18 16:34:00'): 18.039150000000003,
        pd.Timestamp('2022-10-18 16:35:00'): 18.040200000000002,
        pd.Timestamp('2022-10-18 16:36:00'): 18.039675000000003,
        pd.Timestamp('2022-10-18 16:37:00'): 18.0423,
        pd.Timestamp('2022-10-18 16:38:00'): 18.044249999999998,
        pd.Timestamp('2022-10-18 16:39:00'): 18.044249999999998,
        pd.Timestamp('2022-10-18 16:40:00'): 18.04035,
        pd.Timestamp('2022-10-18 16:41:00'): 18.0414,
        pd.Timestamp('2022-10-18 16:42:00'): 18.040499999999998,
        pd.Timestamp('2022-10-18 16:43:00'): 18.037349999999996,
        pd.Timestamp('2022-10-18 16:44:00'): 18.0213,
        pd.Timestamp('2022-10-18 16:45:00'): 18.01455,
        pd.Timestamp('2022-10-18 16:46:00'): 18.031200000000002,
        pd.Timestamp('2022-10-18 16:47:00'): 18.03225,
        pd.Timestamp('2022-10-18 16:48:00'): 18.02055,
        pd.Timestamp('2022-10-18 16:49:00'): 18.001875000000002,
        pd.Timestamp('2022-10-18 16:50:00'): 18.01735,
        pd.Timestamp('2022-10-18 16:51:00'): 18.02295,
        pd.Timestamp('2022-10-18 16:52:00'): 18.024,
        pd.Timestamp('2022-10-18 16:53:00'): 18.028200000000002,
        pd.Timestamp('2022-10-18 16:54:00'): 18.02295,
        pd.Timestamp('2022-10-18 16:55:00'): 18.02505,
        pd.Timestamp('2022-10-18 16:56:00'): 18.0219,
        pd.Timestamp('2022-10-18 16:57:00'): 18.0177,
        pd.Timestamp('2022-10-18 16:58:00'): 18.03225,
        pd.Timestamp('2022-10-18 16:59:00'): 18.0375}
    
    
    @profile
    def main():
        df = pd.DataFrame(pd.Series(data), columns=["col1"])
        df = df.loc[df.index.repeat(10000)]
        df2 = df.copy(deep=True)
        fasterMethod(df2)
        standardMethod(df)
    
    
    def standardMethod(df):
        for i in range(2, 41):
            df[f'roll_{i}'] = df['col1'].rolling(i).mean()
    
    
    def fasterMethod(df):
        values = df['col1'].values
        for i in range(2, 41):
            df[f'roll_{i}'] = bottleneck.move_mean(values, i)
    
    
    if __name__ == '__main__':
        main()