Search code examples
pythonpandasdataframerolling-computation

Subtracting a rolling window mean based on value from one column based on another without loops in Pandas


I'm not sure what the word is for what I'm doing, but I can't just use the pandas rolling (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html) function because the window is not a fixed size in terms of database indices. What I'm trying to do this:

I have a dataframe with columns UT (time in hours, but not a datetime object) and WINDS, I want to add a third column that subtracts the mean of all WINDS values that are within 12 hours of the time in the UT column. Currently, I do it like this:

rolsub = []

for i in df['UT']:
    df1 = df[ (df['UT'] > (i-12)) & (df['UT'] < (i+12)) ]
    df2 = df[df['UT'] == i]
    rolsub +=  [float(df2['WINDS'] - df1['WINDS'].mean())]

df['WIND_SUB'] = rolsub

This works fine, but it takes way too long since my dataframe has tens of thousands of entries. There must be a better way to do this, right? Please help!


Solution

  • If I understood correctly, you could create a fake DatetimeIndex to use for rolling.

    Example data:

    import pandas as pd
    
    df = pd.DataFrame({'UT':[0.5, 1, 2, 8, 9, 12, 13, 14, 15, 24, 60, 61, 63, 100],
                       'WINDS':[1, 1, 10, 1, 1, 1, 5, 5, 5, 5, 5, 1, 1, 10]})
    
    print(df)
    
           UT  WINDS
    0     0.5      1
    1     1.0      1
    2     2.0     10
    3     8.0      1
    4     9.0      1
    5    12.0      1
    6    13.0      5
    7    14.0      5
    8    15.0      5
    9    24.0      5
    10   60.0      5
    11   61.0      1
    12   63.0      1
    13  100.0     10
    

    Code:

    # Fake DatetimeIndex.
    df['dt'] = pd.to_datetime('today').normalize() + pd.to_timedelta(df['UT'], unit='h')
    df = df.set_index('dt')
    
    df['WINDS_SUB'] = df['WINDS'] - df['WINDS'].rolling('24h', center=True, closed='neither').mean()
    
    print(df)
    

    Which gives:

                            UT  WINDS  WINDS_SUB
    dt                                          
    2022-05-11 00:30:00    0.5      1  -1.500000
    2022-05-11 01:00:00    1.0      1  -1.500000
    2022-05-11 02:00:00    2.0     10   7.142857
    2022-05-11 08:00:00    8.0      1  -2.333333
    2022-05-11 09:00:00    9.0      1  -2.333333
    2022-05-11 12:00:00   12.0      1  -2.333333
    2022-05-11 13:00:00   13.0      5   0.875000
    2022-05-11 14:00:00   14.0      5   1.714286
    2022-05-11 15:00:00   15.0      5   1.714286
    2022-05-12 00:00:00   24.0      5   0.000000
    2022-05-13 12:00:00   60.0      5   2.666667
    2022-05-13 13:00:00   61.0      1  -1.333333
    2022-05-13 15:00:00   63.0      1  -1.333333
    2022-05-15 04:00:00  100.0     10   0.000000
    

    The result on this small test set matches the output of your code. This assumes UT is representing hours from a certain start timepoint, which seems to be the case by looking at your solution.

    Runtime:

    I tested it on the following df with 30,000 rows:

    import numpy as np
    
    df = pd.DataFrame({'UT':range(30000),
                       'WINDS':np.full(30000, 1)})
    
    def loop(df):
        rolsub = []
    
        for i in df['UT']:
            df1 = df[ (df['UT'] > (i-12)) & (df['UT'] < (i+12)) ]
            df2 = df[df['UT'] == i]
            rolsub +=  [float(df2['WINDS'] - df1['WINDS'].mean())]
    
        df['WIND_SUB'] = rolsub
    
    def vector(df):
        df['dt'] = pd.to_datetime('today').normalize() + pd.to_timedelta(df['UT'], unit='h')
        df = df.set_index('dt')
    
        df['WINDS_SUB'] = df['WINDS'] - df['WINDS'].rolling('24h', center=True, closed='neither').mean()
    
        return df
    
    # 10.1 s ± 171 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit loop(df)
    
    # 1.69 ms ± 71.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    %timeit vector(df)
    

    So it's about 5,000 times faster.