Search code examples
pythonpandasrolling-computationdatetimeoffset

Most efficient way to do rolling window on a datetimeindex with a data offset from the index


I am trying to calculate statistics over a shifted/offset rolling window of an inconsistent datetimeindex of a dataset in a pandas dataframe. I want to bring these statistics back to the current datetimeindex. I have a solution but it is computationally inefficient and impractical to run over my large dataset of millions of rows.

Here is a sample of what I want and my method to achieve it.

df = pd.DataFrame({'Col1': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]},
                   index=pd.DatetimeIndex(['2022-05-25T00:20:00.930','2022-05-25T00:20:01.257','2022-05-25T00:20:01.673','2022-05-25T00:20:03.125','2022-05-25T00:20:04.190',
                                           '2022-05-25T00:20:04.555','2022-05-25T00:20:04.923','2022-05-25T00:20:05.773','2022-05-25T00:20:05.989','2022-05-25T00:20:06.224'],yearfirst=True))

df:   
         Index             Col1
    2022-05-25 00:20:00.930    10
    2022-05-25 00:20:01.257    15
    2022-05-25 00:20:01.673    20
    2022-05-25 00:20:03.125    25
    2022-05-25 00:20:04.190    30
    2022-05-25 00:20:04.555    35
    2022-05-25 00:20:04.923    40
    2022-05-25 00:20:05.773    45
    2022-05-25 00:20:05.989    50
    2022-05-25 00:20:06.224    55

With the above dataset, this is my method to get a shifted rolling window at each index.

df['Col1 Avg'] = 0.0

for row in df.index:
   
    offset_t = datetime.timedelta(seconds=1.5)
    window_t = datetime.timedelta(seconds=1)
    beg = row-offset_t-window_t
    end = row-offset_t+window_t
    
    df['Col1 Avg'].loc[row:row] = df['Col1'].loc[beg:end].mean()

df:
              Index           Col1  Col1 Avg
    2022-05-25 00:20:00.930    10      NaN
    2022-05-25 00:20:01.257    15      NaN
    2022-05-25 00:20:01.673    20     10.0
    2022-05-25 00:20:03.125    25     15.0
    2022-05-25 00:20:04.190    30     25.0
    2022-05-25 00:20:04.555    35     25.0
    2022-05-25 00:20:04.923    40     27.5
    2022-05-25 00:20:05.773    45     35.0
    2022-05-25 00:20:05.989    50     35.0
    2022-05-25 00:20:06.224    55     35.0

Is there a way to do this more efficiently? This takes ~5 minutes for just 10,0000 rows whereas a standard rolling window is <0.05 seconds.

Something like this seems like it should work but doesn't (I think) because of the inconsistent datetimeindex entries.

df['shifted avg'] = df['Col1'].shift(-1,freq=offset_t).rolling('2s').mean()

df:

          Index            Col1 Col1 Avg    shifted avg
2022-05-25 00:20:00.930    10     NaN            NaN
2022-05-25 00:20:01.257    15     NaN            NaN
2022-05-25 00:20:01.673    20    10.0            NaN
2022-05-25 00:20:03.125    25    15.0            NaN
2022-05-25 00:20:04.190    30    25.0            NaN
2022-05-25 00:20:04.555    35    25.0            NaN
2022-05-25 00:20:04.923    40    27.5            NaN
2022-05-25 00:20:05.773    45    35.0            NaN
2022-05-25 00:20:05.989    50    35.0            NaN
2022-05-25 00:20:06.224    55    35.0            NaN

Solution

  • If you resample to 1ms, you can then take a 2 second rolling and offset by 500ms. Then since you have a record for every ms, and that's the resolution of your original index, you can merge them together to get the correct answers.

    import pandas as pd
    
    df = pd.DataFrame({'Col1': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]},
                       index=pd.DatetimeIndex(['2022-05-25T00:20:00.930','2022-05-25T00:20:01.257','2022-05-25T00:20:01.673','2022-05-25T00:20:03.125','2022-05-25T00:20:04.190',
                                               '2022-05-25T00:20:04.555','2022-05-25T00:20:04.923','2022-05-25T00:20:05.773','2022-05-25T00:20:05.989','2022-05-25T00:20:06.224'],yearfirst=True))
    
    
    df = df.merge(df.resample('1ms')
                    .min()
                    .rolling('2S')
                    .mean()
                    .shift(500)
                    .rename(columns={'Col1':'Col1 Avg'}),
                  left_index=True, 
                  right_index=True)
    
    print(df)
    

    Output

                            Col1  Col1 Avg
    2022-05-25 00:20:00.930    10       NaN
    2022-05-25 00:20:01.257    15       NaN
    2022-05-25 00:20:01.673    20      10.0
    2022-05-25 00:20:03.125    25      15.0
    2022-05-25 00:20:04.190    30      25.0
    2022-05-25 00:20:04.555    35      25.0
    2022-05-25 00:20:04.923    40      27.5
    2022-05-25 00:20:05.773    45      35.0
    2022-05-25 00:20:05.989    50      35.0
    2022-05-25 00:20:06.224    55      35.0