Search code examples
pandasdataframetime-seriesrolling-computation

Creating a dataframe using roll-forward window on multivariate time series


Based on the simplifed sample dataframe

import pandas as pd
import numpy as np
timestamps = pd.date_range(start='2017-01-01', end='2017-01-5', inclusive='left')
values = np.arange(0,len(timestamps))
df = pd.DataFrame({'A': values ,'B' : values*2},
                       index = timestamps )
print(df)

            A  B
2017-01-01  0  0
2017-01-02  1  2
2017-01-03  2  4
2017-01-04  3  6

I want to use a roll-forward window of size 2 with a stride of 1 to create a resulting dataframe like

     timestep_1  timestep_2  target  
0  A 0           1           2         
   B 0           2           4         
1  A 1           2           3
   B 2           4           6

I.e., each window step should create a data item with the two values of A and B in this window and the A and B values immediately to the right of the window as target values.

My first idea was to use pandas

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html

But that seems to only work in combination with aggregate functions such as sum, which is a different use case.

Any ideas on how to implement this rolling-window-based sampling approach?


Solution

  • Here is one way to do it:

    window_size = 3
    
    new_df = pd.concat(
        [
            df.iloc[i : i + window_size, :]
            .T.reset_index()
            .assign(other_index=i)
            .set_index(["other_index", "index"])
            .set_axis([f"timestep_{j}" for j in range(1, window_size)] + ["target"], axis=1)
            for i in range(df.shape[0] - window_size + 1)
        ]
    )
    
    new_df.index.names = ["", ""]
    
    print(df)
    # Output
         timestep_1  timestep_2  target
    0 A           0           1       2
      B           0           2       4
    1 A           1           2       3
      B           2           4       6