Create a multi-index dataframe from a 2D data frame - Turning a multivariate time-series array to lagged multi-index dataframe

I am trying to turn a 2D multivariate time-series array of shape (n_samples, n_channels) into a DataFrame with a multi-index column given a maximum time-lag. The multi-index column would have the 'channel' at the first level and then the time-lag on the second level. Each row of the resulting dataframe would consist of one time window of the dataset.

Example

I have a dataframe like the following, which is structured as (# of samples, number of variables) along the rows and columns. x1 and x2 are collected over time, so each row is a new time point.

  x1    x2 
0  0.23  0.32
1  0.80  0.68
2  0.39  0.33 
3  0.2   0.4
4  -1.   -1
5  5   8

I have a given "max lag" say value 2, which I would like to create a multi-index column of x1: [lag0, lag1, lag2], x2: [lag0, lag1, lag2], .... The entries in the resulting dataframe would come from every nth component, where n is the max lag. So x1-lag0 row 1 would have the value 0.23, but x1-lag0 row 2 would have value 0.2.

I would like to convert my original dataframe into this new dataframe of (variable, lags) as a multi-index column and # of windowed-samples as the rows.

For the above example, the result would be:

 x1                x2                          
   lag0  lag1  lag2  lag0  lag1  lag2 
0  0.39  0.80  0.23  0.33  0.68  0.32 
1  5      -1    0.2   8     -1.   0.4

Note: the # of samples in the rows would now be the original # of samples / the maximum lag.

Solution

You can create new columns for final index and second level of MultiIndex by integer division by // and counter by GroupBy.cumcount, pivoting by DataFrame.pivot and last rename second level of MultiIndex by lag:

max_lag = 2

arr = np.arange(len(df))
df['g'] = arr // (max_lag + 1)
df['lag'] = df.groupby('g').cumcount(ascending=False)
print (df)
     x1    x2  g  lag
0  0.23  0.32  0    2
1  0.80  0.68  0    1
2  0.39  0.33  0    0
3  0.20  0.40  1    2
4 -1.00 -1.00  1    1
5  5.00  8.00  1    0


df1 = df.pivot(index='g',columns='lag').rename(lambda x: f'lag{x}', level=1, axis=1)
print (df1)
       x1               x2            
lag  lag0 lag1  lag2  lag0  lag1  lag2
g                                     
0    0.39  0.8  0.23  0.33  0.68  0.32
1    5.00 -1.0  0.20  8.00 -1.00  0.40

Test solution with if # of rows that are not nicely divisible by max_lag:

print (df)
     x1    x2
0  0.23  0.32
1  0.80  0.68
2  0.39  0.33
3  0.20  0.40
4 -1.00 -1.00
5  5.00  8.00
6  1.00  2.00
7  5.00  7.00

max_lag = 2

arr = np.arange(len(df))
df['g'] = arr // (max_lag + 1)
df['lag'] = df.groupby('g').cumcount(ascending=False)
print (df)
     x1    x2  g  lag
0  0.23  0.32  0    2
1  0.80  0.68  0    1
2  0.39  0.33  0    0
3  0.20  0.40  1    2
4 -1.00 -1.00  1    1
5  5.00  8.00  1    0
6  1.00  2.00  2    1
7  5.00  7.00  2    0

df1 = df.pivot(index='g',columns='lag').rename(lambda x: f'lag{x}', level=1, axis=1)
print (df1)
       x1               x2            
lag  lag0 lag1  lag2  lag0  lag1  lag2
g                                     
0    0.39  0.8  0.23  0.33  0.68  0.32
1    5.00 -1.0  0.20  8.00 -1.00  0.40
2    5.00  1.0   NaN  7.00  2.00   NaN