I am trying to turn a 2D multivariate time-series array of shape (n_samples, n_channels)
into a DataFrame with a multi-index column given a maximum time-lag. The multi-index column would have the 'channel' at the first level and then the time-lag on the second level. Each row of the resulting dataframe would consist of one time window of the dataset.
I have a dataframe like the following, which is structured as (# of samples, number of variables) along the rows and columns. x1
and x2
are collected over time, so each row is a new time point.
x1 x2
0 0.23 0.32
1 0.80 0.68
2 0.39 0.33
3 0.2 0.4
4 -1. -1
5 5 8
I have a given "max lag" say value 2, which I would like to create a multi-index column of x1: [lag0, lag1, lag2], x2: [lag0, lag1, lag2], ...
. The entries in the resulting dataframe would come from every nth component, where n is the max lag. So x1-lag0
row 1 would have the value 0.23, but x1-lag0
row 2 would have value 0.2.
I would like to convert my original dataframe into this new dataframe of (variable, lags) as a multi-index column and # of windowed-samples as the rows.
For the above example, the result would be:
x1 x2
lag0 lag1 lag2 lag0 lag1 lag2
0 0.39 0.80 0.23 0.33 0.68 0.32
1 5 -1 0.2 8 -1. 0.4
Note: the # of samples in the rows would now be the original # of samples / the maximum lag.
You can create new columns for final index and second level of MultiIndex
by integer division by //
and counter by GroupBy.cumcount
, pivoting by DataFrame.pivot
and last rename second level of MultiIndex
by lag
:
max_lag = 2
arr = np.arange(len(df))
df['g'] = arr // (max_lag + 1)
df['lag'] = df.groupby('g').cumcount(ascending=False)
print (df)
x1 x2 g lag
0 0.23 0.32 0 2
1 0.80 0.68 0 1
2 0.39 0.33 0 0
3 0.20 0.40 1 2
4 -1.00 -1.00 1 1
5 5.00 8.00 1 0
df1 = df.pivot(index='g',columns='lag').rename(lambda x: f'lag{x}', level=1, axis=1)
print (df1)
x1 x2
lag lag0 lag1 lag2 lag0 lag1 lag2
g
0 0.39 0.8 0.23 0.33 0.68 0.32
1 5.00 -1.0 0.20 8.00 -1.00 0.40
Test solution with if # of rows that are not nicely divisible by max_lag
:
print (df)
x1 x2
0 0.23 0.32
1 0.80 0.68
2 0.39 0.33
3 0.20 0.40
4 -1.00 -1.00
5 5.00 8.00
6 1.00 2.00
7 5.00 7.00
max_lag = 2
arr = np.arange(len(df))
df['g'] = arr // (max_lag + 1)
df['lag'] = df.groupby('g').cumcount(ascending=False)
print (df)
x1 x2 g lag
0 0.23 0.32 0 2
1 0.80 0.68 0 1
2 0.39 0.33 0 0
3 0.20 0.40 1 2
4 -1.00 -1.00 1 1
5 5.00 8.00 1 0
6 1.00 2.00 2 1
7 5.00 7.00 2 0
df1 = df.pivot(index='g',columns='lag').rename(lambda x: f'lag{x}', level=1, axis=1)
print (df1)
x1 x2
lag lag0 lag1 lag2 lag0 lag1 lag2
g
0 0.39 0.8 0.23 0.33 0.68 0.32
1 5.00 -1.0 0.20 8.00 -1.00 0.40
2 5.00 1.0 NaN 7.00 2.00 NaN