Search code examples
pythonpandasdataframemulti-index

Create a multi-index dataframe from a 2D data frame - Turning a multivariate time-series array to lagged multi-index dataframe


I am trying to turn a 2D multivariate time-series array of shape (n_samples, n_channels) into a DataFrame with a multi-index column given a maximum time-lag. The multi-index column would have the 'channel' at the first level and then the time-lag on the second level. Each row of the resulting dataframe would consist of one time window of the dataset.

Example

I have a dataframe like the following, which is structured as (# of samples, number of variables) along the rows and columns. x1 and x2 are collected over time, so each row is a new time point.

  x1    x2 
0  0.23  0.32
1  0.80  0.68
2  0.39  0.33 
3  0.2   0.4
4  -1.   -1
5  5   8

I have a given "max lag" say value 2, which I would like to create a multi-index column of x1: [lag0, lag1, lag2], x2: [lag0, lag1, lag2], .... The entries in the resulting dataframe would come from every nth component, where n is the max lag. So x1-lag0 row 1 would have the value 0.23, but x1-lag0 row 2 would have value 0.2.

I would like to convert my original dataframe into this new dataframe of (variable, lags) as a multi-index column and # of windowed-samples as the rows.

For the above example, the result would be:

 x1                x2                          
   lag0  lag1  lag2  lag0  lag1  lag2 
0  0.39  0.80  0.23  0.33  0.68  0.32 
1  5      -1    0.2   8     -1.   0.4

Note: the # of samples in the rows would now be the original # of samples / the maximum lag.


Solution

  • You can create new columns for final index and second level of MultiIndex by integer division by // and counter by GroupBy.cumcount, pivoting by DataFrame.pivot and last rename second level of MultiIndex by lag:

    max_lag = 2
    
    arr = np.arange(len(df))
    df['g'] = arr // (max_lag + 1)
    df['lag'] = df.groupby('g').cumcount(ascending=False)
    print (df)
         x1    x2  g  lag
    0  0.23  0.32  0    2
    1  0.80  0.68  0    1
    2  0.39  0.33  0    0
    3  0.20  0.40  1    2
    4 -1.00 -1.00  1    1
    5  5.00  8.00  1    0
    
    
    df1 = df.pivot(index='g',columns='lag').rename(lambda x: f'lag{x}', level=1, axis=1)
    print (df1)
           x1               x2            
    lag  lag0 lag1  lag2  lag0  lag1  lag2
    g                                     
    0    0.39  0.8  0.23  0.33  0.68  0.32
    1    5.00 -1.0  0.20  8.00 -1.00  0.40
    

    Test solution with if # of rows that are not nicely divisible by max_lag:

    print (df)
         x1    x2
    0  0.23  0.32
    1  0.80  0.68
    2  0.39  0.33
    3  0.20  0.40
    4 -1.00 -1.00
    5  5.00  8.00
    6  1.00  2.00
    7  5.00  7.00
    

    max_lag = 2
    
    arr = np.arange(len(df))
    df['g'] = arr // (max_lag + 1)
    df['lag'] = df.groupby('g').cumcount(ascending=False)
    print (df)
         x1    x2  g  lag
    0  0.23  0.32  0    2
    1  0.80  0.68  0    1
    2  0.39  0.33  0    0
    3  0.20  0.40  1    2
    4 -1.00 -1.00  1    1
    5  5.00  8.00  1    0
    6  1.00  2.00  2    1
    7  5.00  7.00  2    0
    
    df1 = df.pivot(index='g',columns='lag').rename(lambda x: f'lag{x}', level=1, axis=1)
    print (df1)
           x1               x2            
    lag  lag0 lag1  lag2  lag0  lag1  lag2
    g                                     
    0    0.39  0.8  0.23  0.33  0.68  0.32
    1    5.00 -1.0  0.20  8.00 -1.00  0.40
    2    5.00  1.0   NaN  7.00  2.00   NaN