python pandas dataframe multi-index sktime

How to group a multivariate time series dataframe into windows of 20 seconds

Goal

I am trying to prepare data for time series classification using sktime and minirocket. The classification will work on time windows of 20 seconds. The data need to be given to minirocket as multi index pandas dataframes, like this one for example :

id   name    20_s_window
1    A       0.469112 0.684662 1.462547
     B      -0.282863 ...
2    A      -1.509059 ... 
     B      -1.135632 ...
3    A       1.212112 ...
     B      -0.173215 ...
4    A       0.119209 ...
     B      -1.044236 ...

Each "id" row represents 20 seconds of data.

But my current dataframe looks like this :

     A          B
1    0.469112   -0.282863
2    0.684662   ...

With all data for A in the same column, etc. And the row index being time in ms.

What I've tried

The closest I got to the expected result is using these lines of code :

df["timestamp"] = pd.to_datetime(df.index, unit="ms")
df_transpose = df.groupby(pd.Grouper(key="timestamp", freq="20s")).apply(
    lambda x: [x[track].tolist() for track in df.columns]
)

Which gives this result :

timestamp
1970-01-01 00:05:00    [[69.19940185546875, 68.21199798583984, 68.211...
1970-01-01 00:05:20    [[82.0363998413086, 81.04889678955078, 79.0739...
1970-01-01 00:05:40    [[67.22450256347656, 67.22450256347656, 67.224...
1970-01-01 00:06:00    [[62.287200927734375, 63.27470016479492, 62.28...

The issue is that every column end up in the same list on the same row, instead of have a separate index.

I also tried creating the multi index dataframe first with this :

col = ["A", "B"]
indexes = list(range(10))
iterables = [indexes, col]
df = pd.MultiIndex.from_product(iterables, names=["col", "index"])

Which gives the following object :

MultiIndex([(0, 'A'),
            (0, 'B'),
            (1, 'A'),
            (1, 'B'),
            (2, 'A'),
            (2, 'B'),
            (3, 'A'),
            (3, 'B'),
            (4, 'A'),
            (4, 'B')],
           names=['col', 'index'])

But I have no idea how to put data from the original dataframe in it.

Reproducible example

Data

import pandas as pd

tracks = ["A", "B", "C"]
values = [[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]]
df = pd.DataFrame(values, columns=tracks)

df["timestamp"] = pd.to_datetime(df.index, unit="s")
grouper = pd.Grouper(key="timestamp", freq="2s")
df_transpose = df.groupby(grouper).apply(
    lambda x: [x[track].tolist() for track in df.columns if track != "timestamp"]
)

Let's say I want to group this dataframe by windows of 2 rows, my goal is to make this example look like this :

id name values
1  A    1  2
1  B    5  6
1  C    9  10
2  A    3  4
2  B    7  8
2  C    11 12

But it looks like this:

id
1970-01-01 00:00:00    [[1, 2], [5, 6], [9, 10]]
1970-01-01 00:00:02    [[3, 4], [7, 8], [11, 12]]

Is there a way to do it ? Thank you in advance.

Edit after answer

I misunderstood the input structure of sktime, so in the end I only used half of @mozway column answer :

df["timestamp"] = pd.to_datetime(df.index, unit="ms")
grouper = pd.Grouper(key="timestamp", freq="20s")
out = (
    df.assign(
        window=(g := df.groupby(grouper)).ngroup().add(1), row=g.cumcount().add(1)
    )
    .drop(columns="timestamp")
    .set_index(["window", "row"])
)

In the end it looks like this

            A  B   C
window row          
1      1    1  5   9
       2    2  6  10
2      1    3  7  11
       2    4  8  12

Solution

It looks like you want to reshape your DataFrame:

grouper = pd.Grouper(key="timestamp", freq="2s")

out = (df
  .assign(id=df.groupby(grouper).ngroup().add(1))
  .drop(columns='timestamp').rename_axis(columns='name')
  .set_index(['id']).stack()
  .groupby(level=['id', 'name']).agg(list)
  .reset_index(name='values')
)

Output:

   id name    values
0   1    A    [1, 2]
1   1    B    [5, 6]
2   1    C   [9, 10]
3   2    A    [3, 4]
4   2    B    [7, 8]
5   2    C  [11, 12]

If you want multiple columns:

grouper = pd.Grouper(key="timestamp", freq="2s")

out = (df
  .assign(id=(g:=df.groupby(grouper)).ngroup().add(1),
          col=g.cumcount().add(1)
         )
  .drop(columns='timestamp').rename_axis(columns='name')
  .set_index(['id', 'col']).stack()
  .unstack('col').add_prefix('value').reset_index()
)

Output:

col  id name  value1  value2
0     1    A       1       2
1     1    B       5       6
2     1    C       9      10
3     2    A       3       4
4     2    B       7       8
5     2    C      11      12