I am trying to prepare data for time series classification using sktime and minirocket. The classification will work on time windows of 20 seconds. The data need to be given to minirocket as multi index pandas dataframes, like this one for example :
id name 20_s_window
1 A 0.469112 0.684662 1.462547
B -0.282863 ...
2 A -1.509059 ...
B -1.135632 ...
3 A 1.212112 ...
B -0.173215 ...
4 A 0.119209 ...
B -1.044236 ...
Each "id" row represents 20 seconds of data.
But my current dataframe looks like this :
A B
1 0.469112 -0.282863
2 0.684662 ...
With all data for A in the same column, etc. And the row index being time in ms.
The closest I got to the expected result is using these lines of code :
df["timestamp"] = pd.to_datetime(df.index, unit="ms")
df_transpose = df.groupby(pd.Grouper(key="timestamp", freq="20s")).apply(
lambda x: [x[track].tolist() for track in df.columns]
)
Which gives this result :
timestamp
1970-01-01 00:05:00 [[69.19940185546875, 68.21199798583984, 68.211...
1970-01-01 00:05:20 [[82.0363998413086, 81.04889678955078, 79.0739...
1970-01-01 00:05:40 [[67.22450256347656, 67.22450256347656, 67.224...
1970-01-01 00:06:00 [[62.287200927734375, 63.27470016479492, 62.28...
The issue is that every column end up in the same list on the same row, instead of have a separate index.
I also tried creating the multi index dataframe first with this :
col = ["A", "B"]
indexes = list(range(10))
iterables = [indexes, col]
df = pd.MultiIndex.from_product(iterables, names=["col", "index"])
Which gives the following object :
MultiIndex([(0, 'A'),
(0, 'B'),
(1, 'A'),
(1, 'B'),
(2, 'A'),
(2, 'B'),
(3, 'A'),
(3, 'B'),
(4, 'A'),
(4, 'B')],
names=['col', 'index'])
But I have no idea how to put data from the original dataframe in it.
Data
import pandas as pd
tracks = ["A", "B", "C"]
values = [[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]]
df = pd.DataFrame(values, columns=tracks)
df["timestamp"] = pd.to_datetime(df.index, unit="s")
grouper = pd.Grouper(key="timestamp", freq="2s")
df_transpose = df.groupby(grouper).apply(
lambda x: [x[track].tolist() for track in df.columns if track != "timestamp"]
)
Let's say I want to group this dataframe by windows of 2 rows, my goal is to make this example look like this :
id name values
1 A 1 2
1 B 5 6
1 C 9 10
2 A 3 4
2 B 7 8
2 C 11 12
But it looks like this:
id
1970-01-01 00:00:00 [[1, 2], [5, 6], [9, 10]]
1970-01-01 00:00:02 [[3, 4], [7, 8], [11, 12]]
Is there a way to do it ? Thank you in advance.
I misunderstood the input structure of sktime, so in the end I only used half of @mozway column answer :
df["timestamp"] = pd.to_datetime(df.index, unit="ms")
grouper = pd.Grouper(key="timestamp", freq="20s")
out = (
df.assign(
window=(g := df.groupby(grouper)).ngroup().add(1), row=g.cumcount().add(1)
)
.drop(columns="timestamp")
.set_index(["window", "row"])
)
In the end it looks like this
A B C
window row
1 1 1 5 9
2 2 6 10
2 1 3 7 11
2 4 8 12
It looks like you want to reshape your DataFrame:
grouper = pd.Grouper(key="timestamp", freq="2s")
out = (df
.assign(id=df.groupby(grouper).ngroup().add(1))
.drop(columns='timestamp').rename_axis(columns='name')
.set_index(['id']).stack()
.groupby(level=['id', 'name']).agg(list)
.reset_index(name='values')
)
Output:
id name values
0 1 A [1, 2]
1 1 B [5, 6]
2 1 C [9, 10]
3 2 A [3, 4]
4 2 B [7, 8]
5 2 C [11, 12]
If you want multiple columns:
grouper = pd.Grouper(key="timestamp", freq="2s")
out = (df
.assign(id=(g:=df.groupby(grouper)).ngroup().add(1),
col=g.cumcount().add(1)
)
.drop(columns='timestamp').rename_axis(columns='name')
.set_index(['id', 'col']).stack()
.unstack('col').add_prefix('value').reset_index()
)
Output:
col id name value1 value2
0 1 A 1 2
1 1 B 5 6
2 1 C 9 10
3 2 A 3 4
4 2 B 7 8
5 2 C 11 12