I have a time series that I need to split into batches of N rows for training.
For instance, for batches of 3, I need rows [0, 1, 2] labeled as [1, 1, 1], rows [3, 4, 5] labeled as [2, 2, 2], and rows [6, 7, 8] labeled as [3, 3, 3].
Sample Data:
   Diff  N_Bars
0 -2.17    22.0
1  4.13    48.0
2 -0.65     4.0
3  2.06    59.0
4 -2.07    11.0
5  0.68     8.0
6 -0.43     2.0
7  1.21    19.0
8 -0.39     9.0
If you just want to replace the index and don't mind duplicates, you can simply set a new index with index // n_per_group + 1 (floor division). This relies on the index being the default 0, 1, 2, ... as in the sample data:
n_per_group = 3
df.index = df.index // n_per_group + 1
Advantage: You can index by the batch label.
Disadvantage: The index is no longer unique, so df.loc[label] returns several rows, and operations that require unique labels (such as reindex) raise an error.
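As a rough sketch of the index replacement, the snippet below rebuilds the sample data and derives the labels from row positions rather than from the existing index. Using np.arange(len(df)) is my own variation, not part of the approach above; it only matters if the index is not the default 0, 1, 2, ... The names df_batched and batch_2 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    {
        "Diff": [-2.17, 4.13, -0.65, 2.06, -2.07, 0.68, -0.43, 1.21, -0.39],
        "N_Bars": [22.0, 48.0, 4.0, 59.0, 11.0, 8.0, 2.0, 19.0, 9.0],
    }
)

n_per_group = 3

# Position-based labels (assumption: rows are already in time order).
# This gives the same result as df.index // n_per_group + 1 for a default index.
batch_labels = np.arange(len(df)) // n_per_group + 1

# Work on a copy so the original index is kept around
df_batched = df.copy()
df_batched.index = batch_labels

# Selecting a duplicated label returns every row of that batch
batch_2 = df_batched.loc[2]
print(batch_2)
```

Because the labels come from positions, this variant should also work after filtering or sorting has left gaps in the original index.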
Instead of replacing the index, you can of course also store the labels in a new column:
n_per_group = 3
df['batchlabel'] = df.index // n_per_group + 1
Advantage: No duplicates in the index.
Disadvantage: Indexing by the batch label has to be done indirectly, for instance with df[df['batchlabel'] == 2].
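With the label in a column, one way to get at the individual batches is groupby. This is only a sketch: the dictionary name batches and the conversion to NumPy arrays are my own additions, on the assumption that the training code wants one array per batch.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    {
        "Diff": [-2.17, 4.13, -0.65, 2.06, -2.07, 0.68, -0.43, 1.21, -0.39],
        "N_Bars": [22.0, 48.0, 4.0, 59.0, 11.0, 8.0, 2.0, 19.0, 9.0],
    }
)

n_per_group = 3
df["batchlabel"] = df.index // n_per_group + 1

# Map each batch label to a (rows, features) array, leaving out the label column
batches = {
    int(label): group[["Diff", "N_Bars"]].to_numpy()
    for label, group in df.groupby("batchlabel")
}

print(batches[2])
```

If the number of rows is not a multiple of n_per_group, the last batch will simply be shorter than the others.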
But the best way would be to create a MultiIndex with the batches in level 0 and the old indices in level 1. This way you avoid duplicates but are still able to index by the batch number:
import pandas as pd
n_per_group = 3
# create multiindex
new_midx = pd.MultiIndex.from_arrays((df.index // n_per_group + 1, df.index))
# assign multiindex
df_midx = df.set_index(new_midx)
# index by batch number:
df_midx.loc[2]
# Out:
   Diff  N_Bars
3  2.06    59.0
4 -2.07    11.0
5  0.68     8.0
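For completeness, here is a self-contained version of the MultiIndex approach that can be run as-is against the sample data. The level names "batch" and "row" are my own choice for readability and are not required; the groupby at the end is just a quick sanity check that every batch has the expected size.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    {
        "Diff": [-2.17, 4.13, -0.65, 2.06, -2.07, 0.68, -0.43, 1.21, -0.39],
        "N_Bars": [22.0, 48.0, 4.0, 59.0, 11.0, 8.0, 2.0, 19.0, 9.0],
    }
)

n_per_group = 3

# Level 0: batch number, level 1: original row index
new_midx = pd.MultiIndex.from_arrays(
    [df.index // n_per_group + 1, df.index],
    names=["batch", "row"],  # hypothetical level names
)
df_midx = df.set_index(new_midx)

# Select one batch; the result is indexed by the original row labels
second_batch = df_midx.loc[2]

# Count the rows per batch
batch_sizes = df_midx.groupby(level="batch").size()

print(second_batch)
print(batch_sizes)
```

Naming the levels makes later calls such as groupby(level="batch") easier to read than referring to level 0.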