python, pandas, sklearn-pandas, data-wrangling

Pandas how to create groups of continuous batches (time series data)


I have a time series that I need to split into consecutive batches of N rows for training. For instance, with batches of 3, rows [0, 1, 2] should be labeled [1, 1, 1], rows [3, 4, 5] labeled [2, 2, 2], and rows [6, 7, 8] labeled [3, 3, 3].

Sample Data:

   Diff  N_Bars
0 -2.17    22.0
1  4.13    48.0
2 -0.65     4.0
3  2.06    59.0
4 -2.07    11.0
5  0.68     8.0
6 -0.43     2.0
7  1.21    19.0
8 -0.39     9.0

Solution

  • If you just want to replace the index and don't mind duplicate labels, you can set a new index with index // n_per_group + 1 (floor division). This assumes the default integer index 0, 1, 2, ... shown in the sample data:

    n_per_group = 3
    df.index = df.index // n_per_group + 1
    

    Advantage: You can index by the batch label.
    Disadvantage: The index is no longer unique, so label-based lookups return several rows and operations that need a unique index (reindexing, for example) raise an error.


    Instead of replacing the index, you can also store the batch label in a new column:

    n_per_group = 3
    df['batchlabel'] = df.index // n_per_group + 1
    

    Advantage: No duplicates in the index.
    Disadvantage: Indexing by the batch label has to be done indirectly, for example with df[df['batchlabel'] == 2].

    Recommended solution:


    The best way is probably to create a MultiIndex with the batch numbers in level 0 and the old index values in level 1. This avoids duplicates in the index while still letting you index by batch number:

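    import pandas as pd

    n_per_group = 3
    # create a MultiIndex: batch number in level 0, original index in level 1
    new_midx = pd.MultiIndex.from_arrays((df.index // n_per_group + 1, df.index))
    # assign the MultiIndex (returns a new DataFrame; df itself is unchanged)
    df_midx = df.set_index(new_midx)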
    
    # index by batch number:
    df_midx.loc[2]
    # Out:
       Diff  N_Bars
    3  2.06    59.0
    4 -2.07    11.0
    5  0.68     8.0
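
All the variants above compute the label from the index values, so they rely on the index being 0, 1, 2, .... Time series often carry a DatetimeIndex instead, where df.index // n_per_group does not apply. A minimal sketch of one way around that is to label rows by position with np.arange(len(df)). The DatetimeIndex, the df_labeled name and the batches dict below are assumptions made for illustration; they are not part of the question.

```python
import numpy as np
import pandas as pd

# Sample data from the question, but with a daily DatetimeIndex
# (an assumption for illustration; the question shows a plain 0..8 index).
df = pd.DataFrame(
    {
        "Diff": [-2.17, 4.13, -0.65, 2.06, -2.07, 0.68, -0.43, 1.21, -0.39],
        "N_Bars": [22.0, 48.0, 4.0, 59.0, 11.0, 8.0, 2.0, 19.0, 9.0],
    },
    index=pd.date_range("2021-01-01", periods=9, freq="D"),
)

n_per_group = 3

# Label rows by their position, not by their index value.
batch_labels = np.arange(len(df)) // n_per_group + 1

# Store the labels in a new column, leaving the original index untouched.
df_labeled = df.assign(batchlabel=batch_labels)

# Collect each batch as a NumPy array, keyed by its batch label.
batches = {
    label: group[["Diff", "N_Bars"]].to_numpy()
    for label, group in df_labeled.groupby("batchlabel")
}
```

With this, batches[2] holds the three rows of the second batch. If the number of rows is not a multiple of n_per_group, the last batch simply contains fewer rows, so you may want to drop or pad it before training.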