Search code examples
pythonpandasdataframebinning

Bin pandas dataframe based on one column, then calculate values of different column as the sum of rows and index of rows in each bin and


I have some large text files that I want to bin into equal intervals, and then redefine one column as the sum of rows in each bin, and the redefine a second column as the index for each bin. Here is a simplified version of my dataset:

  time RN  NOR
0  100  0  0
1  101  0  0
2  104  0  0
3  105  0  0
4  107  0  0
5  110  0  0
6  114  0  0
7  115  0  0
8  116  0  0
9  118  0  0

Specifically, I want to bin the time column (which is in seconds) into bins every 5 seconds, for example. Then I want to calculate the NR column to be the number of returns for each bin, and the NR column to be the chronological return number for each individual bin. The output should look like this:

  time RN NR bin
0  100  1  3  1
1  101  2  3  1
2  104  3  3  1

3  105  1  2  2
4  107  2  2  2

5  110  1  2  3
6  114  2  2  3

7  115  1  3  4
8  116  2  3  4
9  118  3  3  4

Is there a way to do this with python?

I have tried following this example here to create bins but I am stumped about how to add the row count and sum of rows to my binned dataset.


Solution

  • Create bin column by subtract first minimal value with Series.floordiv and add 1, for RN use counter by GroupBy.cumcount and for NOR mapping by Series.value_counts:

    df['bin'] = df['time'].sub(df['time'].min()).floordiv(5).add(1)
    df['RN'] = df.groupby('bin').cumcount().add(1)
    df['NOR'] = df['bin'].map(df['bin'].value_counts())
    print (df)
       time  RN  NOR  bin
    0   100   1    3    1
    1   101   2    3    1
    2   104   3    3    1
    3   105   1    2    2
    4   107   2    2    2
    5   110   1    2    3
    6   114   2    2    3
    7   115   1    3    4
    8   116   2    3    4
    9   118   3    3    4