Search code examples
pythonpandastimestampbin

Pandas - Bin rows based on time interval


Having the following DF:

             timestamp   id       val
0  2022-01-01 00:37:34  1.0  0.128464
1  2022-01-01 00:52:15  1.0  0.823504
2  2022-01-01 02:00:01  1.0  0.807617
3  2022-01-01 02:37:14  1.0  0.154851
4  2022-01-01 04:44:46  1.0  0.049817
5  2022-01-01 00:03:06  2.0  0.538565
6  2022-01-01 00:04:05  2.0  0.332919
7  2022-01-01 00:04:24  2.0  0.106591
8  2022-01-01 00:05:41  2.0  0.552562
9  2022-01-01 00:05:58  2.0  0.851130
10 2022-01-01 00:06:58  2.0  0.400711
11 2022-01-01 00:08:43  2.0  0.840532
12 2022-01-01 00:08:52  2.0  0.184425
13 2022-01-01 00:12:52  2.0  0.956525
14 2022-01-01 00:15:52  2.0  0.403509

I'm attempting to bin the values for every row within a 5min interval plus add missing rows where the interval exceeds 5min, as follows:

             timestamp   id       val
0  2022-01-01 00:37:34  1.0  0.128464 \_ val mean
1  2022-01-01 00:52:15  1.0  0.823504 /
-------------------------------------
Add missing 5min intervals with val 0
from 2022-01-01 00:52:15
to 2022-01-01 02:00:01
-------------------------------------
2  2022-01-01 02:00:01  1.0  0.807617 - val
-------------------------------------
Add missing 5min intervals with val 0
from 2022-01-01 02:00:01
to 2022-01-01 02:37:14
-------------------------------------
3  2022-01-01 02:37:14  1.0  0.154851 - val
-------------------------------------
Add missing 5min intervals with val 0
from 2022-01-01 02:37:14
to 2022-01-01 04:44:46
-------------------------------------
4  2022-01-01 04:44:46  1.0  0.049817 - val

              New Group
-------------------------------------
5  2022-01-01 00:03:06  2.0  0.538565 \
6  2022-01-01 00:04:05  2.0  0.332919 |
7  2022-01-01 00:04:24  2.0  0.106591 |_ val mean
8  2022-01-01 00:05:41  2.0  0.552562 |
9  2022-01-01 00:05:58  2.0  0.851130 |
10 2022-01-01 00:06:58  2.0  0.400711 /
-------------------------------------
11 2022-01-01 00:08:43  2.0  0.840532 \
12 2022-01-01 00:08:52  2.0  0.184425 |_ val mean
13 2022-01-01 00:12:52  2.0  0.956525 /
-------------------------------------
14 2022-01-01 00:15:52  2.0  0.403509 - val

As such, the resulting frame would consist of 5min mean values of val, and rows with 0 val where no activity occurred. I tried using pd.Grouper(key="timestamp", freq='5min', origin='start') to obtain the 5min intervals, but I wasn't sure where to proceed next.

any help would be appreciated.


Solution

  • Does this what you are looking for:

    def process(sdf):
        return (sdf.resample("5min", on="timestamp", origin=sdf.timestamp.iat[0])
                   .mean().fillna({"id": sdf.name, "val": 0}))
    
    df = (df.groupby("id", as_index=False).apply(process)
            .droplevel(level=0, axis=0).reset_index(drop=False))
    

    Result:

                 timestamp   id       val
    0  2022-01-01 00:37:34  1.0  0.128464
    1  2022-01-01 00:42:34  1.0  0.000000
    2  2022-01-01 00:47:34  1.0  0.823504
    3  2022-01-01 00:52:34  1.0  0.000000
    4  2022-01-01 00:57:34  1.0  0.000000
    5  2022-01-01 01:02:34  1.0  0.000000
    6  2022-01-01 01:07:34  1.0  0.000000
    7  2022-01-01 01:12:34  1.0  0.000000
    8  2022-01-01 01:17:34  1.0  0.000000
    9  2022-01-01 01:22:34  1.0  0.000000
    10 2022-01-01 01:27:34  1.0  0.000000
    11 2022-01-01 01:32:34  1.0  0.000000
    12 2022-01-01 01:37:34  1.0  0.000000
    13 2022-01-01 01:42:34  1.0  0.000000
    14 2022-01-01 01:47:34  1.0  0.000000
    15 2022-01-01 01:52:34  1.0  0.000000
    16 2022-01-01 01:57:34  1.0  0.807617
    17 2022-01-01 02:02:34  1.0  0.000000
    18 2022-01-01 02:07:34  1.0  0.000000
    19 2022-01-01 02:12:34  1.0  0.000000
    20 2022-01-01 02:17:34  1.0  0.000000
    21 2022-01-01 02:22:34  1.0  0.000000
    22 2022-01-01 02:27:34  1.0  0.000000
    23 2022-01-01 02:32:34  1.0  0.154851
    24 2022-01-01 02:37:34  1.0  0.000000
    25 2022-01-01 02:42:34  1.0  0.000000
    26 2022-01-01 02:47:34  1.0  0.000000
    27 2022-01-01 02:52:34  1.0  0.000000
    28 2022-01-01 02:57:34  1.0  0.000000
    29 2022-01-01 03:02:34  1.0  0.000000
    30 2022-01-01 03:07:34  1.0  0.000000
    31 2022-01-01 03:12:34  1.0  0.000000
    32 2022-01-01 03:17:34  1.0  0.000000
    33 2022-01-01 03:22:34  1.0  0.000000
    34 2022-01-01 03:27:34  1.0  0.000000
    35 2022-01-01 03:32:34  1.0  0.000000
    36 2022-01-01 03:37:34  1.0  0.000000
    37 2022-01-01 03:42:34  1.0  0.000000
    38 2022-01-01 03:47:34  1.0  0.000000
    39 2022-01-01 03:52:34  1.0  0.000000
    40 2022-01-01 03:57:34  1.0  0.000000
    41 2022-01-01 04:02:34  1.0  0.000000
    42 2022-01-01 04:07:34  1.0  0.000000
    43 2022-01-01 04:12:34  1.0  0.000000
    44 2022-01-01 04:17:34  1.0  0.000000
    45 2022-01-01 04:22:34  1.0  0.000000
    46 2022-01-01 04:27:34  1.0  0.000000
    47 2022-01-01 04:32:34  1.0  0.000000
    48 2022-01-01 04:37:34  1.0  0.000000
    49 2022-01-01 04:42:34  1.0  0.049817
    50 2022-01-01 00:03:06  2.0  0.463746
    51 2022-01-01 00:08:06  2.0  0.660494
    52 2022-01-01 00:13:06  2.0  0.403509
    

    But I don't understand this requirement:

                 timestamp   id       val
    0  2022-01-01 00:37:34  1.0  0.128464 \_ val mean
    1  2022-01-01 00:52:15  1.0  0.823504 /
    

    Those 2 timestamps are not within a 5 minutes interval?