Search code examples
pythonpandasdataframepython-datetime

Group by time interval for 24 hrs and assign group values


I have a pandas data frame like below that contains date-time values in column B.


import pandas as pd

data = {'A': ['XYZ', 'XYZ', 'XYZ', 'XYZ', 'PQR', 'PQR', 'PQR', 'PQR', 'CVB', 'CVB', 'CVB', 'CVB'], 
        'B': ['2022-02-16 14:00:31', '2022-02-16 16:11:26', '2022-02-16 17:31:26', '2022-02-16 22:47:46', '2022-02-17 07:11:11', '2022-02-17 10:43:36', '2022-02-17 15:05:11', '2022-02-18 18:06:12', '2022-02-19 09:05:46', '2022-02-19 13:02:16', '2022-02-19 18:05:26', '2022-02-19 22:05:26']}
df = pd.DataFrame(data)
df['B'] = pd.to_datetime(df['B'])
df


     |   A   |          B           |                                 
     +-------+----------------------+
     |  XYZ  |  2022-02-16 14:00:31 |                  
     |  XYZ  |  2022-02-16 16:11:26 |         
     |  XYZ  |  2022-02-16 17:31:26 | 
     |  XYZ  |  2022-02-16 22:47:46 |  
     |  PQR  |  2022-02-17 07:11:11 | 
     |  PQR  |  2022-02-17 10:43:36 |
     |  PQR  |  2022-02-17 15:05:11 |
     |  PQR  |  2022-02-18 18:06:12 |
     |  CVB  |  2022-02-19 09:05:46 |
     |  CVB  |  2022-02-19 13:02:16 |
     |  CVB  |  2022-02-19 18:05:26 |
     |  CVB  |  2022-02-19 22:05:26 |
     +-------+----------------------+

I want to group the date-time values for 24hr interval period, such that my output should look like below.

Expected Output :

     |   A   |          B           |   Group  |                               
     +-------+----------------------+-----------
     |  XYZ  |  2022-02-16 14:00:31 |     1    |        
     |  XYZ  |  2022-02-16 16:11:26 |     1    |
     |  XYZ  |  2022-02-16 17:31:26 |     1    |
     |  XYZ  |  2022-02-16 22:47:46 |     1    |
     |  PQR  |  2022-02-17 07:11:11 |     1    |  
     |  PQR  |  2022-02-17 10:43:36 |     1    |     
     |  PQR  |  2022-02-17 15:05:11 |     2    |  
     |  PQR  |  2022-02-18 18:06:12 |     3    |  
     |  CVB  |  2022-02-19 09:05:46 |     3    | 
     |  CVB  |  2022-02-19 13:02:16 |     3    |
     |  CVB  |  2022-02-19 18:05:26 |     3    |
     |  CVB  |  2022-02-19 22:05:26 |     4    |
     +-------+----------------------+----------+

Currently, I tried to group the date-time values in column B for the 24hr interval time period using the below code, which resulted in an unsuccessful output that I'm not expecting. My 24hr period will first start from the initial datetime i.e 2022-02-16 14:00:31 and then the next 24hr period will start from 2022-02-17 15:05:11 and so on.


df1 = df.reset_index().set_index(df['B']).rename_axis(None)
df1.loc[df1.first('24h').index, "GROUP"] = 1
df1

Actual Output :

     |   A   |          B           |   Group    |                               
     +-------+----------------------+------------+
     |  XYZ  |  2022-02-16 14:00:31 |     1      |        
     |  XYZ  |  2022-02-16 16:11:26 |     1      |
     |  XYZ  |  2022-02-16 17:31:26 |     1      |
     |  XYZ  |  2022-02-16 22:47:46 |     1      |
     |  PQR  |  2022-02-17 07:11:11 |     1      |  
     |  PQR  |  2022-02-17 10:43:36 |     1      |     
     |  PQR  |  2022-02-17 15:05:11 |     NaN    |  
     |  PQR  |  2022-02-18 18:06:12 |     NaN    |  
     |  CVB  |  2022-02-19 09:05:46 |     NaN    | 
     |  CVB  |  2022-02-19 13:02:16 |     NaN    |
     |  CVB  |  2022-02-19 18:05:26 |     NaN    |
     |  CVB  |  2022-02-19 22:05:26 |     NaN    |
     +-------+----------------------+------------+


Is there a way where I can group the date-time values for the 24hr time interval period such that I can get the expected output as shown? I want to have an efficient solution so that the code can run on 1 million time stamps efficiently.


Solution

  • As far as each step of this process depends on the result of the previous one, we can't use resample and transform. We have to iterate all the data sequentially mapping them in a prescribed manner:

    step = pd.Timedelta('24H')
    sentinel = df.loc[0, 'B'] + step
    group_id = 1
    for index, value in df['B'].items():
        if value > sentinel:
            sentinel = value + step
            group_id += 1
        df.loc[index, 'Group'] = group_id
    

    We can also hide this process in a Series.map with help of a generator, which may work somewhat faster:

    def gen(start, step):
        sentinel = start + step
        group_id = 1
        value = yield
        while True:
            if value > sentinel:
                sentinel = value + step
                group_id += 1
            value = yield group_id
    
    
    start, step = df.loc[0, 'B'], pd.Timedelta('24H')
    marker = gen(start, step).send
    marker(None)     # pump/init/push the generator to the first yield
    df['Group'] = df['B'].map(marker)
    del marker
    

    Note, that all this works if df['B'] is sorted beforehand (which seems logical to assume due to the described grouping algorithm).

    Update (about generators)

    A new generator gen(start, step) has a methond .send(...) used to pass inside of it some value. So we can switch from the generator itself to the function my_func = get(start, step).send which will get some value and return next item from the original generator depending on this value (saving its internal state between calls).

    The first call my_func(None) is initialisation of the generator (pushing the code up to the first yield). Next time, when we call my_func(val) the variable val will be passed inside the generator (it will be assigned to value in the code above at the 4th line value = yield), and the code of the generator will continue processing from the next line after yield until it reaches next yielding, and so on.

    As for the line del marker, it is the manual deletion of a variable that refers to a generator object. Nothing special, just cleaning (not sure if it's vital though, we can delete this line).