Search code examples
pythonpandasgroup-by

pandas group by time intervall with dynamic intervall start


I have a dataframe defined as:

datas = [['A', 51, 'id1', '2020-05-27 05:50:43.346'], ['A', 51, 'id2',
'2020-05-27 05:51:08.347'], ['B', 45, 'id3', '2020-05-24 17:24:05.142'],['B', 45, 'id4', '2020-05-24 17:23:30.141'], ['C', 34,
'id5', '2020-05-23 17:31:10.341']]
df = pd.DataFrame(datas, columns = ['col1', 'col2', 'cold_id',
'dates'])
df['dates'] = pd.to_datetime(df.dates)

looking like this

    col1    col2    cold_id dates
0   A   51  id1 2020-05-27 05:50:43.346
1   A   51  id2 2020-05-27 05:51:08.347
2   B   45  id3 2020-05-24 17:24:05.142
3   B   45  id4 2020-05-24 17:23:30.141
4   C   34  id5 2020-05-23 17:31:10.341

I want to group its rows such that all rows whose date is less than 2 minutes apart form part of one group. I tried the following approach:

df.groupby([pd.Grouper(key='dates', freq='2 min'), 'col1']).agg(','.join).reset_index().sort_values('col1').reset_index(drop=True)

Which yields:

dates   col1    cold_id
0   2020-05-27 05:50:00 A   id1,id2
1   2020-05-24 17:22:00 B   id4
2   2020-05-24 17:24:00 B   id3
3   2020-05-23 17:30:00 C   id5

This is not what I am looking for since rows with id3 and id4 should be within the same group as they are only 30s apart from each other.

My preferred output looks like this:

dates   col1    cold_id
0   2020-05-27 05:50:43.346 A   id1,id2
1   2020-05-24 17:23:30.141 B   id3, id4
3   2020-05-23 17:31:10.341 C   id5

How can it be achieved?


Solution

  • I would use:

    funcs = {c: 'first' for c in df}
    funcs['cold_id'] = ','.join
    
    out = (df.groupby(df['dates'].sort_values().diff().gt('2min').cumsum(),
                      as_index=False, sort=False)
             .agg(funcs)
          )
    

    Note that this doesn't necessarily group by col1/col2, if you need to add this as grouper:

    out = (df.groupby([df['dates'].sort_values().diff().gt('2min').cumsum(),
                      'col1', 'col2'], as_index=False, sort=False)
             .agg(funcs)
          )
    

    Output:

      col1  col2  cold_id                   dates
    0    A    51  id1,id2 2020-05-27 05:50:43.346
    1    B    45  id3,id4 2020-05-24 17:24:05.142
    2    C    34      id5 2020-05-23 17:31:10.341