Search code examples
pythonpandasdataframegroup-by

Grouping pandas dataframe by datetime and id, but having random sizes of groups


I have this problem, where i have columns: id, date and value. I need to groupby the id and date so that in one group, the date values should not be more than 7 days away from each other.

The data looks like this:

df = pd.DataFrame({'id' : [1, 2, 2, 3, 3, 3, 4], 'date' : ['2019-02-28', '2019-02-27', '2019-02-28', '2019-02-1', '2019-02-2', '2019-03-20', '2019-02-28'], 'value' : ['a', 'b', 'c', 'c', 'a', 'b', 'c']})

And i want to take the date groups earliest date and make a list of the values.

I expect to have something like this:

| id       | date       | value      |
| -------- | ---------- | ---------- |
| 1        | 2019-02-28 | ['a']      |
| 2        | 2019-02-27 | ['b', 'c'] |
| 3        | 2019-02-1  | ['c', 'a'] |
| 3        | 2019-03-20 | ['b']      |
| 4        | 2019-02-28 | ['c']      |

Notice that the id group can have multiple different datetime groups.

I find it hard to sort the datetime values and make this groupby at the same time while checking that the date difference to the row next to it is below 7 days.


Solution

  • You can first use a custom groupby.transform to identify the gaps of more than 7 days, then groupby.agg:

    df['date'] = pd.to_datetime(df['date'], dayfirst=False)
    
    group = (df.groupby('id')['date']
             .transform(lambda s: s.diff().gt('7D').cumsum())
            )
    
    out = (df.groupby(['id', group], as_index=False)
             .agg({'date': 'min', 'value': list})
          )
    

    Output:

       id       date   value
    0   1 2019-02-28     [a]
    1   2 2019-02-27  [b, c]
    2   3 2019-02-01  [c, a]
    3   3 2019-03-20     [b]
    4   4 2019-02-28     [c]
    

    If you want to ensure that all dates within a group are within 7 days (e.g. taking consecutive days should split after 7 rows), then you need a custom function:

    def grouper(s, thresh='7D'):
        groups = []
        group = 0
        ref = s.iloc[0]
        for date in s:
            if (date-ref) >= pd.Timedelta(thresh):
                group += 1
                ref = date
            groups.append(group)
        return pd.Series(groups, index=s.index)
    
    out = (df.groupby(['id', df.groupby('id')['date'].transform(grouper)],
                      as_index=False)
             .agg({'date': 'min', 'value': list})
          )