Grouping pandas dataframe by datetime and id, but having random sizes of groups

I have this problem, where i have columns: id, date and value. I need to groupby the id and date so that in one group, the date values should not be more than 7 days away from each other.

The data looks like this:

df = pd.DataFrame({'id' : [1, 2, 2, 3, 3, 3, 4], 'date' : ['2019-02-28', '2019-02-27', '2019-02-28', '2019-02-1', '2019-02-2', '2019-03-20', '2019-02-28'], 'value' : ['a', 'b', 'c', 'c', 'a', 'b', 'c']})

And i want to take the date groups earliest date and make a list of the values.

I expect to have something like this:

| id       | date       | value      |
| -------- | ---------- | ---------- |
| 1        | 2019-02-28 | ['a']      |
| 2        | 2019-02-27 | ['b', 'c'] |
| 3        | 2019-02-1  | ['c', 'a'] |
| 3        | 2019-03-20 | ['b']      |
| 4        | 2019-02-28 | ['c']      |

Notice that the id group can have multiple different datetime groups.

I find it hard to sort the datetime values and make this groupby at the same time while checking that the date difference to the row next to it is below 7 days.

Solution

You can first use a custom groupby.transform to identify the gaps of more than 7 days, then groupby.agg:

df['date'] = pd.to_datetime(df['date'], dayfirst=False)

group = (df.groupby('id')['date']
         .transform(lambda s: s.diff().gt('7D').cumsum())
        )

out = (df.groupby(['id', group], as_index=False)
         .agg({'date': 'min', 'value': list})
      )

Output:

   id       date   value
0   1 2019-02-28     [a]
1   2 2019-02-27  [b, c]
2   3 2019-02-01  [c, a]
3   3 2019-03-20     [b]
4   4 2019-02-28     [c]

If you want to ensure that all dates within a group are within 7 days (e.g. taking consecutive days should split after 7 rows), then you need a custom function:

def grouper(s, thresh='7D'):
    groups = []
    group = 0
    ref = s.iloc[0]
    for date in s:
        if (date-ref) >= pd.Timedelta(thresh):
            group += 1
            ref = date
        groups.append(group)
    return pd.Series(groups, index=s.index)

out = (df.groupby(['id', df.groupby('id')['date'].transform(grouper)],
                  as_index=False)
         .agg({'date': 'min', 'value': list})
      )