I have this problem, where i have columns: id, date and value. I need to groupby the id and date so that in one group, the date values should not be more than 7 days away from each other.
The data looks like this:
df = pd.DataFrame({'id' : [1, 2, 2, 3, 3, 3, 4], 'date' : ['2019-02-28', '2019-02-27', '2019-02-28', '2019-02-1', '2019-02-2', '2019-03-20', '2019-02-28'], 'value' : ['a', 'b', 'c', 'c', 'a', 'b', 'c']})
And i want to take the date groups earliest date and make a list of the values.
I expect to have something like this:
| id | date | value |
| -------- | ---------- | ---------- |
| 1 | 2019-02-28 | ['a'] |
| 2 | 2019-02-27 | ['b', 'c'] |
| 3 | 2019-02-1 | ['c', 'a'] |
| 3 | 2019-03-20 | ['b'] |
| 4 | 2019-02-28 | ['c'] |
Notice that the id group can have multiple different datetime groups.
I find it hard to sort the datetime values and make this groupby at the same time while checking that the date difference to the row next to it is below 7 days.
You can first use a custom groupby.transform
to identify the gaps of more than 7 days, then groupby.agg
:
df['date'] = pd.to_datetime(df['date'], dayfirst=False)
group = (df.groupby('id')['date']
.transform(lambda s: s.diff().gt('7D').cumsum())
)
out = (df.groupby(['id', group], as_index=False)
.agg({'date': 'min', 'value': list})
)
Output:
id date value
0 1 2019-02-28 [a]
1 2 2019-02-27 [b, c]
2 3 2019-02-01 [c, a]
3 3 2019-03-20 [b]
4 4 2019-02-28 [c]
If you want to ensure that all dates within a group are within 7 days (e.g. taking consecutive days should split after 7 rows), then you need a custom function:
def grouper(s, thresh='7D'):
groups = []
group = 0
ref = s.iloc[0]
for date in s:
if (date-ref) >= pd.Timedelta(thresh):
group += 1
ref = date
groups.append(group)
return pd.Series(groups, index=s.index)
out = (df.groupby(['id', df.groupby('id')['date'].transform(grouper)],
as_index=False)
.agg({'date': 'min', 'value': list})
)