Input
Date event
2023-04-11 13:42:16 play
2023-04-11 14:02:26 play
2023-04-11 14:36:09 play
2023-04-11 14:37:46 start
2023-04-11 14:41:34 start
2023-04-11 14:46:27 start
2023-04-11 14:47:03 start
Expecting this in pandas dataframe. Group by event order by Date and difference between the first and last time diff.
output
date event diff
2023-04-11 play 00:53:52
2023-04-11 start 00:09:17
Use a custom groupby.agg
:
df['Date'] = pd.to_datetime(df['Date'])
out = (
df.groupby([df['Date'].dt.normalize(), 'event'])
.agg(diff=('Date', lambda g: g.iloc[-1]-g.iloc[0]))
.reset_index()
)
Or reuse the grouper:
g = df.groupby([df['Date'].dt.normalize(), 'event'])['Date']
out = g.last().sub(g.first()).reset_index(name='diff')
Output:
Date event diff
0 2023-04-11 play 0 days 00:53:53
1 2023-04-11 start 0 days 00:09:17
If the data is not initially sorted, use min
/max
as aggregation:
out = (
df.groupby([df['Date'].dt.normalize(), 'event'])
.agg(diff=('Date', lambda g: g.max()-g.min()))
.reset_index()
)
Or:
g = df.groupby([df['Date'].dt.normalize(), 'event'])['Date']
out = g.max().sub(g.min()).reset_index(name='diff')
This is semantically identical to using np.ptp
, but for some reason (and unfortunately), much more efficient. Here a timing on 700k rows:
# max - min
99.7 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# np.ptp
355 ms ± 43.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)