Time difference between first and last row in group in pandas

Input

Date                  event   
2023-04-11 13:42:16   play   
2023-04-11 14:02:26   play 
2023-04-11 14:36:09   play 
2023-04-11 14:37:46   start
2023-04-11 14:41:34   start
2023-04-11 14:46:27   start
2023-04-11 14:47:03   start

Expecting this in pandas dataframe. Group by event order by Date and difference between the first and last time diff.

output

date          event   diff
2023-04-11    play    00:53:52 
2023-04-11    start   00:09:17

Solution

Use a custom groupby.agg:

df['Date'] = pd.to_datetime(df['Date'])

out = (
 df.groupby([df['Date'].dt.normalize(), 'event'])
   .agg(diff=('Date', lambda g: g.iloc[-1]-g.iloc[0]))
   .reset_index()
)

Or reuse the grouper:

g = df.groupby([df['Date'].dt.normalize(), 'event'])['Date']

out = g.last().sub(g.first()).reset_index(name='diff')

Output:

        Date  event            diff
0 2023-04-11   play 0 days 00:53:53
1 2023-04-11  start 0 days 00:09:17

Non-sorted input

If the data is not initially sorted, use min/max as aggregation:

out = (
 df.groupby([df['Date'].dt.normalize(), 'event'])
   .agg(diff=('Date', lambda g: g.max()-g.min()))
   .reset_index()
)

Or:

g = df.groupby([df['Date'].dt.normalize(), 'event'])['Date']
out = g.max().sub(g.min()).reset_index(name='diff')

This is semantically identical to using np.ptp, but for some reason (and unfortunately), much more efficient. Here a timing on 700k rows:

# max - min 
99.7 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# np.ptp
355 ms ± 43.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)