Search code examples
pythonpython-3.xpandasgroup-by

Time difference between first and last row in group in pandas


Input

Date                  event   
2023-04-11 13:42:16   play   
2023-04-11 14:02:26   play 
2023-04-11 14:36:09   play 
2023-04-11 14:37:46   start
2023-04-11 14:41:34   start
2023-04-11 14:46:27   start
2023-04-11 14:47:03   start

Expecting this in pandas dataframe. Group by event order by Date and difference between the first and last time diff.

output

date          event   diff
2023-04-11    play    00:53:52 
2023-04-11    start   00:09:17

Solution

  • Use a custom groupby.agg:

    df['Date'] = pd.to_datetime(df['Date'])
    
    out = (
     df.groupby([df['Date'].dt.normalize(), 'event'])
       .agg(diff=('Date', lambda g: g.iloc[-1]-g.iloc[0]))
       .reset_index()
    )
    

    Or reuse the grouper:

    g = df.groupby([df['Date'].dt.normalize(), 'event'])['Date']
    
    out = g.last().sub(g.first()).reset_index(name='diff')
    

    Output:

            Date  event            diff
    0 2023-04-11   play 0 days 00:53:53
    1 2023-04-11  start 0 days 00:09:17
    

    Non-sorted input

    If the data is not initially sorted, use min/max as aggregation:

    out = (
     df.groupby([df['Date'].dt.normalize(), 'event'])
       .agg(diff=('Date', lambda g: g.max()-g.min()))
       .reset_index()
    )
    

    Or:

    g = df.groupby([df['Date'].dt.normalize(), 'event'])['Date']
    out = g.max().sub(g.min()).reset_index(name='diff')
    

    This is semantically identical to using np.ptp, but for some reason (and unfortunately), much more efficient. Here a timing on 700k rows:

    # max - min 
    99.7 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    # np.ptp
    355 ms ± 43.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)