Python Pandas, select and drop rows grouped by multiple columns based on condition


Suppose I have a pandas DataFrame with the following columns and data:

    user            time           session         time_diff
0   21.0    2022-12-16 14:03:08        5           NaN
1   21.0    2022-12-16 14:03:10        5           2.0
2   21.0    2022-12-16 14:03:12        6           2.0
3   21.0    2022-12-16 14:03:13        6           1.0
4   21.0    2022-12-28 14:49:54        16          1039601.0
5   30.0    2022-12-16 14:03:16        5           1039598.0
6   30.0    2022-12-16 14:03:18        5           2.0
7   30.0    2022-12-16 14:03:20        6           2.0

I would like to select the rows where, for the same user and session, the time difference (the time_diff column, in seconds) is below some threshold (10 seconds, for example). This would give the following output:

    user            time           session         time_diff
1   21.0    2022-12-16 14:03:10        5           2.0
3   21.0    2022-12-16 14:03:13        6           1.0
6   30.0    2022-12-16 14:03:18        5           2.0

I could probably iterate through each row and select the records where user and session both match those of the preceding row, but I believe this is not the most efficient approach.

df.groupby(['user', 'session']).filter(lambda x: (x.time_diff <= 10).any()) 

also does not produce the expected result.


Solution

    Option 1

    • Group by ["user", "session"] (df.groupby) and apply .diff to column "time" to get the time elapsed since the previous row of the same group.
    • Check whether each value in the resulting Series is < 10 seconds using Series.lt.
    • Finally, use the resulting Series (populated with True & False) for boolean indexing to retrieve the desired subset.

    out = df[df.groupby(["user", "session"])['time'].diff()
             .lt(pd.Timedelta('00:00:10'))]
    out
    
       user                time  session  time_diff
    1  21.0 2022-12-16 14:03:10        5        2.0
    3  21.0 2022-12-16 14:03:13        6        1.0
    6  30.0 2022-12-16 14:03:18        5        2.0
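
    To make the three steps above easier to follow, here is a minimal sketch that keeps each intermediate result in its own variable. The names gaps and mask are only for illustration, and the sample frame is rebuilt inline (without the time_diff column, which this approach does not need) so the block runs on its own.

```python
import pandas as pd

# Sample data copied from the question; time_diff is omitted because it is not used here.
df = pd.DataFrame({
    "user": [21.0, 21.0, 21.0, 21.0, 21.0, 30.0, 30.0, 30.0],
    "time": pd.to_datetime([
        "2022-12-16 14:03:08", "2022-12-16 14:03:10",
        "2022-12-16 14:03:12", "2022-12-16 14:03:13",
        "2022-12-28 14:49:54", "2022-12-16 14:03:16",
        "2022-12-16 14:03:18", "2022-12-16 14:03:20",
    ]),
    "session": [5, 5, 6, 6, 16, 5, 5, 6],
})

# Step 1: time elapsed since the previous row of the same (user, session) group.
# The first row of every group has no predecessor, so its value is NaT.
gaps = df.groupby(["user", "session"])["time"].diff()

# Step 2: compare against the threshold. NaT compares as False,
# so the first row of each group is excluded automatically.
mask = gaps.lt(pd.Timedelta(seconds=10))

# Step 3: boolean indexing keeps only the rows where the mask is True.
out = df[mask]
print(out.index.tolist())  # [1, 3, 6]
```

    Because the diff is taken inside each group, no separate check for "same user and session as the previous row" is needed.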
    

    Option 2

    (Assuming your data is properly sorted on user and session.)

    • Apply Series.diff to column "time" and check < 10 seconds.
    • Next, check whether the user and session values in each row are both equal to (df.eq) the values in the previous row (df.shift). Applying df.all row-wise (axis=1) yields False on the first row of every new group.
    • Finally, apply boolean indexing to select from the df where both conditions are True (using the bitwise operator &).

    out2 = df[df.time.diff().lt(pd.Timedelta('00:00:10')) &
              df[['user','session']].eq(df[['user','session']].shift(1)).all(axis=1)]
    
    out2.equals(out)
    # True
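
    Since Option 2 compares each row only with the row directly above it, the sorting assumption matters. If you are not sure about the order of your data, something along these lines should work: sort on user, session and time first, then apply the same logic. This is only a sketch; the shuffled frame is a made-up scenario to simulate unsorted input, and the names ordered, same_group and close are for illustration.

```python
import pandas as pd

# Sample data copied from the question; time_diff is omitted because it is not used here.
df = pd.DataFrame({
    "user": [21.0, 21.0, 21.0, 21.0, 21.0, 30.0, 30.0, 30.0],
    "time": pd.to_datetime([
        "2022-12-16 14:03:08", "2022-12-16 14:03:10",
        "2022-12-16 14:03:12", "2022-12-16 14:03:13",
        "2022-12-28 14:49:54", "2022-12-16 14:03:16",
        "2022-12-16 14:03:18", "2022-12-16 14:03:20",
    ]),
    "session": [5, 5, 6, 6, 16, 5, 5, 6],
})

# Hypothetical unsorted input: shuffle the rows (the original index labels are kept).
shuffled = df.sample(frac=1, random_state=0)

# Sort so that rows of the same (user, session) are adjacent and in time order.
ordered = shuffled.sort_values(["user", "session", "time"])

# True where user and session both match the previous row.
keys = ordered[["user", "session"]]
same_group = keys.eq(keys.shift(1)).all(axis=1)

# True where the gap to the previous row is below the threshold.
close = ordered["time"].diff().lt(pd.Timedelta(seconds=10))

out2 = ordered[close & same_group]
print(sorted(out2.index))  # [1, 3, 6]
```

    Note that a negative gap (for example where the rows jump from one user to the next) also counts as "less than 10 seconds", which is why the same_group condition cannot be dropped.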
    

    Performance comparison

    On this sample data, Option 1 is the fastest. Both options are faster than the solution offered by @AndrejKesely.

    # opt1: 1.75 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    # opt2: 3.1 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    # AK:   7.02 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
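
    If you want to check the timings on your own data, a rough sketch with the standard timeit module could look like this. The function names opt1 and opt2 and the repeat counts are my own choices for illustration, and the numbers you get will depend on your machine, your pandas version and the size of the frame.

```python
import timeit

import pandas as pd

# Sample data copied from the question; swap in your own frame for a realistic comparison.
df = pd.DataFrame({
    "user": [21.0, 21.0, 21.0, 21.0, 21.0, 30.0, 30.0, 30.0],
    "time": pd.to_datetime([
        "2022-12-16 14:03:08", "2022-12-16 14:03:10",
        "2022-12-16 14:03:12", "2022-12-16 14:03:13",
        "2022-12-28 14:49:54", "2022-12-16 14:03:16",
        "2022-12-16 14:03:18", "2022-12-16 14:03:20",
    ]),
    "session": [5, 5, 6, 6, 16, 5, 5, 6],
})

THRESHOLD = pd.Timedelta(seconds=10)


def opt1(frame):
    """Option 1: diff within each (user, session) group."""
    return frame[frame.groupby(["user", "session"])["time"].diff().lt(THRESHOLD)]


def opt2(frame):
    """Option 2: plain diff plus a check that the previous row is in the same group."""
    keys = frame[["user", "session"]]
    same_group = keys.eq(keys.shift(1)).all(axis=1)
    return frame[frame["time"].diff().lt(THRESHOLD) & same_group]


for name, func in [("opt1", opt1), ("opt2", opt2)]:
    # Best of 5 repeats with 100 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(lambda: func(df), repeat=5, number=100)) / 100
    print(f"{name}: {best * 1e3:.3f} ms per call")
```

    An 8-row frame mostly measures fixed overhead, so it is worth repeating the comparison on a frame of realistic size before drawing conclusions.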
    

    Data used

    import pandas as pd
    import numpy as np
    
    data = {'user': {0: 21.0, 1: 21.0, 2: 21.0, 3: 21.0, 4: 21.0, 5: 30.0, 6: 30.0, 
                     7: 30.0}, 
            'time': {0: '2022-12-16 14:03:08', 1: '2022-12-16 14:03:10', 
                     2: '2022-12-16 14:03:12', 3: '2022-12-16 14:03:13', 
                     4: '2022-12-28 14:49:54', 5: '2022-12-16 14:03:16', 
                     6: '2022-12-16 14:03:18', 7: '2022-12-16 14:03:20'}, 
            'session': {0: 5, 1: 5, 2: 6, 3: 6, 4: 16, 5: 5, 6: 5, 7: 6}, 
            'time_diff': {0: np.nan, 1: 2.0, 2: 2.0, 3: 1.0, 4: 1039601.0, 
                          5: 1039598.0, 6: 2.0, 7: 2.0}}
    
    df = pd.DataFrame(data)
    
    df['time'] = pd.to_datetime(df['time'])