Python Pandas, select and drop rows grouped by multiple columns based on condition

Suppose I have a pandas DataFrame with the following columns and data:

    user            time           session         time_diff
0   21.0    2022-12-16 14:03:08        5           NaN
1   21.0    2022-12-16 14:03:10        5           2.0
2   21.0    2022-12-16 14:03:12        6           2.0
3   21.0    2022-12-16 14:03:13        6           1.0
4   21.0    2022-12-28 14:49:54        16          1039601.0
5   30.0    2022-12-16 14:03:16        5           1039598.0
6   30.0    2022-12-16 14:03:18        5           2.0
7   30.0    2022-12-16 14:03:20        6           2.0

I would like to select those rows where for the same user and session the time difference (time_diff column in seconds) is less than some threshold (10 seconds, for example). Which would result in the following output:

    user            time           session         time_diff
1   21.0    2022-12-16 14:03:10        5           2.0
3   21.0    2022-12-16 14:03:13        6           1.0
6   30.0    2022-12-16 14:03:18        5           2.0

I could probably iterate through each row and select records where id = id of the preceding row and session = session of the preceding row but I believe this is not the most optimal approach.

df.groupby(['user', 'session']).filter(lambda x: (x.time_diff <= 10).any())

also does not produce the expected result.

Solution

Option 1

Group by ["user", "session"] (df.groupby) and check .diff for column "time".
For the resulting Series check < 10 seconds using Series.lt.
Finally, use the resulting Series (populated with True & False) for boolean indexing to retrieve the desired subset.

out = df[df.groupby(["user", "session"])['time'].diff()
         .lt(pd.Timedelta('00:00:10'))]
out

   user                time  session  time_diff
1  21.0 2022-12-16 14:03:10        5        2.0
3  21.0 2022-12-16 14:03:13        6        1.0
6  30.0 2022-12-16 14:03:18        5        2.0

Option 2

(Assuming your data is properly sorted on user and session.)

Apply Series.diff to column "time" and check < 10 seconds.
Now, also check whether row values for user and session are both equal to (df.eq) the values in the previous row (df.shift). Use df.all row-wise to get False for all shifts to a new group.
Finally, apply boolean indexing to select from the df where both conditions are True (using the bitwise operator &).

out2 = df[df.time.diff().lt(pd.Timedelta('00:00:10')) & 
          df[['user','session']].eq(df[['user','session']].shift(1)).all(axis=1)]

out2.equals(out)
# True

Performance comparison

Option 1 will be fastest. Both will be faster than the solution offered by @AndrejKesely.

# opt1: 1.75 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# opt2: 3.1 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# AK:   7.02 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Data used

import pandas as pd
import numpy as np

data = {'user': {0: 21.0, 1: 21.0, 2: 21.0, 3: 21.0, 4: 21.0, 5: 30.0, 6: 30.0, 
                 7: 30.0}, 
        'time': {0: '2022-12-16 14:03:08', 1: '2022-12-16 14:03:10', 
                 2: '2022-12-16 14:03:12', 3: '2022-12-16 14:03:13', 
                 4: '2022-12-28 14:49:54', 5: '2022-12-16 14:03:16', 
                 6: '2022-12-16 14:03:18', 7: '2022-12-16 14:03:20'}, 
        'session': {0: 5, 1: 5, 2: 6, 3: 6, 4: 16, 5: 5, 6: 5, 7: 6}, 
        'time_diff': {0: np.nan, 1: 2.0, 2: 2.0, 3: 1.0, 4: 1039601.0, 5: 2, 
                      6: 2.0, 7: 2.0}}

df = pd.DataFrame(data)

df['time'] = pd.to_datetime(df['time'])