Suppose I have a pandas DataFrame with the following columns and data:
user time session time_diff
0 21.0 2022-12-16 14:03:08 5 NaN
1 21.0 2022-12-16 14:03:10 5 2.0
2 21.0 2022-12-16 14:03:12 6 2.0
3 21.0 2022-12-16 14:03:13 6 1.0
4 21.0 2022-12-28 14:49:54 16 1039601.0
5 30.0 2022-12-16 14:03:16 5 1039598.0
6 30.0 2022-12-16 14:03:18 5 2.0
7 30.0 2022-12-16 14:03:20 6 2.0
I would like to select those rows where for the same user and session the time difference (time_diff column in seconds) is less than some threshold (10 seconds, for example). Which would result in the following output:
user time session time_diff
1 21.0 2022-12-16 14:03:10 5 2.0
3 21.0 2022-12-16 14:03:13 6 1.0
6 30.0 2022-12-16 14:03:18 5 2.0
I could probably iterate through each row and select records where id = id of the preceding row and session = session of the preceding row but I believe this is not the most optimal approach.
df.groupby(['user', 'session']).filter(lambda x: (x.time_diff <= 10).any())
also does not produce the expected result.
Option 1
["user", "session"]
(df.groupby
) and check .diff
for column "time"
.Series
check < 10 seconds using Series.lt
.Series
(populated with True
& False
) for boolean indexing
to retrieve the desired subset.out = df[df.groupby(["user", "session"])['time'].diff()
.lt(pd.Timedelta('00:00:10'))]
out
user time session time_diff
1 21.0 2022-12-16 14:03:10 5 2.0
3 21.0 2022-12-16 14:03:13 6 1.0
6 30.0 2022-12-16 14:03:18 5 2.0
Option 2
(Assuming your data is properly sorted on user
and session
.)
Series.diff
to column "time"
and check < 10 seconds.user
and session
are both equal to (df.eq
) the values in the previous row (df.shift
). Use df.all
row-wise to get False
for all shifts to a new group.df
where both conditions are True
(using the bitwise operator &
).out2 = df[df.time.diff().lt(pd.Timedelta('00:00:10')) &
df[['user','session']].eq(df[['user','session']].shift(1)).all(axis=1)]
out2.equals(out)
# True
Performance comparison
Option 1 will be fastest. Both will be faster than the solution offered by @AndrejKesely.
# opt1: 1.75 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# opt2: 3.1 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# AK: 7.02 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Data used
import pandas as pd
import numpy as np
data = {'user': {0: 21.0, 1: 21.0, 2: 21.0, 3: 21.0, 4: 21.0, 5: 30.0, 6: 30.0,
7: 30.0},
'time': {0: '2022-12-16 14:03:08', 1: '2022-12-16 14:03:10',
2: '2022-12-16 14:03:12', 3: '2022-12-16 14:03:13',
4: '2022-12-28 14:49:54', 5: '2022-12-16 14:03:16',
6: '2022-12-16 14:03:18', 7: '2022-12-16 14:03:20'},
'session': {0: 5, 1: 5, 2: 6, 3: 6, 4: 16, 5: 5, 6: 5, 7: 6},
'time_diff': {0: np.nan, 1: 2.0, 2: 2.0, 3: 1.0, 4: 1039601.0, 5: 2,
6: 2.0, 7: 2.0}}
df = pd.DataFrame(data)
df['time'] = pd.to_datetime(df['time'])