I'm trying to remove some near-duplicate data. I'm looking for a way to detect the trips made by a user that are closest in time (edited_at) without losing information.
My idea is to calculate the difference between successive timestamps and remove the group with the minimum difference (zero in this example: Step-1); a rough sketch of what I mean follows the Step-2 example below. I'm open to other suggestions.
Note: don't use the drop_duplicates() function.
Dataframe:
import pandas as pd

data = [[111, 121, "2019-10-22 05:00:00", 0],
[111, 121, "2019-10-22 05:00:00", 1],
[111, 123, "2019-10-22 07:10:00", 0],
[111, 123, "2019-10-22 07:10:00", 1],
[111, 123, "2019-10-22 07:10:00", 2],
[111, 124, "2019-10-22 07:20:00", 0],
[111, 124, "2019-10-22 07:20:00", 1],
[111, 124, "2019-10-22 07:20:00", 2],
[111, 124, "2019-10-22 07:20:00", 3],
[111, 125, "2019-10-22 19:20:00", 0],
[111, 125, "2019-10-22 19:20:00", 1],
[222, 223, "2019-11-24 06:00:00", 0],
[222, 223, "2019-11-24 06:00:00", 1],
[222, 244, "2019-11-24 06:15:00", 0],
[222, 244, "2019-11-24 06:15:00", 1],
[222, 255, "2019-11-24 18:15:10", 0],
[222, 255, "2019-11-24 18:15:10", 1]]
df = pd.DataFrame(data, columns = ["user_id", "prompt_uuid", "edited_at", "prompt_num"])
df['edited_at'] = pd.to_datetime(df['edited_at'])
Step-1:
111, 121, "2019-10-22 05:00:00", 0, somthing,
111, 121, "2019-10-22 05:00:00", 1, somthing,
111, 123, "2019-10-22 07:10:00", 0, 140,
111, 123, "2019-10-22 07:10:00", 1, 140,
111, 123, "2019-10-22 07:10:00", 2, 140,
111, 124, "2019-10-22 07:20:00", 0, 10,
111, 124, "2019-10-22 07:20:00", 1, 10,
111, 124, "2019-10-22 07:20:00", 2, 10,
111, 124, "2019-10-22 07:20:00", 3, 10,
111, 125, "2019-10-22 19:20:00", 0, 720,
111, 125, "2019-10-22 19:20:00", 1, 720,
222, 223, "2019-11-24 06:00:00", 0, 0,
222, 223, "2019-11-24 06:00:00", 1, 0,
222, 244, "2019-11-24 06:15:00", 0, 15,
222, 244, "2019-11-24 06:15:00", 1, 15,
222, 255, "2019-11-24 18:15:10", 0, 720,
222, 255, "2019-11-24 18:15:10", 1, 720
Step-2:
111, 121, "2019-10-22 05:00:00", 0, somthing,
111, 121, "2019-10-22 05:00:00", 1, somthing,
111, 124, "2019-10-22 07:20:00", 0, 10,
111, 124, "2019-10-22 07:20:00", 1, 10,
111, 124, "2019-10-22 07:20:00", 2, 10,
111, 124, "2019-10-22 07:20:00", 3, 10,
111, 125, "2019-10-22 19:20:00", 0, 720,
111, 125, "2019-10-22 19:20:00", 1, 720,
222, 244, "2019-11-24 06:15:00", 0, 15,
222, 244, "2019-11-24 06:15:00", 1, 15,
222, 255, "2019-11-24 18:15:10", 0, 720,
222, 255, "2019-11-24 18:15:10", 1, 720
Because your DataFrame is duplicated with respect to ['user_id', 'prompt_uuid'], taking a simple diff does not give the time difference between successive groups. First drop_duplicates, then calculate the time difference within each 'user_id'. You can then filter this to find the smallest time difference for each user:
# One row per ('user_id', 'prompt_uuid') group
s = df.drop_duplicates(['user_id', 'prompt_uuid']).copy()
# Time difference to the next group within each user
s['time_diff'] = s.groupby('user_id')['edited_at'].diff(-1).abs()
# Keep only the group(s) with the smallest time difference per user
s = s[s['time_diff'] == s.groupby('user_id')['time_diff'].transform('min')]
# user_id prompt_uuid edited_at prompt_num time_diff
#2 111 123 2019-10-22 07:10:00 0 00:10:00
#11 222 223 2019-11-24 06:00:00 0 00:15:00
Now, if you want to further subset this to the rows where the time difference is within some small threshold (i.e. you're fine keeping a group whose minimum time difference is, say, 4 hours, since that's hardly a duplicate):
# Doesn't alter `s` in this example, as both minimum diffs are < 1 hour.
min_time = pd.Timedelta('1 hour')
s = s[s['time_diff'].le(min_time)]
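Any string that pd.Timedelta can parse ('30 min', '12 hours', ...) works as the threshold here.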
Now s represents the unique ['user_id', 'prompt_uuid'] groups that you want to remove from the DataFrame. We accomplish this with an outer excluding merge (an anti-join), using the indicator argument:
keys = ['user_id', 'prompt_uuid']
# Outer merge with indicator, then keep only rows that did not match a group in `s`
df = (df.merge(s[keys], on=keys, how='outer', indicator=True)
      .query('_merge == "left_only"')
      .drop(columns='_merge'))
user_id prompt_uuid edited_at prompt_num
0 111 121 2019-10-22 05:00:00 0
1 111 121 2019-10-22 05:00:00 1
5 111 124 2019-10-22 07:20:00 0
6 111 124 2019-10-22 07:20:00 1
7 111 124 2019-10-22 07:20:00 2
8 111 124 2019-10-22 07:20:00 3
9 111 125 2019-10-22 19:20:00 0
10 111 125 2019-10-22 19:20:00 1
13 222 244 2019-11-24 06:15:00 0
14 222 244 2019-11-24 06:15:00 1
15 222 255 2019-11-24 18:15:10 0
16 222 255 2019-11-24 18:15:10 1
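If you'd rather avoid the merge/indicator machinery, the same anti-join can be expressed as a boolean mask on the key columns (a sketch, applied to the original unfiltered df; the result should match the output above):
keys = ['user_id', 'prompt_uuid']
# True for rows whose (user_id, prompt_uuid) pair appears in `s`
bad = df.set_index(keys).index.isin(s.set_index(keys).index)
result = df[~bad]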