Tags: pandas, pandas-groupby, drop-duplicates

Drop almost-duplicate rows based on timestamp


I'm trying to remove rows that are almost duplicates. I'm looking for a way to detect the closest trips (by edited_at) made by the same user, without losing information.

So I want to solve this problem by calculating the difference between successive timestamps and removing the group with the minimum difference (zero in this example; see Step-1).

I'm open to other suggestions.

Note:

Don't use the drop_duplicates() function.

Dataframe:

import pandas as pd

data = [[111, 121, "2019-10-22 05:00:00", 0],
        [111, 121, "2019-10-22 05:00:00", 1],
        [111, 123, "2019-10-22 07:10:00", 0], 
        [111, 123, "2019-10-22 07:10:00", 1], 
        [111, 123, "2019-10-22 07:10:00", 2],
        [111, 124, "2019-10-22 07:20:00", 0],
        [111, 124, "2019-10-22 07:20:00", 1],
        [111, 124, "2019-10-22 07:20:00", 2],
        [111, 124, "2019-10-22 07:20:00", 3],
        [111, 125, "2019-10-22 19:20:00", 0], 
        [111, 125, "2019-10-22 19:20:00", 1],
        [222, 223, "2019-11-24 06:00:00", 0], 
        [222, 223, "2019-11-24 06:00:00", 1], 
        [222, 244, "2019-11-24 06:15:00", 0],
        [222, 244, "2019-11-24 06:15:00", 1],
        [222, 255, "2019-11-24 18:15:10", 0],
        [222, 255, "2019-11-24 18:15:10", 1]]
df = pd.DataFrame(data, columns = ["user_id", "prompt_uuid", "edited_at", "prompt_num"]) 

df['edited_at'] = pd.to_datetime(df['edited_at'])

Step-1:

111, 121, "2019-10-22 05:00:00", 0, something,
111, 121, "2019-10-22 05:00:00", 1, something,
111, 123, "2019-10-22 07:10:00", 0, 140,
111, 123, "2019-10-22 07:10:00", 1, 140,
111, 123, "2019-10-22 07:10:00", 2, 140,
111, 124, "2019-10-22 07:20:00", 0,  10,
111, 124, "2019-10-22 07:20:00", 1,  10,
111, 124, "2019-10-22 07:20:00", 2,  10,
111, 124, "2019-10-22 07:20:00", 3,  10,
111, 125, "2019-10-22 19:20:00", 0, 720, 
111, 125, "2019-10-22 19:20:00", 1, 720,
222, 223, "2019-11-24 06:00:00", 0,   0, 
222, 223, "2019-11-24 06:00:00", 1,   0, 
222, 244, "2019-11-24 06:15:00", 0,  15,
222, 244, "2019-11-24 06:15:00", 1,  15,
222, 255, "2019-11-24 18:15:10", 0, 720,
222, 255, "2019-11-24 18:15:10", 1, 720
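
Roughly, this is the kind of intermediate column I have in mind for Step-1 (the numbers above are approximate, and time_diff / df_step1 are just placeholder names):

# one timestamp per (user_id, prompt_uuid) trip, without drop_duplicates
trips = df.groupby(['user_id', 'prompt_uuid'], as_index=False)['edited_at'].first()

# absolute time difference to the previous trip of the same user
trips['time_diff'] = trips.groupby('user_id')['edited_at'].diff().abs()

# broadcast the per-trip difference back onto every original row
df_step1 = df.merge(trips[['user_id', 'prompt_uuid', 'time_diff']],
                    on=['user_id', 'prompt_uuid'], how='left')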

Step-2:

111, 121, "2019-10-22 05:00:00", 0,  something,
111, 121, "2019-10-22 05:00:00", 1,  something,
111, 124, "2019-10-22 07:20:00", 0,  10,
111, 124, "2019-10-22 07:20:00", 1,  10,
111, 124, "2019-10-22 07:20:00", 2,  10,
111, 124, "2019-10-22 07:20:00", 3,  10,
111, 125, "2019-10-22 19:20:00", 0, 720, 
111, 125, "2019-10-22 19:20:00", 1, 720,
222, 244, "2019-11-24 06:15:00", 0,  15,
222, 244, "2019-11-24 06:15:00", 1,  15,
222, 255, "2019-11-24 18:15:10", 0, 720,
222, 255, "2019-11-24 18:15:10", 1, 720

Solution

  • Because your DataFrame is duplicated with respect to ['user_id', 'prompt_uuid'], taking a simple diff does not give the time difference between successive groups. First drop_duplicates, then calculate the time difference within each 'user_id'. You can then filter this to find the smallest time difference for each user:

    s = df.drop_duplicates(['user_id', 'prompt_uuid']).copy()
    s['time_diff'] = s.groupby('user_id')['edited_at'].diff(-1).abs()
    s = s[s['time_diff'] == s.groupby('user_id')['time_diff'].transform('min')]
    
    #    user_id  prompt_uuid           edited_at  prompt_num time_diff
    #2       111          123 2019-10-22 07:10:00           0  00:10:00
    #11      222          223 2019-11-24 06:00:00           0  00:15:00
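
    As an aside, if you would rather keep exactly one row per user even when two gaps tie for the minimum, the same selection could be written with idxmin instead of the transform('min') comparison; this is just a stylistic alternative:

    # keep the single row holding each user's smallest time difference (first one wins on ties)
    s = s.loc[s.groupby('user_id')['time_diff'].idxmin()]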
    

    Now, if you want to further subset this to the rows where the time difference falls within some small threshold (i.e. you're fine keeping a group whose minimum time difference is, say, 4 hours):

    # Doesn't alter `s` in this example as both min_diffs are < 1 hour.
    min_time = '1 hour'
    s = s[s['time_diff'].le(min_time)]
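
    If you would rather not rely on the string '1 hour' being parsed during the comparison, an explicit Timedelta should behave the same way:

    s = s[s['time_diff'] <= pd.Timedelta(hours=1)]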
    

    Now s represents the unique ['user_id', 'prompt_uuid'] groups that you want to remove from the DataFrame. We accomplish this with an outer excluding merge (an anti-join), using the merge indicator:

    keys = ['user_id', 'prompt_uuid']
    df = (df.merge(s[keys], on=keys, how='outer', indicator=True)
            .query('_merge == "left_only"')
            .drop(columns='_merge'))
    

        user_id  prompt_uuid           edited_at  prompt_num
    0       111          121 2019-10-22 05:00:00           0
    1       111          121 2019-10-22 05:00:00           1
    5       111          124 2019-10-22 07:20:00           0
    6       111          124 2019-10-22 07:20:00           1
    7       111          124 2019-10-22 07:20:00           2
    8       111          124 2019-10-22 07:20:00           3
    9       111          125 2019-10-22 19:20:00           0
    10      111          125 2019-10-22 19:20:00           1
    13      222          244 2019-11-24 06:15:00           0
    14      222          244 2019-11-24 06:15:00           1
    15      222          255 2019-11-24 18:15:10           0
    16      222          255 2019-11-24 18:15:10           1
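
    If you want to avoid drop_duplicates() entirely (as the note in the question asks), the same first step can be sketched with a plain groupby aggregation; everything after it stays the same. This is an untested variation that starts again from the original df, and trips / to_drop are only placeholder names:

    keys = ['user_id', 'prompt_uuid']

    # one timestamp per trip, without drop_duplicates
    trips = df.groupby(keys, as_index=False)['edited_at'].first()
    trips['time_diff'] = trips.groupby('user_id')['edited_at'].diff(-1).abs()

    # the trips to remove: each user's smallest gap (apply the time threshold here as well, if needed)
    to_drop = trips[trips['time_diff'] == trips.groupby('user_id')['time_diff'].transform('min')]

    # the same outer excluding merge as above
    df = (df.merge(to_drop[keys], on=keys, how='outer', indicator=True)
            .query('_merge == "left_only"')
            .drop(columns='_merge'))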