python, pandas, dataframe, duplicates, data-cleaning

Python drop duplicated pairs only


If I have a dataframe like this:

Time                        X     Y
2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.634508  200   10.1
2023-02-01T15:03:02.634508  200   10.1
2023-02-01T15:03:02.943522  200   10.1
2023-02-01T15:03:02.943522  200   10.1

I would like to remove duplicated PAIRS only. That is, the first and second pairs of .565333 timestamps are actually distinct pairs, but something like t[~t.duplicated()] removes all the duplicates, like this:

2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.634508  200   10.1
2023-02-01T15:03:02.943522  200   10.1

whereas what I want is this:

2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.634508  200   10.1
2023-02-01T15:03:02.943522  200   10.1
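
For reference, a minimal snippet that reproduces the sample data and the naive approach (the name t matches the snippet above):

    import pandas as pd

    # rebuild the sample frame shown above
    t = pd.DataFrame({
        'Time': ['2023-02-01T15:03:02.565333'] * 4
                + ['2023-02-01T15:03:02.634508'] * 2
                + ['2023-02-01T15:03:02.943522'] * 2,
        'X': [200] * 8,
        'Y': [10.1] * 8,
    })

    # naive de-duplication keeps one row per unique combination,
    # collapsing the eight rows to three instead of the desired four
    print(t[~t.duplicated()])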

Solution

  • First label the rows using groupby.cumcount combined with floor division (floordiv) by the number of rows per group; you can then drop_duplicates on the original columns plus that label, keeping the first row of each pair:

    N = 2                      # number of rows per group (pairs)
    cols = ['Time', 'X', 'Y']  # columns used to identify duplicates

    (df.assign(n=df.groupby(cols).cumcount().floordiv(N))
       .drop_duplicates(subset=cols + ['n'])
    )
    

    NB: you can use any value of N for larger group sizes, e.g. N=3 to work with triplets of rows. cols defines the columns used to identify duplicates; I assumed you want to use all of them, but a subset works too if needed. A self-contained, runnable sketch of the whole approach is given at the end of this answer.

    Output:

                             Time    X     Y  n
    0  2023-02-01T15:03:02.565333  200  10.1  0
    2  2023-02-01T15:03:02.565333  200  10.1  1  # second pair
    4  2023-02-01T15:03:02.634508  200  10.1  0
    6  2023-02-01T15:03:02.943522  200  10.1  0
    

    Intermediate result, before drop_duplicates:

                             Time    X     Y  n
    0  2023-02-01T15:03:02.565333  200  10.1  0
    1  2023-02-01T15:03:02.565333  200  10.1  0  # duplicated
    2  2023-02-01T15:03:02.565333  200  10.1  1
    3  2023-02-01T15:03:02.565333  200  10.1  1  # duplicated
    4  2023-02-01T15:03:02.634508  200  10.1  0
    5  2023-02-01T15:03:02.634508  200  10.1  0  # duplicated
    6  2023-02-01T15:03:02.943522  200  10.1  0
    7  2023-02-01T15:03:02.943522  200  10.1  0  # duplicated
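
    For completeness, a self-contained sketch of the full approach, assuming the frame is named df and built from the sample data in the question; the helper column n can be dropped once the pairs are de-duplicated:

    import pandas as pd

    # rebuild the sample data from the question
    df = pd.DataFrame({
        'Time': ['2023-02-01T15:03:02.565333'] * 4
                + ['2023-02-01T15:03:02.634508'] * 2
                + ['2023-02-01T15:03:02.943522'] * 2,
        'X': [200] * 8,
        'Y': [10.1] * 8,
    })

    N = 2                      # rows per group: pairs
    cols = ['Time', 'X', 'Y']  # columns identifying duplicates

    out = (df.assign(n=df.groupby(cols).cumcount().floordiv(N))
             .drop_duplicates(subset=cols + ['n'])
             .drop(columns='n'))  # remove the helper counter afterwards
    print(out)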