python, pandas, dataframe, duplicates, data-cleaning

Python drop duplicated pairs only


If I have a dataframe like this:

Time                        X     Y
2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.634508  200   10.1
2023-02-01T15:03:02.634508  200   10.1
2023-02-01T15:03:02.943522  200   10.1
2023-02-01T15:03:02.943522  200   10.1

I would like to remove duplicated PAIRS only. That is, the first and second pairs of .565333 timestamps are actually distinct pairs, but something like t[~t.duplicated()] removes all the duplicates, like this:

2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.634508  200   10.1
2023-02-01T15:03:02.943522  200   10.1

whereas what I want is this:

2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.634508  200   10.1
2023-02-01T15:03:02.943522  200   10.1
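
For reference, a minimal snippet that reproduces the sample data and the naive approach (the name t matches the snippet above):

    import pandas as pd

    # rebuild the sample frame shown above
    t = pd.DataFrame({
        'Time': ['2023-02-01T15:03:02.565333'] * 4
                + ['2023-02-01T15:03:02.634508'] * 2
                + ['2023-02-01T15:03:02.943522'] * 2,
        'X': [200] * 8,
        'Y': [10.1] * 8,
    })

    # naive de-duplication keeps one row per unique combination,
    # collapsing the eight rows to three instead of the desired four
    print(t[~t.duplicated()])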

Solution

  • First label the rows using groupby.cumcount combined with floor division (floordiv) by the number of rows per group; you can then drop_duplicates on the original columns plus that label, keeping the first row of each pair:

    N = 2                      # number of rows per group (pairs)
    cols = ['Time', 'X', 'Y']  # columns used to identify duplicates

    (df.assign(n=df.groupby(cols).cumcount().floordiv(N))
       .drop_duplicates(subset=cols + ['n'])
    )
    

    NB: you can use any value of N for larger group sizes, e.g. N=3 to work with triplets of rows. cols defines the columns used to identify duplicates; I assumed you want to use all of them, but a subset works too if needed. A self-contained, runnable sketch of the whole approach is given at the end of this answer.

    Output:

                             Time    X     Y  n
    0  2023-02-01T15:03:02.565333  200  10.1  0
    2  2023-02-01T15:03:02.565333  200  10.1  1  # second pair
    4  2023-02-01T15:03:02.634508  200  10.1  0
    6  2023-02-01T15:03:02.943522  200  10.1  0
    

    Intermediate result, before drop_duplicates:

                             Time    X     Y  n
    0  2023-02-01T15:03:02.565333  200  10.1  0
    1  2023-02-01T15:03:02.565333  200  10.1  0  # duplicated
    2  2023-02-01T15:03:02.565333  200  10.1  1
    3  2023-02-01T15:03:02.565333  200  10.1  1  # duplicated
    4  2023-02-01T15:03:02.634508  200  10.1  0
    5  2023-02-01T15:03:02.634508  200  10.1  0  # duplicated
    6  2023-02-01T15:03:02.943522  200  10.1  0
    7  2023-02-01T15:03:02.943522  200  10.1  0  # duplicated
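
    For completeness, a self-contained sketch of the full approach, assuming the frame is named df and built from the sample data in the question; the helper column n can be dropped once the pairs are de-duplicated:

    import pandas as pd

    # rebuild the sample data from the question
    df = pd.DataFrame({
        'Time': ['2023-02-01T15:03:02.565333'] * 4
                + ['2023-02-01T15:03:02.634508'] * 2
                + ['2023-02-01T15:03:02.943522'] * 2,
        'X': [200] * 8,
        'Y': [10.1] * 8,
    })

    N = 2                      # rows per group: pairs
    cols = ['Time', 'X', 'Y']  # columns identifying duplicates

    out = (df.assign(n=df.groupby(cols).cumcount().floordiv(N))
             .drop_duplicates(subset=cols + ['n'])
             .drop(columns='n'))  # remove the helper counter afterwards
    print(out)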