If I have a dataframe like this:
Time X Y
2023-02-01T15:03:02.565333 200 10.1
2023-02-01T15:03:02.565333 200 10.1
2023-02-01T15:03:02.565333 200 10.1
2023-02-01T15:03:02.565333 200 10.1
2023-02-01T15:03:02.634508 200 10.1
2023-02-01T15:03:02.634508 200 10.1
2023-02-01T15:03:02.943522 200 10.1
2023-02-01T15:03:02.943522 200 10.1
I would like to remove duplicated PAIRS only, i.e. the first and second pairs of .565333 timestamps are actually unique. But doing something like t[~t.duplicated()] will remove all the duplicates, like this:
2023-02-01T15:03:02.565333 200 10.1
2023-02-01T15:03:02.634508 200 10.1
2023-02-01T15:03:02.943522 200 10.1
whereas instead I want this:
2023-02-01T15:03:02.565333 200 10.1
2023-02-01T15:03:02.565333 200 10.1
2023-02-01T15:03:02.634508 200 10.1
2023-02-01T15:03:02.943522 200 10.1
First number the successive occurrences within each group of identical rows using groupby.cumcount combined with floor division (floordiv) by the number of rows per group, so that each pair gets its own label. Then you will be able to drop_duplicates while keeping the first row of each pair:
N = 2                      # size of the groups to collapse (pairs)
cols = ['Time', 'X', 'Y']  # columns used to identify the duplicates

(df.assign(n=df.groupby(cols).cumcount().floordiv(N))
   .drop_duplicates(subset=cols+['n'])
)
NB. you can use any N value to work with larger group sizes, for example N=3 for triplets of rows. Also, cols defines the columns used to identify the duplicates. I assumed you want to use all columns, but you can use only a subset of them if needed.
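For example, a sketch with N=3, reusing df and cols from above:

# With N=3, cumcount gives 0,1,2,3,4,5,... within each group and
# floordiv(3) maps that to 0,0,0,1,1,1,..., so every run of three
# identical rows collapses to a single representative.
N = 3
(df.assign(n=df.groupby(cols).cumcount().floordiv(N))
   .drop_duplicates(subset=cols+['n'])
)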
Output:
Time X Y n
0 2023-02-01T15:03:02.565333 200 10.1 0
2 2023-02-01T15:03:02.565333 200 10.1 1 # second pair
4 2023-02-01T15:03:02.634508 200 10.1 0
6 2023-02-01T15:03:02.943522 200 10.1 0
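If you don't want the helper column n in the result, a small variation is to drop it at the end:

(df.assign(n=df.groupby(cols).cumcount().floordiv(N))
   .drop_duplicates(subset=cols+['n'])
   .drop(columns='n')  # remove the helper column from the output
)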
Intermediate before drop_duplicates:
Time X Y n
0 2023-02-01T15:03:02.565333 200 10.1 0
1 2023-02-01T15:03:02.565333 200 10.1 0 # duplicated
2 2023-02-01T15:03:02.565333 200 10.1 1
3 2023-02-01T15:03:02.565333 200 10.1 1 # duplicated
4 2023-02-01T15:03:02.634508 200 10.1 0
5 2023-02-01T15:03:02.634508 200 10.1 0 # duplicated
6 2023-02-01T15:03:02.943522 200 10.1 0
7 2023-02-01T15:03:02.943522 200 10.1 0 # duplicated
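To reproduce this intermediate yourself and see which rows get flagged, a quick sketch:

tmp = df.assign(n=df.groupby(cols).cumcount().floordiv(N))
print(tmp)
# Rows that share identical values in cols AND the same n are the
# ones marked "# duplicated" above:
print(tmp.duplicated(subset=cols+['n']))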