How can I remove contiguous/consecutive/adjacent duplicates in a DataFrame?
I'm manipulating data in CSV format, am sorting by date, then by an identifying number. The identifying number can appear on different days, but I only want to delete the daily duplicates. drop_duplicates leaves one unique instance, but then deletes that identifier on all other days. I've tried this, but get the error:
localhost:~/Desktop/Public$ python3 test.py
Traceback (most recent call last):
File "test.py", line 31, in <module>
df2.loc[df2.shift(1) != df2]
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1028, in __getitem__
return self._getitem_axis(key, axis=0)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1148, in _getitem_axis
raise ValueError('Cannot index with multidimensional key')
ValueError: Cannot index with multidimensional key
I tried index_reset() to remove any multiindex. Here's a sample of the dataset:
,DATE,REC,NAME
0,07/02/2009,682566,"Schmoe, Joe"
1,07/02/2009,244828,"Doe, Joe"
2,07/11/2009,325640,"Black, Joe"
3,07/11/2009,544440,"Dirt, Joe"
4,07/11/2009,544440,"Dirt, Joe"
5,07/16/2009,200560,"White, Joe"
6,07/16/2009,685370,"Purple, Joe"
7,07/16/2009,685370,"Purple, Joe"
8,07/16/2009,635400,"Red, Joe"
9,07/16/2009,348562,"Blue, Joe
The way you index with .loc
will only work if df2
is a Series
not a DataFrame
. You're essentially trying to index with a dataframe of boleens and .loc
doesn't know what to do (it tries to use it as a multiindex):
>>> df
DATE REC NAME
0 2009-07-02 682566 Schmoe, Joe
1 2009-07-02 244828 Doe, Joe
2 2009-07-11 325640 Black, Joe
3 2009-07-11 544440 Dirt, Joe
4 2009-07-11 544440 Dirt, Joe
5 2009-07-16 200560 White, Joe
6 2009-07-16 685370 Purple, Joe
7 2009-07-16 685370 Purple, Joe
8 2009-07-16 635400 Red, Joe
9 2009-07-16 348562 Blue, Joe
>>> df.shift() != df
DATE REC NAME
0 True True True
1 False True True
2 True True True
3 False True True
4 False False False
5 True True True
6 False True True
7 False False False
8 False True True
9 False True True
Instead, you want to do something like the following:
>>> df.loc[df.DATE.shift() != df.DATE]
DATE REC NAME
0 2009-07-02 682566 Schmoe, Joe
2 2009-07-11 325640 Black, Joe
5 2009-07-16 200560 White, Joe
.loc
works here because we just create a boleen Series for index:
>>> df.DATE.shift() != df.DATE
0 True
1 False
2 True
3 False
4 False
5 True
6 False
7 False
8 False
9 False
Of course, that's not the data you want. To be equivalent to df.drop_duplicates(['REC','DATE'])
, you want the following:
>>> df.loc[(df.DATE != df.DATE.shift(1)) | (df.REC != df.REC.shift(1))]
DATE REC NAME
0 2009-07-02 682566 Schmoe, Joe
1 2009-07-02 244828 Doe, Joe
2 2009-07-11 325640 Black, Joe
3 2009-07-11 544440 Dirt, Joe
5 2009-07-16 200560 White, Joe
6 2009-07-16 685370 Purple, Joe
8 2009-07-16 635400 Red, Joe
9 2009-07-16 348562 Blue, Joe
Comparison with drop_duplicates
:
>>> df.drop_duplicates(['REC','DATE'])
DATE REC NAME
0 2009-07-02 682566 Schmoe, Joe
1 2009-07-02 244828 Doe, Joe
2 2009-07-11 325640 Black, Joe
3 2009-07-11 544440 Dirt, Joe
5 2009-07-16 200560 White, Joe
6 2009-07-16 685370 Purple, Joe
8 2009-07-16 635400 Red, Joe
9 2009-07-16 348562 Blue, Joe