Search code examples
pythonpandasduplicatesdataframecontiguous

pandas DataFrame delete contiguous duplicates


How can I remove contiguous/consecutive/adjacent duplicates in a DataFrame?

I'm manipulating data in CSV format, am sorting by date, then by an identifying number. The identifying number can appear on different days, but I only want to delete the daily duplicates. drop_duplicates leaves one unique instance, but then deletes that identifier on all other days. I've tried this, but get the error:

localhost:~/Desktop/Public$ python3 test.py 
Traceback (most recent call last):
  File "test.py", line 31, in <module>
    df2.loc[df2.shift(1) != df2]
  File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1028, in __getitem__
    return self._getitem_axis(key, axis=0)
  File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1148, in _getitem_axis
    raise ValueError('Cannot index with multidimensional key')
ValueError: Cannot index with multidimensional key

Edited original post to add:

I tried index_reset() to remove any multiindex. Here's a sample of the dataset:

,DATE,REC,NAME
0,07/02/2009,682566,"Schmoe, Joe"
1,07/02/2009,244828,"Doe, Joe"
2,07/11/2009,325640,"Black, Joe"
3,07/11/2009,544440,"Dirt, Joe"
4,07/11/2009,544440,"Dirt, Joe"
5,07/16/2009,200560,"White, Joe"
6,07/16/2009,685370,"Purple, Joe"
7,07/16/2009,685370,"Purple, Joe"
8,07/16/2009,635400,"Red, Joe"
9,07/16/2009,348562,"Blue, Joe

Solution

  • The way you index with .loc will only work if df2 is a Series not a DataFrame. You're essentially trying to index with a dataframe of boleens and .loc doesn't know what to do (it tries to use it as a multiindex):

    >>> df
    
            DATE     REC         NAME
    0 2009-07-02  682566  Schmoe, Joe
    1 2009-07-02  244828     Doe, Joe
    2 2009-07-11  325640   Black, Joe
    3 2009-07-11  544440    Dirt, Joe
    4 2009-07-11  544440    Dirt, Joe
    5 2009-07-16  200560   White, Joe
    6 2009-07-16  685370  Purple, Joe
    7 2009-07-16  685370  Purple, Joe
    8 2009-07-16  635400     Red, Joe
    9 2009-07-16  348562    Blue, Joe
    
    >>> df.shift() != df
    
        DATE    REC   NAME
    0   True   True   True
    1  False   True   True
    2   True   True   True
    3  False   True   True
    4  False  False  False
    5   True   True   True
    6  False   True   True
    7  False  False  False
    8  False   True   True
    9  False   True   True
    

    Instead, you want to do something like the following:

    >>> df.loc[df.DATE.shift() != df.DATE]
    
            DATE     REC         NAME
    0 2009-07-02  682566  Schmoe, Joe
    2 2009-07-11  325640   Black, Joe
    5 2009-07-16  200560   White, Joe
    

    .loc works here because we just create a boleen Series for index:

    >>> df.DATE.shift() != df.DATE
    
    0     True
    1    False
    2     True
    3    False
    4    False
    5     True
    6    False
    7    False
    8    False
    9    False
    

    Of course, that's not the data you want. To be equivalent to df.drop_duplicates(['REC','DATE']), you want the following:

    >>>  df.loc[(df.DATE != df.DATE.shift(1)) | (df.REC != df.REC.shift(1))]
    
            DATE     REC         NAME
    0 2009-07-02  682566  Schmoe, Joe
    1 2009-07-02  244828     Doe, Joe
    2 2009-07-11  325640   Black, Joe
    3 2009-07-11  544440    Dirt, Joe
    5 2009-07-16  200560   White, Joe
    6 2009-07-16  685370  Purple, Joe
    8 2009-07-16  635400     Red, Joe
    9 2009-07-16  348562    Blue, Joe
    

    Comparison with drop_duplicates:

    >>> df.drop_duplicates(['REC','DATE'])
    
            DATE     REC         NAME
    0 2009-07-02  682566  Schmoe, Joe
    1 2009-07-02  244828     Doe, Joe
    2 2009-07-11  325640   Black, Joe
    3 2009-07-11  544440    Dirt, Joe
    5 2009-07-16  200560   White, Joe
    6 2009-07-16  685370  Purple, Joe
    8 2009-07-16  635400     Red, Joe
    9 2009-07-16  348562    Blue, Joe