
Removing duplicates Pandas without drop_duplicates


Please be informed that I already looked through various posts before turning to you. In fact, I tried to implement the solution provided in: dropping rows from dataframe based on a "not in" condition

My problem is the following. Let's assume I have a huge dataframe from which I want to remove duplicates. I'm well aware I could use drop_duplicates, since it is the fastest and simplest approach. However, our teacher wants us to create a list containing the IDs of the duplicates and then remove the rows based on whether their values are contained in that list.

import pandas as pd

# My list
list1 = ['s1' , 's2']
print(len(list1))
#My dataframe
data1 = pd.DataFrame(data={'id':['s1' , 's2', 's3', 's4', 's5' , 's6']})
print(len(data1))
#Remove all the rows that hold a value contained in list1 matched against the 'id' column
data2 = data1[~data1.id.isin(list1)]
print(len(data2))

Now, let's see the output (these counts are from my real dataset, not the toy example above):

Len list1 = 135
Len data1 = 8942
Len data2 = 8672

So, I came to the conclusion that my code is somehow doubling the rows to be removed: 8942 - 8672 = 270, exactly twice the 135 IDs in my list.
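
Here is what I mean on toy data: with 's1' duplicated, the filter removes both copies, not just the extra one.

# 's1' appears twice; ~isin drops both rows, so 2 rows vanish per duplicated id
df = pd.DataFrame(data={'id':['s1', 's2', 's3', 's1']})
print(df[~df.id.isin(['s1'])])
#    id
# 1  s2
# 2  s3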

However, when I follow the drop_duplicates approach, my code works just fine and removes the 135 rows.
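
For reference, the drop_duplicates call I mean (assuming duplicates are identified by the 'id' column) is along these lines:

# keeps the first occurrence of each id and drops the rest
data2 = data1.drop_duplicates(subset='id', keep='first')
print(len(data2))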

Could any of you help me understand why that is happening? I tried to simplify the issue as far as possible.

Thanks a lot!


Solution

  • This is an extraordinarily painful way to do what you're asking; maybe someone will see this and come up with a less painful one. The doubling you observed comes from isin matching every occurrence of a duplicated ID, so the first copy is removed along with the extras (hence 270 = 2 × 135 rows gone). I specifically stayed away from groupby('id').first() as a means of removing duplicates because you mentioned needing to first create a list of the duplicates, but that would be my next best recommendation (sketched at the end of this answer).

    Anyway, I added duplicates of s1 and s2 to your example (one extra copy of s1, two of s2):

    import pandas as pd

    df = pd.DataFrame(data={'id': ['s1', 's2', 's3', 's4', 's5', 's6', 's1', 's2', 's2']})
    

    First, find the IDs with more than one entry (assuming those are duplicates). Here I do use groupby, but only to get the counts, keep those >1, and send the unique values to a list:

    dup_list = df[df.groupby('id')['id'].transform('count') > 1]['id'].unique().tolist()
    print(dup_list)
    
    ['s1', 's2']
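
    As a side note, duplicated() would build the same list without the counting step, since it flags every occurrence after the first:

    dup_list = df.loc[df['id'].duplicated(), 'id'].unique().tolist()
    print(dup_list)

    ['s1', 's2']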
    

    Then iterate over the list, find the indices of each duplicated ID, and drop all but the first occurrence:

    for dup_id in dup_list:
        # index labels of every row with this id, except the first occurrence
        drp = df[df['id'] == dup_id].index[1:].to_list()
        df.drop(drp, inplace=True)
    df
    
       id
    0  s1
    1  s2
    2  s3
    3  s4
    4  s5
    5  s6
    

    Indices 6, 7, and 8 were dropped
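
    And for completeness, a sketch of the groupby('id').first() route mentioned at the top. Note that groupby sorts by key, so the row order can differ from the original frame:

    df = pd.DataFrame(data={'id': ['s1', 's2', 's3', 's4', 's5', 's6', 's1', 's2', 's2']})
    # first() keeps the first row seen per id; reset_index() turns the
    # group keys back into an ordinary column
    df.groupby('id').first().reset_index()

       id
    0  s1
    1  s2
    2  s3
    3  s4
    4  s5
    5  s6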