Please be informed that I already looked through various posts before turning to you. In fact, I tried to implement the solution provided in: dropping rows from dataframe based on a "not in" condition
My problem is the following. Let's assume I have a huge dataframe from which I want to remove duplicates. I'm well aware I could use drop_duplicates, since it is the fastest and simplest approach. However, our teacher wants us to create a list containing the IDs of the duplicates and then remove the rows whose values are contained in that list.
import pandas as pd

#My list
list1 = ['s1' , 's2']
print(len(list1))
#My dataframe
data1 = pd.DataFrame(data={'id':['s1' , 's2', 's3', 's4', 's5' , 's6']})
print(len(data1))
#Remove all rows whose 'id' value is contained in list1
data2 = data1[~data1.id.isin(list1)]
print(len(data2))
Now, let's see the output:
Len list1 = 135
Len data1 = 8942
Len data2 = 8672
So, I came to the conclusion that my code is somehow removing twice as many rows as it should: 270 (8942 − 8672) instead of 135.
However, when I follow the drop_duplicates approach, my code works just fine and removes the 135 rows.
Could any of you help me understand why that is happening? I tried to simplify the issue as far as possible.
Thanks a lot!
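The "why" is that isin flags every row whose id appears in the list, including the first occurrence, so each of your 135 duplicated ids removes both of its copies: 270 rows, and 8942 − 270 = 8672. A minimal sketch using your simplified example, with duplicates of s1 and s2 added so the effect is visible:

```python
import pandas as pd

# Same frame as in the question, plus one extra copy each of s1 and s2
data1 = pd.DataFrame(data={'id': ['s1', 's2', 's3', 's4', 's5', 's6', 's1', 's2']})
list1 = ['s1', 's2']

# isin marks EVERY occurrence of a listed id -- first copies included --
# so 2 duplicated ids remove 4 rows, not 2
data2 = data1[~data1.id.isin(list1)]
print(len(data1) - len(data2))  # 4
```

Scale that up: 135 duplicated ids, each appearing twice, and you lose 270 rows.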
This is an extraordinarily painful way to do what you're asking. Maybe someone will see this and make a less painful way. I specifically stayed away from groupby('id').first()
as a means to remove duplicates, because you mentioned needing to first create a list of duplicates. But that would be my next best recommendation.
Anyway, I added duplicates of s1 and s2 to your example:
df = pd.DataFrame(data={'id':['s1' , 's2', 's3', 's4', 's5' , 's6', 's1' , 's2', 's2']})
Find the IDs with more than one entry (assuming those are the duplicates). Here I do use groupby to get counts, keep those > 1, and send the unique values to a list:
dup_list = df[df.groupby('id')['id'].transform('count') > 1]['id'].unique().tolist()
print(dup_list)
['s1', 's2']
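An equivalent, arguably more readable way to build the same list is value_counts; this is just a sketch of the same idea, not what the teacher necessarily expects:

```python
import pandas as pd

df = pd.DataFrame(data={'id': ['s1', 's2', 's3', 's4', 's5', 's6', 's1', 's2', 's2']})

# Count occurrences per id, then keep only the ids appearing more than once
counts = df['id'].value_counts()
dup_list = counts[counts > 1].index.tolist()
print(sorted(dup_list))  # ['s1', 's2']
```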
Then iterate over the list, finding the indices of each duplicated id and removing all but the first:
for dup_id in dup_list:
    # keep the first occurrence of this id, drop the rest
    drp = df[df['id'] == dup_id].index[1:].to_list()
    df.drop(drp, inplace=True)
df
id
0 s1
1 s2
2 s3
3 s4
4 s5
5 s6
Indices 6, 7, and 8 were dropped