I have DataFrame with multiple columns and few columns contains list values. By considering only columns with list values in it, duplicate rows have to be deleted.
Current Dataframe:
ID col1 col2 col3 col4
1 52 [kjd,pkh,sws] [aqs,zxc,asd] [plm,okn,ijb]
2 47 [qaz,wsx,edc] [aws,rfc,tgb] [rty,wer,dfg]
3 85 [kjd,pkh,sws] [aqs,zxc,asd] [plm,okn,ijb]
4 27 [asw,bxs,mdh] [wka,kdy,kaw] [pqm,lsc,yhb]
Desired output:
ID col1 col2 col3 col4
2 47 [qaz,wsx,edc] [aws,rfc,tgb] [rty,wer,dfg]
4 27 [asw,bxs,mdh] [wka,kdy,kaw] [pqm,lsc,yhb]
I have tried converting it to tuple and apply df.drop_duplicates() but am getting multiple errors
You can convert each of the columns with lists into str and then drop duplicates.
astype(str)
.drop_duplicates
with the columns as strings. Since you
want all duplicates to be removed, set keep=False
.astype(str)
columns as you no longer
need them.The full code will be:
c = ['col1','col2','col3','col4']
d =[[52,['kjd','pkh','sws'],['aqs','zxc','asd'],['plm','okn','ijb']],
[47,['qaz','wsx','edc'],['aws','rfc','tgb'],['rty','wer','dfg']],
[85,['kjd','pkh','sws'],['aqs','zxc','asd'],['plm','okn','ijb']],
[27,['asw','bxs','mdh'],['wka','kdy','kaw'],['pqm','lsc','yhb']]]
import pandas as pd
df = pd.DataFrame(d,columns=c)
print(df)
df['col2s'] = df['col2'].astype(str)
df['col3s'] = df['col3'].astype(str)
df['col4s'] = df['col4'].astype(str)
df.drop_duplicates(subset=['col2s', 'col3s','col4s'],keep=False,inplace=True)
df.drop(['col2s', 'col3s','col4s'],axis=1,inplace=True)
print (df)
The output of this will be:
Original DataFrame:
col1 col2 col3 col4
0 52 [kjd, pkh, sws] [aqs, zxc, asd] [plm, okn, ijb]
1 47 [qaz, wsx, edc] [aws, rfc, tgb] [rty, wer, dfg]
2 85 [kjd, pkh, sws] [aqs, zxc, asd] [plm, okn, ijb]
3 27 [asw, bxs, mdh] [wka, kdy, kaw] [pqm, lsc, yhb]
DataFrame after dropping the duplicates:
col1 col2 col3 col4
1 47 [qaz, wsx, edc] [aws, rfc, tgb] [rty, wer, dfg]
3 27 [asw, bxs, mdh] [wka, kdy, kaw] [pqm, lsc, yhb]