I have a big dataframe with the format
col1 col2 val1 val2
[0]A B 0.8 0.1
[1]B A 0.8 0.1
[2]A C 0.3 0.9
[3]A D 0.2 0.8
[4]D A 0.2 0.8
As you can see, some rows are duplicated with only col1
and col2
reversed. For instance, row 1
is duplicate of row 0
, and row 4
is duplicate of row 3
. Can you drop the duplicates based on the match of col1
:col2
being equal to those of col2
:col1
?
You can sorting both columns by np.sort
and assign back, then use DataFrame.drop_duplicates
with specify some columns:
df[['col1','col2']] = np.sort(df[['col1','col2']], axis=1)
df1 = df.drop_duplicates(['col1','col2'])
print (df1)
col1 col2 val1 val2
0 A B 0.8 0.1
2 A C 0.3 0.9
3 A D 0.2 0.8
Of remove duplicates by all columns:
df2 = df.drop_duplicates()
print (df2)
col1 col2 val1 val2
0 A B 0.8 0.1
2 A C 0.3 0.9
3 A D 0.2 0.8