I Have a dataframe which has some unique IDs in two of the columns.for e.g
S.no. Column1 Column2
1 00001x 00002x
2 00003j 00005k
3 00002x 00001x
4 00004d 00008e
Value can be anything in the string format I want to compare the two column in such a way that either of s.no 1 or 3 data remains. as these id contains the same information. only its order is different.
Basically if for one row value in a column 1 is X and column 2 is Y and for other row value in column 1 is Y and in Column 2 is x then only one of the row should remain.
is that possible in python?
You can convert your columns as frozenset
per row.
This will give a common order to apply duplicated
.
Finally, slice the rows using the previous output as mask:
mask = df.filter(like='Column').apply(frozenset, axis=1).duplicated()
df[~mask]
previous answer using set
:
mask = df.filter(like='Column').apply(lambda x: tuple(set(x)), axis=1).duplicated()
df[~mask]
NB. Using a set or sorted requires to convert as tuple (lambda x: tuple(sorted(x))
) as the duplicated
function hashes the values, which is not possible with mutable objects
output:
S.no. Column1 Column2
0 1 00001x 00002x
1 2 00003j 00005k
3 4 00004d 00008e