how to remove rows with same order in a column value obtained from Cartesian product / Cross join. For e.g., if a row with column value is AB and other row is BA. I need to have only 1 row AB.? Please note that I am doing this for a huge dataset obtained from cartseian product. Hence time of processing & Memory usage is key (I have only 8GB Ram) -- the no of rows is 10 Millions rows by 2000 columns ? For e.g., see below image.
Desired output:
Two ways I can think of.
df = pd.DataFrame({"UserName_x":["A","B","A","C"],
"UserName_y":["B","A","C","A"],
"edge":["AB","BA","AC","CA"]})
Method 1: sort your edge
string alphabetically and drop_duplicates
:
df["sorted"] = df["edge"].apply(lambda x: ''.join(sorted(x)))
print (df.drop_duplicates(subset="sorted"))
#
UserName_x UserName_y edge sorted
0 A B AB AB
2 A C AC AC
Method 2: Create a product of users without duplicates first, and check with isin
:
from itertools import combinations_with_replacement
unique = df["UserName_x"].unique().tolist()
combos = ["".join(i) for i in combinations_with_replacement(unique,2)]
print (df[df["edge"].isin(combos)])
#
UserName_x UserName_y edge sorted
0 A B AB AB
2 A C AC AC