Search code examples
pythonpandascartesian

Remove rows with same order in a column value obtained from Cartesian product / Cross join.If a row value is AB and other value is BA. I need only AB


how to remove rows with same order in a column value obtained from Cartesian product / Cross join. For e.g., if a row with column value is AB and other row is BA. I need to have only 1 row AB.? Please note that I am doing this for a huge dataset obtained from cartseian product. Hence time of processing & Memory usage is key (I have only 8GB Ram) -- the no of rows is 10 Millions rows by 2000 columns ? For e.g., see below image.

enter image description here

Desired output:

enter image description here


Solution

  • Two ways I can think of.

    df = pd.DataFrame({"UserName_x":["A","B","A","C"],
                       "UserName_y":["B","A","C","A"],
                       "edge":["AB","BA","AC","CA"]})
    

    Method 1: sort your edge string alphabetically and drop_duplicates:

    df["sorted"] = df["edge"].apply(lambda x: ''.join(sorted(x)))
    
    print (df.drop_duplicates(subset="sorted"))
    
    #
      UserName_x UserName_y edge sorted
    0          A          B   AB     AB
    2          A          C   AC     AC
    

    Method 2: Create a product of users without duplicates first, and check with isin:

    from itertools import combinations_with_replacement
    
    unique = df["UserName_x"].unique().tolist()
    combos = ["".join(i) for i in combinations_with_replacement(unique,2)]
    
    print (df[df["edge"].isin(combos)])
    
    #
      UserName_x UserName_y edge sorted
    0          A          B   AB     AB
    2          A          C   AC     AC