
Remove duplicate node combinations in a pandas DataFrame


I have a DataFrame that is an edge list for an undirected graph. It looks like this:

  node 1 node 2        doc
0     Kn     Kn    doc5477
1     TS     Kn    doc5477
2     Kn     TS    doc5477
3     TS     TS    doc5477
4     Kn     Kn   doc10967
5     Kn     TS   doc10967
6     TS     TS   doc10967
7     TS     Kn   doc10967

How can I make sure that each combination of nodes appears only once per document? For example, rows 1 and 2 contain the same pair of nodes (the graph is undirected, so TS, Kn is the same edge as Kn, TS), so I want only one of them to appear. The same goes for rows 5 and 7.

The resulting DataFrame should look like this:

  node 1 node 2        doc
0     Kn     Kn    doc5477
1     TS     Kn    doc5477
3     TS     TS    doc5477
4     Kn     Kn   doc10967
5     Kn     TS   doc10967
6     TS     TS   doc10967

Solution

  • First, select the columns that together must form a unique combination (node 1, node 2 and doc in your case). Then sort the values within each row, so that (TS, Kn) and (Kn, TS) produce the same result; this returns a Series with one sorted combination per row. Finally, use the negated result of pandas.Series.duplicated as a boolean mask to keep only the first row of each unique combination.

    Try this:

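    # Sort each row and store it as a tuple: tuples are hashable, which duplicated() requires
    out = df.loc[~df[['node 1', 'node 2', 'doc']].apply(lambda row: tuple(sorted(row)), axis=1).duplicated()]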
    

    # Output:

    print(out)
    
      node 1 node 2        doc
    0     Kn     Kn    doc5477
    1     TS     Kn    doc5477
    3     TS     TS    doc5477
    4     Kn     Kn   doc10967
    5     Kn     TS   doc10967
    6     TS     TS   doc10967
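
The approach above sorts the doc value together with the node labels, which works here but could misbehave if a document name ever coincided with a node label. As a hedged alternative, the sketch below sorts only the two node columns with NumPy and compares the doc column separately. The helper frame `key` and its column names `a` and `b` are hypothetical names introduced for illustration, and the sample data is copied from the question.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "node 1": ["Kn", "TS", "Kn", "TS", "Kn", "Kn", "TS", "TS"],
    "node 2": ["Kn", "Kn", "TS", "TS", "Kn", "TS", "TS", "Kn"],
    "doc": ["doc5477"] * 4 + ["doc10967"] * 4,
})

# Sort the two node labels within each row so that (TS, Kn) and (Kn, TS)
# become the same ordered pair; "doc" is deliberately left out of the sort.
nodes = np.sort(df[["node 1", "node 2"]].to_numpy(), axis=1)

# Hypothetical helper frame: the canonical pair plus the document id,
# aligned to the original index so the mask can be applied to df.
key = pd.DataFrame(
    {"a": nodes[:, 0], "b": nodes[:, 1], "doc": df["doc"].to_numpy()},
    index=df.index,
)

# Keep the first occurrence of each (pair, doc) combination
out = df.loc[~key.duplicated()]
print(out)
```

Because the sort runs on a NumPy array rather than row by row through `apply`, this variant is also likely to be faster on large edge lists, though that has not been measured here.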