Search code examples
pythonpandasdataframeduplicatesdrop

Interesting Pandas dataframe problem: how to drop duplicates (inverse) over two columns - for each row with a common attribute?


After filtering out the inverse duplicates, I have to count how many actual duplicates there are. Here is my (working example) code, it's too slow though, for 90 000+ rows.. using iterrows:

import pandas as pd

data = {'id_x':[1,2,3,4,5,6],      
'ADDICTOID_x':['BFO:0000023', 'MF:0000016', 'BFO:0000023', 'MF:0000016', 'MF:0000016', 'ADDICTO:0000872'],     
'PMID':[34116904, 34116904, 34112174, 34112174, 34112174, 22429780],                  
'LABEL_x':['role', 'human being', 'role', 'human being', 'human being', 'FDA'],   
'id_y':[11,12,13,14,15,16],    
'ADDICTOID_y':['MF:0000016', 'BFO:0000023', 'MF:0000016', 'BFO:0000023', 'BFO:0000023', 'ADDICTO:0000904'],                
'LABEL_y':['human being', 'role', 'human being', 'role', 'role', '']}
dcp = pd.DataFrame(data)

dcp = dcp.drop(dcp[dcp.LABEL_x == dcp.LABEL_y].index)

for index, row in dcp.iterrows():  # THIS IS SLOW
        if ((dcp['ADDICTOID_x'] == row['ADDICTOID_y'])
            & (dcp['ADDICTOID_y'] == row['ADDICTOID_x'])
            & (dcp['PMID'] == row['PMID'])).any():  # Does the inverse of this row exist in the table?
            dcp.drop(index, inplace=True)

print("dcp after drop: ")
print(dcp)

I can't just use dcp.duplicated(subset=['ADDICTOID_x', 'ADDICTOID_y'], keep='first') because that removes ALL of the duplicates (there are many) and I only want to do them one by one, and the 'PMID' needs to match also. Similarly, (dcp.ADDICTOID_x + dcp.ADDICTOID_y).isin(dcp.ADDICTOID_y + dcp.ADDICTOID_x) & (dcp.PMID == dcp.PMID) finds rows with duplicates everywhere. Iterrows and test one by one is the only way I have found which works, but it's too slow. Anyone know of a solution to this?

After filtering for inverse duplicates, I count like so: data_chord_plot = dcp.groupby(['LABEL_x', 'LABEL_y'], as_index=False)[['PMID']].count() data_chord_plot.columns = ['source','target','value']

EDIT: in this simple example, rows 1 and 3 are removed as they are inverse duplicates of rows 2 and 4.

EDIT: I need to eliminate the "mirror" image of rows with inverse duplicates over the two columns, but only one for each row with a duplicate. Some rows don't have a mirror image.

CORRECT OUTPUT FROM (SLOW) EXAMPLE:

id_x ADDICTOID_x PMID LABEL_x id_y ADDICTOID_y LABEL_y

1 2 MF:0000016 34116904 human being 12 BFO:0000023 role

3 4 MF:0000016 34112174 human being 14 BFO:0000023 role

4 5 MF:0000016 34112174 human being 15 BFO:0000023 role

5 6 ADDICTO:0000872 22429780 FDA 16 ADDICTO:0000904


Solution

  • Create a sorted tuple of ADDICTOID_xy and use drop_duplicates with the right subset:

    dcp['ADDICTOID'] = dcp[['ADDICTOID_x', 'ADDICTOID_y']].apply(sorted, axis=1) \
                                                          .apply(tuple)
    
    out = dcp.drop_duplicates(subset=['ADDICTOID', 'PMID'], keep='first')
    
    >>> out
       id_x      ADDICTOID_x      PMID LABEL_x  id_y      ADDICTOID_y      LABEL_y                           ADDICTOID
    0     1      BFO:0000023  34116904    role    11       MF:0000016  human being           (BFO:0000023, MF:0000016)
    2     3      BFO:0000023  34112174    role    13       MF:0000016  human being           (BFO:0000023, MF:0000016)
    5     6  ADDICTO:0000872  22429780     FDA    16  ADDICTO:0000904               (ADDICTO:0000872, ADDICTO:0000904)