I have table similar to this one:
index name_1 path1 name_2 path2
0 Roy path/to/Roy Anne path/to/Anne
1 Anne path/to/Anne Roy path/to/Roy
2 Hari path/to/Hari Wili path/to/Wili
3 Wili path/to/Wili Hari path/to/Hari
4 Miko path/to/miko Lin path/to/lin
5 Miko path/to/miko Dan path/to/dan
6 Lin path/to/lin Miko path/to/miko
7 Lin path/to/lin Dan path/to/dan
8 Dan path/to/dan Miko path/to/miko
9 Dan path/to/dan Lin path/to/lin
...
As you can see, the table kind of showing relationship between entities -
Roi is with Anne,
Wili with Hari,
Lin with Dan and with Miko.
The table is actually showing overlap data , meaning, Hari and wili for example, have the same document, and I would like to remove one of them not to have duplicated files. In order to do this, I would like to create new table that has only one name in relationship, so I can later create list of paths to remove.
The result table will look like this :
index name_1 path1 name_2 path2
0 Roy path/to/Roy Anne path/to/Anne
1 Hari path/to/Hari Wili path/to/Wili
2 Miko path/to/miko Lin path/to/lin
3 Miko path/to/miko Dan path/to/dan
The idea is that I'll use the values of "path2" to remove files with this path, and will still have the files in path1. for that reason, this line:
4 Lin path/to/lin Dan path/to/dan
is missing, as it will be removed using Miko... any ideas how to do this ? :)
Edit:
I have tried this based on this answer:
df_2= df[~pd.DataFrame(np.sort(df.values,axis=1)).duplicated()]
And it's true that I get less rows in my dataframe (it has 695 and I got now 402) , but, I still have the first lines like this:
index name_1 path1 name_2 path2
0 Roy path/to/Roy Anne path/to/Anne
1 Anne path/to/Anne Roy path/to/Roy
...
meaning I still get the same issue
You can use frozenset
to detect duplicates:
out = (df[~df[['name_1', 'name_2']].agg(frozenset, axis=1).duplicated()]
.loc[lambda x: ~x['path2'].isin(x['path1'])])
# OR
out = (df[~pd.DataFrame(np.sort(df.values,axis=1)).duplicated()]
.query('~path1.isin(path2)'))
Output:
>>> out
name_1 path1 name_2 path2
0 Roy path/to/Roy Anne path/to/Anne
2 Hari path/to/Hari Wili path/to/Wili
5 Miko path/to/miko Dan path/to/dan
7 Lin path/to/lin Dan path/to/dan