I have this df with email headers. I need to eliminate all duplicates where Subject is the same AND Source is different. I have spent hours trying to figure out a solution or find a similar case...
Date | From | Subject | Source |
---|---|---|---|
12/06/21 | Sender1 | Test123 | Inbox |
12/06/21 | Sender2 | Confirm | Inbox |
12/06/21 | Sender1 | Test123 | Sent |
12/06/21 | Sender3 | Test_on | Inbox |
12/06/21 | Sender3 | Test_on | Inbox |
Practically from the table above the rows with subject = 'Test123' should be dropped.
Date | From | Subject | Source |
---|---|---|---|
12/06/21 | Sender2 | Confirm | Inbox |
12/06/21 | Sender3 | Test_on | Inbox |
12/06/21 | Sender3 | Test_on | Inbox |
You can use set
to determine for each sender if there are multiple source. If yes, drop the row.
>>> df.loc[df.groupby('From')['Source'].transform(lambda x: len(set(x)) == 1)]
Date From Subject Source
1 12/06/21 Sender2 Confirm Inbox
3 12/06/21 Sender3 Test_on Inbox
4 12/06/21 Sender3 Test_on Inbox