Search code examples
pythonpandasduplicatesdrop

Python Pandas drop row duplicates on a column if no duplicate on other column


I have this df with email headers. I need to eliminate all duplicates where Subject is the same AND Source is different. I have spent hours trying to figure out a solution or find a similar case...

Date From Subject Source
12/06/21 Sender1 Test123 Inbox
12/06/21 Sender2 Confirm Inbox
12/06/21 Sender1 Test123 Sent
12/06/21 Sender3 Test_on Inbox
12/06/21 Sender3 Test_on Inbox

Practically from the table above the rows with subject = 'Test123' should be dropped.

Date From Subject Source
12/06/21 Sender2 Confirm Inbox
12/06/21 Sender3 Test_on Inbox
12/06/21 Sender3 Test_on Inbox

Solution

  • You can use set to determine for each sender if there are multiple source. If yes, drop the row.

    >>> df.loc[df.groupby('From')['Source'].transform(lambda x: len(set(x)) == 1)]
    
           Date     From  Subject Source
    1  12/06/21  Sender2  Confirm  Inbox
    3  12/06/21  Sender3  Test_on  Inbox
    4  12/06/21  Sender3  Test_on  Inbox