Search code examples
pythonpandasgroup-byduplicates

Identify and mark duplicates where any of them has a specific quality


I have a data set where I need to group the duplicate records and then see if any of the group has been flagged; if so, I need to indicate that it has been flagged in a new column.

Sample input; the "Concat" column is where I'm looking for duplicates:

Customer_Reference  Auth_ID Concat  Flag
C001    A1234   AAA TRUE
C002    A758    AAA FALSE
C003    A6149   BBB FALSE
C004    A5271   CCC FALSE
C005    A0124   BBB FALSE

The behaviour I'm looking for is that "AAA" is a duplicate where at least one of them has the Flag of "TRUE"; therefore, it should be marked as "TRUE" in the (to be added) column.

I've been able to mark the duplicate records, and the groupby function seems to chunk it up correctly, but I can't work out how to then apply an "If-Then" to each group to assign a value to the new column


Solution

  • An efficient approach would be to drop_duplicates to only keep the rows with multiple occurrence of "Concat", then select those that have a True "Flag" with isin.

    keep = df.drop_duplicates('Concat').query('Flag')['Concat']
    # ['AAA']
    
    df['New'] = df['Concat'].isin(keep)
    

    As a one-liner:

    df['New'] = df['Concat'].isin(df.drop_duplicates('Concat')
                                    .query('Flag')['Concat'])
    

    Output:

      Customer_Reference Auth_ID Concat   Flag    New
    0               C001   A1234    AAA   True   True
    1               C002    A758    AAA  False   True
    2               C003   A6149    BBB  False  False
    3               C004   A5271    CCC  False  False
    4               C005   A0124    BBB  False  False
    

    Intermediates:

    # df.drop_duplicates('Concat')
      Customer_Reference Auth_ID Concat   Flag    New
    0               C001   A1234    AAA   True   True
    2               C003   A6149    BBB  False  False
    3               C004   A5271    CCC  False  False
    
    # df.drop_duplicates('Concat').query('Flag')
      Customer_Reference Auth_ID Concat  Flag   New
    0               C001   A1234    AAA  True  True
    
    
    # df.drop_duplicates('Concat').query('Flag')['Concat']
    0    AAA
    Name: Concat, dtype: object