Search code examples
pythonpandasgroupingdrop-duplicates

How to drop_duplicate using different condition per group?


I have dataFrame and I need to drop duplicates per group ('col1') based on a minimum value in another column 'abs(col1 - col2)', but I need to change this condition for the last group by taking the max value in 'abs(col1 - col2)' that corresponding to the last group in 'col1' where I sorted the 'col1' with ascending order. (to behave as a loop)

Update 1 :

I need to assign the last group dynamically.

for example, If I have a data frame as

  • creating DataFrame

df = pd.DataFrame( {'col0':['A','A','A','A','A','A','A','A','A','A','A','A','B','B','B','B','B','B','B','B','B','B','B','B'],'col1':[1,1,1,2,2,2,3,3,3,4,4,4,2,2,2,3,3,3,4,4,4,5,5,5], 'col2':[2,3,4,1,3,4,1,2,4,1,2,3,3,4,5,2,4,5,2,3,5,2,3,4]})

compute Diff column (this column will be used as a condition)

df['abs(col1 - col2)']=abs(df['col1']-df['col2'])

  • The original Df as follow :

enter image description here

  • The desired Df should looks like:

enter image description here

  • my trial:

    df.sort_values(by=['col0','col1','abs(col1 - col2)','col2'],ascending=[True,True,True,False]).drop_duplicates(['col0','col1'])

  • the resulting as follow:

enter image description here


Solution

  • Updated:

    If I understand correctly, you want each a different sort order each time you reach a maximum on col1.

    1. Extract the groups where sorting is different:
    df.groupby(['col0'], as_index=False)['col1'].max()
    
    1. Deduplicate df as you did
    2. Deduplicate just the groups found in step 1 with the correct sorting. You can get these groups by merging with the original df:
    pd.merge(df, col1_max_groups)
    
    1. Update the deduplicated DataFrame with the new values

    Full example:

    col1_max_groups = df.groupby(['col0'], as_index=False)['col1'].max()
    deduped = df.sort_values(['col0', 'col1', 'abs(col1 - col2)', 'col2'], 
                             ascending=[True, True, True, False]) \
        .drop_duplicates(['col0', 'col1']) \
        .set_index(['col0', 'col1'])
    update = pd.merge(df, col1_max_groups) \
        .sort_values(['col0', 'col1', 'abs(col1 - col2)', 'col2'], 
                     ascending=[True, True, False, False]) \
        .drop_duplicates(['col0', 'col1'])
    deduped.update(update.set_index(['col0', 'col1']))
    deduped.reset_index()
    
    # returns
    # col0  col1  col2  abs(col1 - col2)
    #    A     1     2                 1
    #    A     2     3                 1
    #    A     3     4                 1
    #    A     4     1                 3
    #    B     2     3                 1
    #    B     3     4                 1
    #    B     4     5                 1
    #    B     5     2                 3