Search code examples
pythonpandasdataframedata-manipulationdrop-duplicates

Drop Duplicate Rows Based on Target Class Conditions


I have a dataset with 3 target classes: ‘Yes’, ‘Maybe’, and ‘No’.

Unique_id       target
111              Yes
111             Maybe
111              No
112              No
112             Maybe
113              No

I want to drop duplicate rows based on unique_id. But ‘drop duplicates’ generally keeps the first or last row, and I want to keep the rows based on following the conditions:

1) If unique_id has all the 3 classes (Yes, Maybe and No), we’ll keep only the ‘Yes’ class.
2) If unique_id has the 2 classes (Maybe and No), we’ll keep only the ‘Maybe’ class.
3) We’ll keep the ‘No’ class when only ‘No’ will be there.

I tried ‘sort_values’ the target class (Yes=1, Maybe=2, No=3) and then dropped the duplicates.

Desired output:

Unique_id       target
111               Yes
112              Maybe
113               No

I’m thinking if there are better ways to do that.

Any suggestions would be appreciated. Thanks!


Solution

  • You can set column target as a Categorical data type by pd.CategoricalDtype with the ordering of ['Yes' < 'Maybe' < 'No'], as follows:

    t = pd.CategoricalDtype(categories=['Yes', 'Maybe', 'No'], ordered=True)
    df['target'] = df['target'].astype(t)
    

    Then, you group by Unique_id using .groupby() and take the min on target within the group of same Unique_id using .GroupBy.min():

    df.groupby('Unique_id', as_index=False)['target'].min()
    

    Result:

       Unique_id target
    0        111    Yes
    1        112  Maybe
    2        113     No
    

    Edit

    Case 1: If you have 2 or more similar columns (e.g. target and target2) to sort in the same ordering, you need just apply the codes to the 2 columns. For example, if we have the following dataframe:

       Unique_id target target2
    0        111    Yes      No
    1        111  Maybe   Maybe
    2        111     No     Yes
    3        112     No      No
    4        112  Maybe   Maybe
    5        113     No   Maybe
    

    You can get the min of the 2 columns simultaneously, as follows:

    t = pd.CategoricalDtype(categories=['Yes', 'Maybe', 'No'], ordered=True)
    df[['target', 'target2']] = df[['target', 'target2']].astype(t)
    
    df.groupby('Unique_id', as_index=False)[['target', 'target2']].min()
    

    Result:

       Unique_id target target2
    0        111    Yes     Yes
    1        112  Maybe   Maybe
    2        113     No   Maybe
    

    Case 2: If you want to display all columns in the dataframe instead of just Unique_id and target columns, you can use an even simpler syntax, as follows:

    Another dataframe example:

       Unique_id target  Amount
    0        111    Yes     123
    1        111  Maybe     456
    2        111     No     789
    3        112     No    1234
    4        112  Maybe    5678
    5        113     No      25
    

    Then, to show all columns with target with min values for an Unique_id, you can use:

    t = pd.CategoricalDtype(categories=['Yes', 'Maybe', 'No'], ordered=True)
    df['target'] = df['target'].astype(t)
    
    df.loc[df.groupby('Unique_id')['target'].idxmin()]
    

    Result:

       Unique_id target  Amount
    0        111    Yes     123
    4        112  Maybe    5678
    5        113     No      25