python pandas dataframe data-manipulation drop-duplicates

Drop Duplicate Rows Based on Target Class Conditions

I have a dataset with 3 target classes: ‘Yes’, ‘Maybe’, and ‘No’.

Unique_id       target
111              Yes
111             Maybe
111              No
112              No
112             Maybe
113              No

I want to drop duplicate rows based on unique_id. But ‘drop duplicates’ generally keeps the first or last row, and I want to keep the rows based on following the conditions:

1) If unique_id has all the 3 classes (Yes, Maybe and No), we’ll keep only the ‘Yes’ class.
2) If unique_id has the 2 classes (Maybe and No), we’ll keep only the ‘Maybe’ class.
3) We’ll keep the ‘No’ class when only ‘No’ will be there.

I tried ‘sort_values’ the target class (Yes=1, Maybe=2, No=3) and then dropped the duplicates.

Desired output:

Unique_id       target
111               Yes
112              Maybe
113               No

I’m thinking if there are better ways to do that.

Any suggestions would be appreciated. Thanks!

Solution

You can set column target as a Categorical data type by pd.CategoricalDtype with the ordering of ['Yes' < 'Maybe' < 'No'], as follows:

t = pd.CategoricalDtype(categories=['Yes', 'Maybe', 'No'], ordered=True)
df['target'] = df['target'].astype(t)

Then, you group by Unique_id using .groupby() and take the min on target within the group of same Unique_id using .GroupBy.min():

df.groupby('Unique_id', as_index=False)['target'].min()

Result:

   Unique_id target
0        111    Yes
1        112  Maybe
2        113     No

Edit

Case 1: If you have 2 or more similar columns (e.g. target and target2) to sort in the same ordering, you need just apply the codes to the 2 columns. For example, if we have the following dataframe:

   Unique_id target target2
0        111    Yes      No
1        111  Maybe   Maybe
2        111     No     Yes
3        112     No      No
4        112  Maybe   Maybe
5        113     No   Maybe

You can get the min of the 2 columns simultaneously, as follows:

t = pd.CategoricalDtype(categories=['Yes', 'Maybe', 'No'], ordered=True)
df[['target', 'target2']] = df[['target', 'target2']].astype(t)

df.groupby('Unique_id', as_index=False)[['target', 'target2']].min()

Result:

   Unique_id target target2
0        111    Yes     Yes
1        112  Maybe   Maybe
2        113     No   Maybe

Case 2: If you want to display all columns in the dataframe instead of just Unique_id and target columns, you can use an even simpler syntax, as follows:

Another dataframe example:

   Unique_id target  Amount
0        111    Yes     123
1        111  Maybe     456
2        111     No     789
3        112     No    1234
4        112  Maybe    5678
5        113     No      25

Then, to show all columns with target with min values for an Unique_id, you can use:

t = pd.CategoricalDtype(categories=['Yes', 'Maybe', 'No'], ordered=True)
df['target'] = df['target'].astype(t)

df.loc[df.groupby('Unique_id')['target'].idxmin()]

Result:

   Unique_id target  Amount
0        111    Yes     123
4        112  Maybe    5678
5        113     No      25