I have a dataset with 3 target classes: ‘Yes’, ‘Maybe’, and ‘No’.
Unique_id target
111 Yes
111 Maybe
111 No
112 No
112 Maybe
113 No
I want to drop duplicate rows based on unique_id. But ‘drop duplicates’ generally keeps the first or last row, and I want to keep the rows based on following the conditions:
1) If unique_id has all the 3 classes (Yes, Maybe and No), we’ll keep only the ‘Yes’ class.
2) If unique_id has the 2 classes (Maybe and No), we’ll keep only the ‘Maybe’ class.
3) We’ll keep the ‘No’ class when only ‘No’ will be there.
I tried ‘sort_values’ the target class (Yes=1, Maybe=2, No=3) and then dropped the duplicates.
Desired output:
Unique_id target
111 Yes
112 Maybe
113 No
I’m thinking if there are better ways to do that.
Any suggestions would be appreciated. Thanks!
You can set column target
as a Categorical data type by pd.CategoricalDtype
with the ordering of ['Yes' < 'Maybe' < 'No']
, as follows:
t = pd.CategoricalDtype(categories=['Yes', 'Maybe', 'No'], ordered=True)
df['target'] = df['target'].astype(t)
Then, you group by Unique_id
using .groupby()
and take the min
on target
within the group of same Unique_id
using .GroupBy.min()
:
df.groupby('Unique_id', as_index=False)['target'].min()
Result:
Unique_id target
0 111 Yes
1 112 Maybe
2 113 No
Case 1: If you have 2 or more similar columns (e.g. target
and target2
) to sort in the same ordering, you need just apply the codes to the 2 columns. For example, if we have the following dataframe:
Unique_id target target2
0 111 Yes No
1 111 Maybe Maybe
2 111 No Yes
3 112 No No
4 112 Maybe Maybe
5 113 No Maybe
You can get the min of the 2 columns simultaneously, as follows:
t = pd.CategoricalDtype(categories=['Yes', 'Maybe', 'No'], ordered=True)
df[['target', 'target2']] = df[['target', 'target2']].astype(t)
df.groupby('Unique_id', as_index=False)[['target', 'target2']].min()
Result:
Unique_id target target2
0 111 Yes Yes
1 112 Maybe Maybe
2 113 No Maybe
Case 2: If you want to display all columns in the dataframe instead of just Unique_id
and target
columns, you can use an even simpler syntax, as follows:
Another dataframe example:
Unique_id target Amount
0 111 Yes 123
1 111 Maybe 456
2 111 No 789
3 112 No 1234
4 112 Maybe 5678
5 113 No 25
Then, to show all columns with target
with min values for an Unique_id
, you can use:
t = pd.CategoricalDtype(categories=['Yes', 'Maybe', 'No'], ordered=True)
df['target'] = df['target'].astype(t)
df.loc[df.groupby('Unique_id')['target'].idxmin()]
Result:
Unique_id target Amount
0 111 Yes 123
4 112 Maybe 5678
5 113 No 25