I have a dataframe in Python with 3 columns and some duplicate rows. Within the duplicate rows, all fields are the same except for one (which is null).
Example data:
ID | Country | Trades |
---|---|---|
P11 | France | 5 |
P12 | Germany | 3 |
P13 | UK | 7 |
P13 | NULL | 7 |
P14 | Croatia | 13 |
P15 | USA | 5 |
P15 | NULL | 5 |
P16 | UK | 2 |
P16 | NULL | 2 |
I want to remove the duplicates that contain the NULL value, while keeping the ones that have no NULLs in the 'Country' column. Note: it is not always the top duplicate row that has the data while the bottom contains the NULL; sometimes this is inverted (top duplicate contains the NULL while the bottom duplicate has the country data).
Hoping to clean dataframe down to get this:
ID | Country | Trades |
---|---|---|
P11 | France | 5 |
P12 | Germany | 3 |
P13 | UK | 7 |
P14 | Croatia | 13 |
P15 | USA | 5 |
P16 | UK | 2 |
Where the duplicates have been removed with no data loss (in the Country column). Many thanks in advance you legends!
Hello and welcome vegeratian_python. Here's an example of how to solve it:
# Sort by 'Country' so rows with a NULL (NaN) Country always come last
df_sorted = df.sort_values(by='Country', na_position='last')
# Drop duplicates per ID, keeping the first occurrence (the non-null row)
df_cleaned = df_sorted.drop_duplicates(subset=['ID'], keep='first')
# Optionally restore the original ID order
df_cleaned = df_cleaned.sort_values(by='ID')
print(df_cleaned)
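For reference, here's the whole thing runnable end to end with your sample data (a sketch — the column names are assumed to match your example table, so adjust if yours differ):

```python
import numpy as np
import pandas as pd

# Sample data from the question, with NULL represented as NaN
df = pd.DataFrame({
    'ID': ['P11', 'P12', 'P13', 'P13', 'P14', 'P15', 'P15', 'P16', 'P16'],
    'Country': ['France', 'Germany', 'UK', np.nan, 'Croatia',
                'USA', np.nan, 'UK', np.nan],
    'Trades': [5, 3, 7, 7, 13, 5, 5, 2, 2],
})

# NaN rows sort to the end, so keep='first' retains the non-null row per ID
df_cleaned = (
    df.sort_values(by='Country', na_position='last')
      .drop_duplicates(subset=['ID'], keep='first')
      .sort_values(by='ID')
      .reset_index(drop=True)
)

print(df_cleaned)
```

One thing to note with this approach: if an ID only ever appears with a NULL Country (no non-null duplicate exists), the NULL row is kept rather than dropped, which is usually what you want to avoid losing data.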