Search code examples
pythonpandasloopsduplicatesdata-cleaning

Keep populated duplicate rows while removing those with NULL field


I have a dataframe in Python with 3 columns with some duplicate rows. Within the duplicate rows, all fields are the same except for one (which is null).

Example data:

ID Country Trades
P11 France 5
P12 Germany 3
P13 UK 7
P13 NULL 7
P14 Croatia 13
P15 USA 5
P15 NULL 5
P16 UK 2
P16 NULL 2

I want to remove the duplicate that contain the NULL value, while keeping the one that has no NULLs in the 'Country' column. Note: it is not only the top duplicate row that has the data while the bottom contains the NULL, sometimes this inverses (top duplicate contains the NULL while bottom duplicate has the country data).

Hoping to clean dataframe down to get this:

ID Country Trades
P11 France 5
P12 Germany 3
P13 UK 7
P14 Croatia 13
P15 USA 5
P16 UK 2

Where the duplicates have been removed with no data loss (in the Country column). Many thanks in advance you legends!


Solution

  • Hello and welcome vegeratian_python. Here's an example of how to solve it:

    # Sort by 'Country' and put NAs always first
    df_sorted = df.sort_values(by='Country', na_position='last')
    
    # Drop duplicates keeping the last occurrence
    df_cleaned = df_sorted.drop_duplicates(subset=['ID'], keep='last')
    
    print(df_cleaned)