python · pandas · dataframe · drop

Pandas: Drop NA rows based on shared column values


I have the following dataframe:


import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Country": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Year": [2020, 2020, 2021, 2021, 2020, 2020, 2021, 2021],
    "Category": [1, 2, 1, 2, 1, 2, 1, 2],
    "Count": [np.nan, np.nan, 1, 2, 3, np.nan, 5, 6]
})

I want to drop all rows that share the same values in the Country and Year columns and whose Count is NaN for every row in that (Country, Year) group. In this case, rows 0 and 1 are to be removed; note that row 5 should not be removed, because its group also contains non-NaN counts.

Can this be achieved without looping, using some built-in pandas function?

The following code achieves the desired result, but it is quite inefficient (the real dataframe is much larger):

# Drop every (Country, Year) group whose Count values are all NaN
for country in df.Country.unique():
    for year in df.Year.unique():
        if df[(df.Country==country) & (df.Year==year)].Count.isna().all():
            df.drop(df[(df.Country==country) & (df.Year==year)].index, inplace=True)

Is there a better, more efficient approach?


Solution

  • You can use groupby and filter to keep only groups where 'not every count is null'.

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({
        "Country": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "Year": [2020, 2020, 2021, 2021, 2020, 2020, 2021, 2021],
        "Category": [1, 2, 1, 2, 1, 2, 1, 2],
        "Count": [np.nan, np.nan, 1, 2, 3, np.nan, 5, 6]
    })
    
    df.groupby(['Country','Year']).filter(lambda x: not x['Count'].isnull().all())
    

    Output

      Country  Year  Category  Count
    2       A  2021         1    1.0
    3       A  2021         2    2.0
    4       B  2020         1    3.0
    5       B  2020         2    NaN
    6       B  2021         1    5.0
    7       B  2021         2    6.0
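
  • If the real dataframe is large, a boolean mask built with groupby().transform() is usually faster than filter, because it avoids calling a Python lambda once per group. Below is a minimal sketch of that alternative (same example data as above; the variable names are only illustrative):

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        "Country": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "Year": [2020, 2020, 2021, 2021, 2020, 2020, 2021, 2021],
        "Category": [1, 2, 1, 2, 1, 2, 1, 2],
        "Count": [np.nan, np.nan, 1, 2, 3, np.nan, 5, 6]
    })

    # transform('count') counts the non-NaN Count values per (Country, Year)
    # group and broadcasts that number back to every row of the group.
    non_na_per_group = df.groupby(['Country', 'Year'])['Count'].transform('count')

    # Keep only the rows whose group has at least one non-NaN Count.
    result = df[non_na_per_group > 0]
    print(result)

    This keeps the same rows as the filter approach and, like filter, preserves the original index; chain .reset_index(drop=True) if you want a fresh 0-based index.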