I have the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "Country": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Year": [2020, 2020, 2021, 2021, 2020, 2020, 2021, 2021],
    "Category": [1, 2, 1, 2, 1, 2, 1, 2],
    "Count": [np.nan, np.nan, 1, 2, 3, np.nan, 5, 6]
})
I want to drop all rows that share the same values in the Country and Year columns and whose group contains only NaN values in the Count column. So in this case, rows 0 and 1 are to be removed (note that row 5 should not be removed, because its group also contains non-NaN counts).
Can this be achieved without looping, with some inbuilt pandas function?
The following code achieves the desired result, but it is quite inefficient (the real dataframe is much larger):
for country in df.Country.unique():
    for year in df.Year.unique():
        mask = (df.Country == country) & (df.Year == year)
        if df[mask].Count.isna().all():
            df.drop(df[mask].index, inplace=True)
Is there a better, more efficient approach?
You can use groupby and filter to keep only the groups where not every Count is null:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Country": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Year": [2020, 2020, 2021, 2021, 2020, 2020, 2021, 2021],
    "Category": [1, 2, 1, 2, 1, 2, 1, 2],
    "Count": [np.nan, np.nan, 1, 2, 3, np.nan, 5, 6]
})

# Keep each (Country, Year) group only if at least one Count is non-NaN
df.groupby(['Country', 'Year']).filter(lambda g: g['Count'].notna().any())
Output
Country Year Category Count
2 A 2021 1 1.0
3 A 2021 2 2.0
4 B 2020 1 3.0
5 B 2020 2 NaN
6 B 2021 1 5.0
7 B 2021 2 6.0
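Since performance matters here, note that filter calls a Python lambda once per group, which can be slow on a dataframe with many groups. A sketch of a usually faster, vectorized alternative: build a boolean mask with groupby(...).transform('any') and index with it directly.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Country": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Year": [2020, 2020, 2021, 2021, 2020, 2020, 2021, 2021],
    "Category": [1, 2, 1, 2, 1, 2, 1, 2],
    "Count": [np.nan, np.nan, 1, 2, 3, np.nan, 5, 6]
})

# For each row, True if its (Country, Year) group has at least one
# non-NaN Count. transform('any') broadcasts the per-group result
# back to every row, so the mask aligns with df's index.
mask = df["Count"].notna().groupby([df["Country"], df["Year"]]).transform("any")
result = df[mask]
```

This avoids the per-group Python-level function call entirely, since both notna and the grouped 'any' reduction run in vectorized pandas code.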