python pandas dataframe filtering subset

Select a subset of a dataframe based on conditions : category and note

I am trying to Select a subset of a dataframe where following conditions are satisfied:

for same category keep only the row with highest note,
if category=na keep the row

Here's my dataframe example :

The expected result:

What is the efficient way to do it? Thank you

Solution

Use DataFrame.sort_values with mask chained DataFrame.duplicated with bitwise OR for missing rows by same column:

df1 = df.sort_values(['category_id','note'])

df1 = df1[~df1.duplicated(['category_id'], keep='last') | 
           df1['category_id'].isna()].sort_index()
print (df1)
   book_id category_id  note
1      id2          c1     5
4      id5         NaN     1
5      id6         NaN     7
6      id7          c2     6
7      id8          c3     2
8      id9         NaN     8
9     id10         NaN     4
10    id11         NaN     9

Or use Series.fillna with range (necessary non integers in category_id) and then use DataFrameGroupBy.idxmax:

s = df['category_id'].fillna(pd.Series(range(len(df)), index=df.index))

df1 = df.loc[df.groupby(s)['note'].idxmax()].sort_index()
print (df1)
   book_id category_id  note
1      id2          c1     5
4      id5         NaN     1
5      id6         NaN     7
6      id7          c2     6
7      id8          c3     2
8      id9         NaN     8
9     id10         NaN     4
10    id11         NaN     9