Search code examples
pythonpandasdataframefilteringsubset

Select a subset of a dataframe based on conditions : category and note


I am trying to Select a subset of a dataframe where following conditions are satisfied:

  • for same category keep only the row with highest note,
  • if category=na keep the row

Here's my dataframe example :

enter image description here

The expected result:

enter image description here

What is the efficient way to do it? Thank you


Solution

  • Use DataFrame.sort_values with mask chained DataFrame.duplicated with bitwise OR for missing rows by same column:

    df1 = df.sort_values(['category_id','note'])
    
    df1 = df1[~df1.duplicated(['category_id'], keep='last') | 
               df1['category_id'].isna()].sort_index()
    print (df1)
       book_id category_id  note
    1      id2          c1     5
    4      id5         NaN     1
    5      id6         NaN     7
    6      id7          c2     6
    7      id8          c3     2
    8      id9         NaN     8
    9     id10         NaN     4
    10    id11         NaN     9
    

    Or use Series.fillna with range (necessary non integers in category_id) and then use DataFrameGroupBy.idxmax:

    s = df['category_id'].fillna(pd.Series(range(len(df)), index=df.index))
    
    df1 = df.loc[df.groupby(s)['note'].idxmax()].sort_index()
    print (df1)
       book_id category_id  note
    1      id2          c1     5
    4      id5         NaN     1
    5      id6         NaN     7
    6      id7          c2     6
    7      id8          c3     2
    8      id9         NaN     8
    9     id10         NaN     4
    10    id11         NaN     9