I am trying to Select a subset of a dataframe where following conditions are satisfied:
Here's my dataframe example :
The expected result:
What is the efficient way to do it? Thank you
Use DataFrame.sort_values
with mask chained DataFrame.duplicated
with bitwise OR
for missing rows by same column:
df1 = df.sort_values(['category_id','note'])
df1 = df1[~df1.duplicated(['category_id'], keep='last') |
df1['category_id'].isna()].sort_index()
print (df1)
book_id category_id note
1 id2 c1 5
4 id5 NaN 1
5 id6 NaN 7
6 id7 c2 6
7 id8 c3 2
8 id9 NaN 8
9 id10 NaN 4
10 id11 NaN 9
Or use Series.fillna
with range
(necessary non integers in category_id
) and then use DataFrameGroupBy.idxmax
:
s = df['category_id'].fillna(pd.Series(range(len(df)), index=df.index))
df1 = df.loc[df.groupby(s)['note'].idxmax()].sort_index()
print (df1)
book_id category_id note
1 id2 c1 5
4 id5 NaN 1
5 id6 NaN 7
6 id7 c2 6
7 id8 c3 2
8 id9 NaN 8
9 id10 NaN 4
10 id11 NaN 9