I have this data (multiple queries, each query has multiple variants, each variant has multiple durations):
import pandas as pd
df = pd.DataFrame({
"query": ["q1", "q1", "q1", "q1", "q2" ],
"variant": [ "a", "a", "b", "c", "a" ],
"duration": [ 50, 40, 30, 20, 90 ],
})
And I would like to have a dataframe that contains for each query, for each variant the fastest duration, and variants in each query are sorted by their duration.
This almost works:
df.groupby("query").apply(
lambda x: x.groupby("variant").apply(
lambda x: x.sort_values("duration").head(1)
, include_groups=False).sort_values("duration")
, include_groups=False)
But I'd like to remove the marked rownumber column. I'm also wondering if there is a better way to get this result.
Don't use nested groupby
, this is highly inefficient, instead, sort_values
and drop_duplicates
:
out = (df.sort_values(by=['query', 'duration', 'variant'])
.drop_duplicates(['query', 'variant'], ignore_index=True)
)
Output:
query variant duration
0 q1 c 20
1 q1 b 30
2 q1 a 40
3 q2 a 90
Or, if the final order of the rows doesn't matter, groupby.idxmin
:
out = df.loc[df.groupby(['query', 'variant'])['duration'].idxmin()]
Output:
query variant duration
1 q1 a 40
2 q1 b 30
3 q1 c 20
4 q2 a 90
For the sake of completeness, you would have needed to add a droplevel
to your approach:
df.groupby("query").apply(
lambda x: x.groupby("variant").apply(
lambda x: x.sort_values("duration").head(1)
, include_groups=False).sort_values("duration")
, include_groups=False).droplevel(-1)
Output:
duration
query variant
q1 c 20
b 30
a 40
q2 a 90