Search code examples
pandasdataframegroup-by

Remove rownumber column after pandas groupby/apply


I have this data (multiple queries, each query has multiple variants, each variant has multiple durations):

import pandas as pd

df = pd.DataFrame({
    "query":    ["q1", "q1", "q1", "q1", "q2" ],
    "variant":  [ "a",  "a",  "b",  "c",  "a" ],
    "duration": [ 50,   40,   30,   20,   90  ],
})

And I would like to have a dataframe that contains for each query, for each variant the fastest duration, and variants in each query are sorted by their duration.

This almost works:

df.groupby("query").apply(
    lambda x: x.groupby("variant").apply(
        lambda x: x.sort_values("duration").head(1)
    , include_groups=False).sort_values("duration")
, include_groups=False)

enter image description here

But I'd like to remove the marked rownumber column. I'm also wondering if there is a better way to get this result.


Solution

  • Don't use nested groupby, this is highly inefficient, instead, sort_values and drop_duplicates:

    out = (df.sort_values(by=['query', 'duration', 'variant'])
             .drop_duplicates(['query', 'variant'], ignore_index=True)
           )
    

    Output:

      query variant  duration
    0    q1       c        20
    1    q1       b        30
    2    q1       a        40
    3    q2       a        90
    

    Or, if the final order of the rows doesn't matter, groupby.idxmin:

    out = df.loc[df.groupby(['query', 'variant'])['duration'].idxmin()]
    

    Output:

      query variant  duration
    1    q1       a        40
    2    q1       b        30
    3    q1       c        20
    4    q2       a        90
    

    For the sake of completeness, you would have needed to add a droplevel to your approach:

    df.groupby("query").apply(
        lambda x: x.groupby("variant").apply(
            lambda x: x.sort_values("duration").head(1)
        , include_groups=False).sort_values("duration")
    , include_groups=False).droplevel(-1)
    

    Output:

                   duration
    query variant          
    q1    c              20
          b              30
          a              40
    q2    a              90