Search code examples
pythonpandasdataframegroup-bygreatest-n-per-group

Sorting columns and selecting top n rows in each group pandas dataframe


I have a dataframe like this:

mainid  pidx    pidy   score
  1      a        b      2
  1      a        c      5
  1      c        a      7
  1      c        b      2
  1      a        e      8
  2      x        y      1
  2      y        z      3
  2      z        y      5
  2      x        w      12
  2      x        v      1
  2      y        x      6  

I want to groupby on column 'pidx' and then sort score in descending order in each group i.e for each pidx

and then select head(2) i.e top 2 from each group.

The result I am looking for is like this:

mainid   pidx    pidy    score
  1        a      e        8
  1        a      c        5
  1        c      a        7
  1        c      b        2
  2        x      w        12
  2        x      y        1
  2        y      x        6
  2        y      z        3
  2        z      y        5

What I tried was:

df.sort(['pidx','score'],ascending = False).groupby('pidx').head(2) 

and this seems to work, but I dont know if it's the right approach if working on a huge dataset. What other best method can I use to get such result?


Solution

  • There are 2 solutions:

    1.sort_values and aggregate head:

    df1 = df.sort_values('score',ascending = False).groupby('pidx').head(2)
    print (df1)
    
        mainid pidx pidy  score
    8        2    x    w     12
    4        1    a    e      8
    2        1    c    a      7
    10       2    y    x      6
    1        1    a    c      5
    7        2    z    y      5
    6        2    y    z      3
    3        1    c    b      2
    5        2    x    y      1
    

    2.set_index and aggregate nlargest:

    df = df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index() 
    print (df)
      pidx  mainid pidy  score
    0    a       1    e      8
    1    a       1    c      5
    2    c       1    a      7
    3    c       1    b      2
    4    x       2    w     12
    5    x       2    y      1
    6    y       2    x      6
    7    y       2    z      3
    8    z       2    y      5
    

    Timings:

    np.random.seed(123)
    N = 1000000
    
    L1 = list('abcdefghijklmnopqrstu')
    L2 = list('efghijklmnopqrstuvwxyz')
    df = pd.DataFrame({'mainid':np.random.randint(1000, size=N),
                       'pidx': np.random.randint(10000, size=N),
                       'pidy': np.random.choice(L2, N),
                       'score':np.random.randint(1000, size=N)})
    #print (df)
    
    def epat(df):
        grouped = df.groupby('pidx')
        new_df = pd.DataFrame([], columns = df.columns)
        for key, values in grouped:
            new_df = pd.concat([new_df, grouped.get_group(key).sort_values('score', ascending=True)[:2]], 0)
        return (new_df)
    
    print (epat(df))
    
    In [133]: %timeit (df.sort_values('score',ascending = False).groupby('pidx').head(2))
    1 loop, best of 3: 309 ms per loop
    
    In [134]: %timeit (df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index())
    1 loop, best of 3: 7.11 s per loop
    
    In [147]: %timeit (epat(df))
    1 loop, best of 3: 22 s per loop