Selecting top N rows for each group based on value in column


I have a data frame like the one below:

x<-c(3,2,1,8,7,11,10,9,7,5,4)
y<-c("a","a","a", "b","b","c","c","c","c","c","c")
z<-c(2,2,2,1,1,3,3,3,3,3,3)
df<-data.frame(x,y,z)

df
    x y z
1   3 a 2
2   2 a 2
3   1 a 2
4   8 b 1
5   7 b 1
6  11 c 3
7  10 c 3
8   9 c 3
9   7 c 3
10  5 c 3
11  4 c 3

I want to select the top n rows (largest x) for each group defined by column y, where n is given in column z. The output should look like this:

output:
   x y z
1  3 a 2
2  2 a 2
3  8 b 1
4 11 c 3
5 10 c 3
6  9 c 3

Solution

  • A solution with base R:

    # Split df by y, sort each piece by x in decreasing order, keep its first z rows,
    # and rbind the pieces back together:
    do.call(rbind, 
            lapply(split(df, df$y), 
                   function(df1) df1[order(df1$x, decreasing=TRUE), ][1:unique(df1$z), ]))
    #     x y z
    #a.1  3 a 2
    #a.2  2 a 2
    #b    8 b 1
    #c.6 11 c 3
    #c.7 10 c 3
    #c.8  9 c 3
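
    One caveat with indexing by 1:unique(df1$z): if z is larger than the number of
    rows in a group, the index runs past the end and yields NA rows, and if z is 0,
    1:0 still selects the first row. A minimal sketch that uses head() instead,
    assuming z is constant within each group as in the question. The helper name
    top_by_group is hypothetical and not part of the original answer.

    ```r
    # Hypothetical helper: top z rows (largest x) of each y group.
    top_by_group <- function(df) {
      pieces <- lapply(split(df, df$y), function(g) {
        g <- g[order(g$x, decreasing = TRUE), ]  # largest x first
        head(g, g$z[1])                          # assumes z is constant within the group
      })
      out <- do.call(rbind, pieces)
      rownames(out) <- NULL  # drop the "a.1"-style row names added by rbind
      out
    }

    # Same data as in the question.
    df <- data.frame(
      x = c(3, 2, 1, 8, 7, 11, 10, 9, 7, 5, 4),
      y = c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c", "c"),
      z = c(2, 2, 2, 1, 1, 3, 3, 3, 3, 3, 3)
    )
    top_by_group(df)
    ```

    head() returns at most the rows that exist, so an oversized z simply keeps the
    whole group.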
    

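    EDIT:
    A much more direct way (still in base R), suggested in a comment by @mt1022.
    It keeps the first z rows of each group in their current order, so it relies on
    the rows already being sorted by decreasing x within each group, as they are here: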

    df[ave(1:nrow(df), df$y, FUN = seq_along) <= df$z, ]
    #   x y z
    #1  3 a 2
    #2  2 a 2
    #4  8 b 1
    #6 11 c 3
    #7 10 c 3
    #8  9 c 3
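
    For input that is not already sorted by decreasing x within each group, a sort
    can be added before the ave() filter. A minimal sketch, using the question's
    rows in shuffled order; the names sorted, rank_in_group and result are
    illustrative and not part of the original answer.

    ```r
    # The question's rows, deliberately shuffled.
    df <- data.frame(
      x = c(1, 11, 3, 7, 2, 8, 10, 4, 9, 5, 7),
      y = c("a", "c", "a", "b", "a", "b", "c", "c", "c", "c", "c"),
      z = c(2, 3, 2, 1, 2, 1, 3, 3, 3, 3, 3)
    )

    # Sort by group, then by x in decreasing order within each group.
    sorted <- df[order(df$y, -df$x), ]

    # Position of each row within its group (1 = largest x), then keep positions up to z.
    rank_in_group <- ave(seq_len(nrow(sorted)), sorted$y, FUN = seq_along)
    result <- sorted[rank_in_group <= sorted$z, ]
    result
    ```

    The row names of result are the original row indices, which can be reset with
    rownames(result) <- NULL if they are not needed.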