Search code examples
rdata.tablerankgroupmutate

Rank using multiple columns and within groups in R


I want to rank values within groups using two attributes. In theory frank in data.table should be able to do this, but I am running into a snag where the function does not accept two columns when put into the %>% syntax or in the data.table syntax to split into groups.

Using the mtcars dataset, I am trying to order by cyl and disp within two dummy groups. I can order the data within groups using order as suggested in another post (that I can't track down) but I cannot set a method to deal with ties in ranks. I can also use frank to order by cyl and then disp but I cannot get that function to do within the groups.

data(mtcars)

mtcars<-as.data.table(mtcars)%>%select(cyl, disp, gear)
mtcars$gr1<-rep(c("fast", "slow"), 16)
mtcars$gr2<-rep(c("red", "blue"), each=16)

This is the code that I am trying to use but which fails but I do not understand the hints in the error message:

mtcars%>%group_by(gr1, gr2)%>%
  arrange(gr1, gr2, .by.group=TRUE)%>% #by.group to keep within groups : may not be helpful for frank
  mutate(ranks=frank(-cyl, -disp, na.last=TRUE, ties.method="min"))#orders by cyl and then by disp

#Error in `mutate()`:
#  ℹ In argument: `ranks = frank(-cyl, -disp, na.last = TRUE, ties.method = "min")`.
#ℹ In group 1: `gr1 = "fast"`, `gr2 = "blue"`.
#Caused by error in `frankv()`:
#  ! x is a single vector, non-NULL 'cols' doesn't make sense
#Run `rlang::last_error()` to see where the error occurred.


#works but only uses cyl
    mtcars[, list(ranks= frank(-cyl,  na.last=TRUE, ties.method="min")), 
                                by=list(gr1, gr2)]
 #fails   
    mtcars[, list(ranks= frank(-cyl, -disp,  na.last=TRUE, ties.method="min")), 
                     by=list(gr1, gr2)]

Solution

  • Looking at the ?frank help page, it does not accept multiple vectors as input as you are trying to use it. It either accepts (a) a single vector or (b) a list, data frame, or data.table as the first argument, along with the column names (or a vector of quoted column names or indices).

    From ?frank:

    Arguments

    x A vector, or list with all its elements identical in length or data.frame or data.table.

    ... Only for lists, data.frames and data.tables. The columns to calculate ranks based on. Do not quote column names. If ... is missing, all columns are considered by default. To sort by a column in descending order prefix "-", e.g., frank(x, a, -b, c). -b works when b is of type character as well.

    Note that ... works "only for for lists, data.frames and data.tables."

    You don't want to rank a single vector, which means you need to use option (b) - give frank the data.table as its first argument, and then specify the columns. To do this by group in data.table we use .SD. (I don't think this interface will play nicely with dplyr::group_by.)

    Here's an all data.table version of my interpretation of your code. You haven't provided expected output, so I can only hope it's what you want.

    mtcars[, result := frank(.SD, -cyl, -disp, na.last = TRUE, ties.method = "min"), by = .(gr1, gr2)]
    mtcars
    #    cyl  disp gear  gr1  gr2 result
    #  1:   6 160.0    4 fast  red      6
    #  2:   6 160.0    4 slow  red      7
    #  3:   4 108.0    4 fast  red      8
    #  4:   6 258.0    3 slow  red      4
    #  5:   8 360.0    3 fast  red      2
    #  6:   6 225.0    3 slow  red      5
    #  7:   8 360.0    3 fast  red      2
    #  8:   4 146.7    4 slow  red      8
    #  9:   4 140.8    4 fast  red      7
    # 10:   6 167.6    4 slow  red      6
    # 11:   6 167.6    4 fast  red      5
    # 12:   8 275.8    3 slow  red      2
    # 13:   8 275.8    3 fast  red      4
    # 14:   8 275.8    3 slow  red      2
    # 15:   8 472.0    3 fast  red      1
    # 16:   8 460.0    3 slow  red      1
    # 17:   8 440.0    3 fast blue      1
    # 18:   4  78.7    4 slow blue      7
    # 19:   4  75.7    4 fast blue      8
    # 20:   4  71.1    4 slow blue      8
    # 21:   4 120.1    3 fast blue      7
    # 22:   8 318.0    3 slow blue      2
    # 23:   8 304.0    3 fast blue      4
    # 24:   8 350.0    3 slow blue      1
    # 25:   8 400.0    3 fast blue      2
    # 26:   4  79.0    4 slow blue      6
    # 27:   4 120.3    5 fast blue      6
    # 28:   4  95.1    5 slow blue      5
    # 29:   8 351.0    5 fast blue      3
    # 30:   6 145.0    5 slow blue      3
    # 31:   8 301.0    5 fast blue      5
    # 32:   4 121.0    4 slow blue      4
    #     cyl  disp gear  gr1  gr2 result
    

    Thanks to Jamie's comment, we can construct a list on the fly of the ranking columns which we can do within a dplyr pipeline:

    ## using list() with data.table
    mtcars[, result := frank(list(-cyl, -disp), na.last = TRUE, ties.method = "min"), by = .(gr1, gr2)]
    
    ## using list() with dplyr
    mtcars %>% 
      group_by(gr1, gr2) %>%
      mutate(result = frank(list(-cyl, -disp), na.last=TRUE, ties.method="min"))