Search code examples
rgroupingrank

Add a new column with a sortorder grouped on column but in original sort order


I've got a dataset that I want to order in a certain way, I need it to be grouped by CENTER but other than that the records must remain the same sort order as the original dataset. So VUMC is the first and KUL is second, so that should stay in the same order.

Admittedly I'm not as experienced in R, but in SQL (which I can't use here) I would just do a DENSE_RANK to get the desired result. In R there a base rank function, and the dplyr library has a dense_rank function, but for some reason I can't get it to work.

Here is the code I tried

df_results <- read.table(text = "
CENTER;FILE;HB;LineNr
VUMC;NPUT_00172.MRC;12.7;1
VUMC;PHYM_00577.MRC;11.9;2
VUMC;CPNU_00625.MRC;9.2;3
KUL;UYTC_00146.MRC;10.8;4
KUL;UKFA_00766.MRC;10.7;5
LOR;RFAF_00105.MRC;12.9;6
LOR;ARFW_00961.MRC;12.4;7
KUL;HMCU_00630.MRC;8.6;8
VUMC;FPRB_00509.MRC;10.8;9
LOR;TWUA_00651.MRC;10.6;10
", header = TRUE, sep = ";", na.strings = "")

library(dplyr)

df_results <- df_results %>%
  group_by(CENTER) %>%
  mutate(min_LineNr = min(LineNr),
     dense_linenr = rank(min_LineNr),
     NewSortOrder = (dense_linenr*1000 + LineNr) )
     

The min(LineNr) part works as expected, but the rank results in 2.0 and 2.5 which I can't quite explain. I've also tried the mutate(rank..) as a separate statement, but that doesn't make any difference. And I've also tried dense_rank which just results in all 1's for the new column, which again was not what I was expecting.

CENTER FILE HB LineNr min_LineNr dense_linenr NewSortOrder
VUMC NPUT_00172.MRC 12.7 1 1 2.5 2501
VUMC PHYM_00577.MRC 11.9 2 1 2.5 2502
VUMC CPNU_00625.MRC 9.2 3 1 2.5 2503
KUL UYTC_00146.MRC 10.8 4 4 2 2004
KUL UKFA_00766.MRC 10.7 5 4 2 2005
LOR RFAF_00105.MRC 12.9 6 6 2 2006
LOR ARFW_00961.MRC 12.4 7 6 2 2007
KUL HMCU_00630.MRC 8.6 8 4 2 2008
VUMC FPRB_00509.MRC 10.8 9 1 2.5 2509
LOR TWUA_00651.MRC 10.6 10 6 2 2010

The desired result should ideally be like the table below, so NewSortOrder contains a sort order which is grouped by CENTER but within those centers it's sorted on the original lines order.

CENTER FILE HB LineNr NewSortOrder
VUMC NPUT_00172.MRC 12.7 1 1001
VUMC PHYM_00577.MRC 11.9 2 1002
VUMC CPNU_00625.MRC 9.2 3 1003
VUMC FPRB_00509.MRC 10.8 9 1004
KUL UYTC_00146.MRC 10.8 4 2001
KUL UKFA_00766.MRC 10.7 5 2002
KUL HMCU_00630.MRC 8.6 8 2003
LOR RFAF_00105.MRC 12.9 6 3001
LOR ARFW_00961.MRC 12.4 7 3002
LOR TWUA_00651.MRC 10.6 10 3003

My questions are: Why do rank and dense_rank give these strange results? Why doesn't it do what it says on the tin, so to speak? And what R code or library is the most straight forward way to get the desired resulting column?


Solution

  • arrange() would be the desired method to sort rows of a data.frame. It arranges text values alphabeitally, but you can change this behavour by changing the column into a factor with levels in their natural order like so:

    df_results %>%
        mutate(CENTER=factor(CENTER,levels=unique(CENTER))) %>%
        group_by(CENTER) %>%
        mutate(NewSortOrder = as.numeric(CENTER) * 1000 + row_number()) %>%
        arrange(CENTER)
    

    Produces:

    # A tibble: 10 × 5
    # Groups:   CENTER [3]
       CENTER FILE              HB LineNr NewSortOrder
       <fct>  <chr>          <dbl>  <int>        <dbl>
     1 VUMC   NPUT_00172.MRC  12.7      1         1001
     2 VUMC   PHYM_00577.MRC  11.9      2         1002
     3 VUMC   CPNU_00625.MRC   9.2      3         1003
     4 VUMC   FPRB_00509.MRC  10.8      9         1004
     5 KUL    UYTC_00146.MRC  10.8      4         2001
     6 KUL    UKFA_00766.MRC  10.7      5         2002
     7 KUL    HMCU_00630.MRC   8.6      8         2003
     8 LOR    RFAF_00105.MRC  12.9      6         3001
     9 LOR    ARFW_00961.MRC  12.4      7         3002
    10 LOR    TWUA_00651.MRC  10.6     10         3003
    

    Furthermore, rank() does exactly what it says it should do, consider:

    rank( c("B","A","C") ) # gives: 2 1 3
    

    With ties (as in your case) it does the average. Consider for example your first group, VUMC, where min_lineNr is four 1's:

    rank( c(1,1,1,1) ) # gives 2.5 2.5 2.5 2.5
    

    4 different numbers would normaly receive ranks 1 2 3 4 (in some order). The default is to do the average of these ranks when there are ties, hence 2.5 (average of ranks c(1,2,3,4))

    You can easily track and make sense of this behavour by trying various sized vectors of equal numbers:

    rank( 1 )          # 1
    rank( c(1,1) )     # 1.5 1.5 (mean of c(1,2))
    rank( c(1,1,1) )   # 2 2 2 (mean of c(1,2,3))
    rank( c(1,1,1,1) ) # 2.5 2.5 2.5 2.5 (mean of c(1,2,3,4))
    # etc.
    

    The other strategies to deal with ties are given in ?rank