I've got a dataset that I want to order in a certain way, I need it to be grouped by CENTER
but other than that the records must remain the same sort order as the original dataset. So VUMC is the first and KUL is second, so that should stay in the same order.
Admittedly I'm not as experienced in R, but in SQL (which I can't use here) I would just do a DENSE_RANK
to get the desired result. In R there a base rank
function, and the dplyr library has a dense_rank
function, but for some reason I can't get it to work.
Here is the code I tried
df_results <- read.table(text = "
CENTER;FILE;HB;LineNr
VUMC;NPUT_00172.MRC;12.7;1
VUMC;PHYM_00577.MRC;11.9;2
VUMC;CPNU_00625.MRC;9.2;3
KUL;UYTC_00146.MRC;10.8;4
KUL;UKFA_00766.MRC;10.7;5
LOR;RFAF_00105.MRC;12.9;6
LOR;ARFW_00961.MRC;12.4;7
KUL;HMCU_00630.MRC;8.6;8
VUMC;FPRB_00509.MRC;10.8;9
LOR;TWUA_00651.MRC;10.6;10
", header = TRUE, sep = ";", na.strings = "")
library(dplyr)
df_results <- df_results %>%
group_by(CENTER) %>%
mutate(min_LineNr = min(LineNr),
dense_linenr = rank(min_LineNr),
NewSortOrder = (dense_linenr*1000 + LineNr) )
The min(LineNr)
part works as expected, but the rank
results in 2.0
and 2.5
which I can't quite explain.
I've also tried the mutate(rank..)
as a separate statement, but that doesn't make any difference.
And I've also tried dense_rank
which just results in all 1
's for the new column, which again was not what I was expecting.
CENTER | FILE | HB | LineNr | min_LineNr | dense_linenr | NewSortOrder |
---|---|---|---|---|---|---|
VUMC | NPUT_00172.MRC | 12.7 | 1 | 1 | 2.5 | 2501 |
VUMC | PHYM_00577.MRC | 11.9 | 2 | 1 | 2.5 | 2502 |
VUMC | CPNU_00625.MRC | 9.2 | 3 | 1 | 2.5 | 2503 |
KUL | UYTC_00146.MRC | 10.8 | 4 | 4 | 2 | 2004 |
KUL | UKFA_00766.MRC | 10.7 | 5 | 4 | 2 | 2005 |
LOR | RFAF_00105.MRC | 12.9 | 6 | 6 | 2 | 2006 |
LOR | ARFW_00961.MRC | 12.4 | 7 | 6 | 2 | 2007 |
KUL | HMCU_00630.MRC | 8.6 | 8 | 4 | 2 | 2008 |
VUMC | FPRB_00509.MRC | 10.8 | 9 | 1 | 2.5 | 2509 |
LOR | TWUA_00651.MRC | 10.6 | 10 | 6 | 2 | 2010 |
The desired result should ideally be like the table below, so NewSortOrder contains a sort order which is grouped by CENTER
but within those centers it's sorted on the original lines order.
CENTER | FILE | HB | LineNr | NewSortOrder |
---|---|---|---|---|
VUMC | NPUT_00172.MRC | 12.7 | 1 | 1001 |
VUMC | PHYM_00577.MRC | 11.9 | 2 | 1002 |
VUMC | CPNU_00625.MRC | 9.2 | 3 | 1003 |
VUMC | FPRB_00509.MRC | 10.8 | 9 | 1004 |
KUL | UYTC_00146.MRC | 10.8 | 4 | 2001 |
KUL | UKFA_00766.MRC | 10.7 | 5 | 2002 |
KUL | HMCU_00630.MRC | 8.6 | 8 | 2003 |
LOR | RFAF_00105.MRC | 12.9 | 6 | 3001 |
LOR | ARFW_00961.MRC | 12.4 | 7 | 3002 |
LOR | TWUA_00651.MRC | 10.6 | 10 | 3003 |
My questions are:
Why do rank
and dense_rank
give these strange results? Why doesn't it do what it says on the tin, so to speak?
And what R code or library is the most straight forward way to get the desired resulting column?
arrange()
would be the desired method to sort rows of a data.frame. It arranges text values alphabeitally, but you can change this behavour by changing the column into a factor with levels in their natural order like so:
df_results %>%
mutate(CENTER=factor(CENTER,levels=unique(CENTER))) %>%
group_by(CENTER) %>%
mutate(NewSortOrder = as.numeric(CENTER) * 1000 + row_number()) %>%
arrange(CENTER)
Produces:
# A tibble: 10 × 5
# Groups: CENTER [3]
CENTER FILE HB LineNr NewSortOrder
<fct> <chr> <dbl> <int> <dbl>
1 VUMC NPUT_00172.MRC 12.7 1 1001
2 VUMC PHYM_00577.MRC 11.9 2 1002
3 VUMC CPNU_00625.MRC 9.2 3 1003
4 VUMC FPRB_00509.MRC 10.8 9 1004
5 KUL UYTC_00146.MRC 10.8 4 2001
6 KUL UKFA_00766.MRC 10.7 5 2002
7 KUL HMCU_00630.MRC 8.6 8 2003
8 LOR RFAF_00105.MRC 12.9 6 3001
9 LOR ARFW_00961.MRC 12.4 7 3002
10 LOR TWUA_00651.MRC 10.6 10 3003
Furthermore, rank()
does exactly what it says it should do, consider:
rank( c("B","A","C") ) # gives: 2 1 3
With ties (as in your case) it does the average. Consider for example your first group, VUMC
, where min_lineNr is four 1's:
rank( c(1,1,1,1) ) # gives 2.5 2.5 2.5 2.5
4 different numbers would normaly receive ranks 1 2 3 4 (in some order). The default is to do the average of these ranks when there are ties, hence 2.5 (average of ranks c(1,2,3,4))
You can easily track and make sense of this behavour by trying various sized vectors of equal numbers:
rank( 1 ) # 1
rank( c(1,1) ) # 1.5 1.5 (mean of c(1,2))
rank( c(1,1,1) ) # 2 2 2 (mean of c(1,2,3))
rank( c(1,1,1,1) ) # 2.5 2.5 2.5 2.5 (mean of c(1,2,3,4))
# etc.
The other strategies to deal with ties are given in ?rank