Rank a sorted dataset using apply function


My dataframe looks like this:

head(temp$HName)

[1] "UNIVERSITY OF TEXAS HEALTH SCIENCE CENTER AT TYLER"
[2] "METHODIST HOSPITAL,THE"                            
[3] "TOMBALL REGIONAL MEDICAL CENTER"                   
[4] "METHODIST SUGAR LAND HOSPITAL"                     
[5] "GULF COAST MEDICAL CENTER"                         
[6] "VHS HARLINGEN HOSPITAL COMPANY LLC"   

head(temp$Rate)

[1] 7.3 8.3 8.7 8.7 8.8 8.9
76 Levels: 7.3 8.3 8.7 8.8 8.9 9 9.1 9.2 9.3 9.4 9.5 9.6 ... 17.1

head(temp$Rank)

[1] NA NA NA NA NA NA

temp$Rate is already sorted. I am trying to write a function, assignRank, that fills a new column temp$Rank with the values 1, 2, 3, 3, 4, 5, so that rows with equal rates share the same rank.

My code is as below:

tapply(temp$Rank,temp$Rate, assignRank)

where:

    assignRank<- function(r=1){
      temp$Rank <- r
      r <- r + 1
      return(r)
    }

I get the following error when running tapply:

   tapply(temp$Rank,temp$Rate, assignRank)
 Error in `$<-.data.frame`(`*tmp*`, "Rank", value = c(NA, NA)) : 
  replacement has 2 rows, data has 301 

Please advise where I am going wrong?


Solution

  • I use data.table for tasks like this, because it makes both sorting and ranking efficient and the syntax is simple:

    library(data.table)
    setkey(setDT(temp), Rate) # This will sort your data set by Rate in case it's not yet sorted
    temp[, Rank := .GRP, by = Rate]
    temp
    #                                                 HName Rate Rank
    # 1: UNIVERSITY OF TEXAS HEALTH SCIENCE CENTER AT TYLER  7.3    1
    # 2:                             METHODIST HOSPITAL,THE  8.3    2
    # 3:                    TOMBALL REGIONAL MEDICAL CENTER  8.7    3
    # 4:                      METHODIST SUGAR LAND HOSPITAL  8.7    3
    # 5:                          GULF COAST MEDICAL CENTER  8.8    4
    # 6:                 VHS HARLINGEN HOSPITAL COMPANY LLC  8.9    5
    

    Or you could easily do the same in base R (assuming your data is sorted by Rate):

    as.numeric(factor(temp$Rate))
    ## [1] 1 2 3 3 4 5
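
    The output in the question shows temp$Rate stored as a factor, so a variant that depends neither on the level order nor on the rows being sorted may be safer. The following is only a sketch: rate is a hypothetical, deliberately unsorted stand-in for temp$Rate, and rate_num and rank_dense are names introduced here for illustration.

    ```r
    # Hypothetical stand-in for temp$Rate, stored as a factor as in the question
    rate <- factor(c("8.7", "7.3", "8.9", "8.7", "8.3", "8.8"))

    # Convert the factor labels back to numbers first; calling as.numeric()
    # directly on a factor returns the internal level codes, not the values
    rate_num <- as.numeric(as.character(rate))

    # Dense rank: position of each value among the sorted unique values
    rank_dense <- match(rate_num, sort(unique(rate_num)))
    print(rank_dense)
    ## [1] 3 1 5 3 2 4
    ```

    Tied rates (the two 8.7 values) receive the same rank, and no rank is skipped after the tie.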
    

    Or you could use the dense_rank function from the dplyr package, which does not require the data set to be sorted first:

    library(dplyr)
    temp %>% 
      mutate(Rank = dense_rank(Rate))
    #                                                HName Rate Rank
    # 1 UNIVERSITY OF TEXAS HEALTH SCIENCE CENTER AT TYLER  7.3    1
    # 2                             METHODIST HOSPITAL,THE  8.3    2
    # 3                    TOMBALL REGIONAL MEDICAL CENTER  8.7    3
    # 4                      METHODIST SUGAR LAND HOSPITAL  8.7    3
    # 5                          GULF COAST MEDICAL CENTER  8.8    4
    # 6                 VHS HARLINGEN HOSPITAL COMPANY LLC  8.9    5
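
    As for why the original attempt fails: tapply() calls assignRank once per Rate group and passes it that group's Rank values, so r is never a running counter. The sketch below reproduces this on a hypothetical five-row stand-in for temp (the rates are taken from the question; the data frame itself is an assumption). For the group with two rows, temp$Rank <- r tries to put two values into a column whose length is not a multiple of two, which raises the same kind of error as in the question. Even without the error, the assignment would only change a local copy of temp inside the function.

    ```r
    # Hypothetical five-row stand-in for the `temp` data frame in the question
    temp <- data.frame(
      Rate = c(7.3, 8.3, 8.7, 8.7, 8.8),
      Rank = NA
    )

    # The function exactly as posted in the question
    assignRank <- function(r = 1) {
      temp$Rank <- r   # modifies a local copy of temp, not the global one
      r <- r + 1       # r holds this group's Rank values (NA), not a counter
      return(r)
    }

    # Capture the error message instead of stopping the script
    msg <- tryCatch(
      tapply(temp$Rank, temp$Rate, assignRank),
      error = function(e) conditionMessage(e)
    )
    print(msg)

    # The global data frame is untouched: Rank is still all NA
    print(temp$Rank)
    ```

    In the question the data has 301 rows, so the first group with two rows already triggers "replacement has 2 rows, data has 301".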