r categorical-data

Change values from categorical to nominal in R


I want to replace every value in the categorical (character) columns with its rank, where the rank is the index of the value among the sorted unique elements of that column.

For instance,

> data[1:5,1] 
[1] "B2" "C4" "C5" "C1" "B5"

then I want the column to hold these ranks in place of the categorical values:

> data[1:5,1]  
[1] "1" "4" "5" "3" "2"

Another column:

> data[1:5,3]
[1] "Verified"        "Source Verified" "Not Verified"    "Source Verified" "Source Verified"

Then the updated column:

> data[1:5,3]
[1] "3" "2" "1" "2" "2"

I used the following code for this task, but it takes a long time to run:

for(i in 1:ncol(data)){
  if(is.character(data[,i])){
    temp <- sort(unique(data[,i]))
    for(j in 1:nrow(data)){
      for(k in 1:length(temp)){
        if(data[j,i] == temp[k]){
          data[j,i] <- k}
      }
    }
  }
}

Please suggest an efficient way to do this, if possible. Thanks.


Solution

  • Here is a solution in base R. I create a helper function that converts each column to a factor using its sorted unique values as levels. This is similar to what you did, except that I use as.integer to get the ranking values.

    # levels must be sorted; unique() alone would number values in order of appearance
    rank_fac <- function(col1) 
       as.integer(factor(col1, levels = sort(unique(col1))))
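
    The integer codes depend entirely on the order of the levels, so it may help to compare the two orderings side by side. This is a small sketch on the first column from the question; the variable names are only illustrative.

    ```r
    x <- c("B2", "C4", "C5", "C1", "B5")

    # Levels in order of first appearance: the codes just count new values.
    by_appearance <- as.integer(factor(x, levels = unique(x)))
    by_appearance  # 1 2 3 4 5

    # Levels sorted: the codes are ranks among the sorted unique values.
    by_sorted <- as.integer(factor(x, levels = sort(unique(x))))
    by_sorted      # 1 4 5 3 2
    ```

    Since factor() sorts the unique values by default, as.integer(factor(x)) should give the same ranks as the second call.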
    

    Some example data:

    dx <- data.frame(
      col1 = c("B2", "C4", "C5", "C1", "B5"),
      col2 = c("Verified", "Source Verified", "Not Verified", "Source Verified", "Source Verified")
    )
    

    Apply it without a for loop. lapply is the better choice here because it avoids side effects.

    data.frame(lapply(dx, rank_fac))
    

    Result:

    #   col1 col2
    # 1    1    3
    # 2    4    2
    # 3    5    1
    # 4    3    2
    # 5    2    2
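
    The loop in the question only touches character columns, whereas the call above recodes every column. If the data also holds numeric columns, something along these lines might be closer to the original intent. It is a minimal sketch: dx2, its amount column, and rank_sorted are hypothetical names made up for the illustration.

    ```r
    # Hypothetical data: one character column and one numeric column.
    dx2 <- data.frame(
      col1 = c("B2", "C4", "C5", "C1", "B5"),
      amount = c(10.5, 20, 7.25, 3, 8),
      stringsAsFactors = FALSE
    )

    # Rank by position among the sorted unique values (factor() sorts by default).
    rank_sorted <- function(x) as.integer(factor(x))

    # Flag the character columns, then overwrite only those.
    is_chr <- vapply(dx2, is.character, logical(1))
    dx2[is_chr] <- lapply(dx2[is_chr], rank_sorted)
    ```

    The numeric column is left as it was, and the whole column is recoded in one vectorised call instead of row by row.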
    

    The same thing using data.table syntactic sugar:

    library(data.table)
    setDT(dx)[,lapply(.SD,rank_fac)]
    #    col1 col2
    # 1:    1    3
    # 2:    4    2
    # 3:    5    1
    # 4:    3    2
    # 5:    2    2
    

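    A simpler solution applies when the columns are already factors (for example, when data.frame() was called with stringsAsFactors = TRUE, which was the default before R 4.0.0). Their integer codes follow the sorted levels, so as.integer alone is enough. On character columns it would return NA instead.

    Using only as.integer:

    setDT(dx)[,lapply(.SD,as.integer)]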
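
    The rule in the question, "index of the sorted unique elements", can also be written directly with match(), without going through a factor. This is a sketch of an alternative technique, not part of the answer above; rank_match is a hypothetical helper name.

    ```r
    # Vectorised equivalent of the nested loops: for each value, look up
    # its position in the sorted unique values of the column.
    rank_match <- function(x) match(x, sort(unique(x)))

    status <- c("Verified", "Source Verified", "Not Verified",
                "Source Verified", "Source Verified")
    ranks <- rank_match(status)
    ranks  # 3 2 1 2 2
    ```

    It can be applied per column in the same way as rank_fac, for instance with lapply.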