How to convert from category to numeric in r


Here is my problem:

I have a table with categories and I want to rank them:

category
dog
cat
fish
dog
dog

What I want is to add a column and to rank them:

category    rank
dog         1
cat         2
fish        3
dog         1
dog         1
  • Sorry for the terrible table (help with writing proper tables on Stack Overflow would be great, too).
  • Any ideas on how to add the rank column?

Thanks!


Solution

  • Just for the sake of completeness and because the solution I posted in a comment is an inefficient (and pretty ugly) fix, I'll post an answer too.

    It turned out that the OP's starting point was something like the following:

    x = c("cat", "dog", "fish", "dog", "dog", "cat", "fish", "catfish")
    x = factor(x)
    

    In the end, what was wanted was a manually specified numeric coding of x. As an example, suppose the following mapping is wanted:

    cat -> 1, dog -> 2, fish -> 3, catfish -> 4
    

    So, some alternatives:

    sapply(as.character(x), switch, "cat" = 1, "dog" = 2, "fish" = 3, "catfish" = 4,
           USE.NAMES = FALSE)
    #[1] 1 2 3 2 2 1 3 4
    
    # Note: match's internal 'do_match' calls 'match_transform', which coerces
    # a `factor` to `character`, so there is no need for 'as.character(x)'
    # (http://svn.r-project.org/R/trunk/src/main/unique.c).
    match(x, c("cat", "dog", "fish", "catfish"))
    #[1] 1 2 3 2 2 1 3 4
    
    local({    # local() so that the global 'x' is left unchanged
      levels(x) = list("cat" = 1, "dog" = 2, "fish" = 3, "catfish" = 4)
      as.numeric(x)
    })
    #[1] 1 2 3 2 2 1 3 4
    
    library(fastmatch)
    fmatch(x, c("cat", "dog", "fish", "catfish"))  #a faster alternative to 'match'
    #[1] 1 2 3 2 2 1 3 4
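
    Two more base-R options in the same spirit, offered as a hedged sketch rather than as part of the original answer: indexing a named lookup vector, and rebuilding the factor with an explicit level order. The names `codes`, `by_lookup` and `by_levels` are made up here for illustration.

    ```r
    x <- factor(c("cat", "dog", "fish", "dog", "dog", "cat", "fish", "catfish"))

    # Hypothetical lookup vector (not in the original answer): the names are
    # the categories, the values are the codes we want.
    codes <- c(cat = 1, dog = 2, fish = 3, catfish = 4)

    # Index by name. as.character() matters here: indexing with a factor
    # would use its internal integer codes rather than its labels.
    by_lookup <- unname(codes[as.character(x)])
    by_lookup
    #[1] 1 2 3 2 2 1 3 4

    # Rebuild the factor with the desired level order, then take its codes.
    by_levels <- as.integer(factor(x, levels = names(codes)))
    by_levels
    #[1] 1 2 3 2 2 1 3 4
    ```

    The lookup-vector form is convenient when the codes are not simply 1, 2, 3, ... since any values can be stored in `codes`.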
    

    And a benchmark on a larger vector:

    X = rep(as.character(x), 1e5)
    X = factor(X)
    f1 = function() sapply(as.character(X), switch,
                           "cat" = 1, "dog" = 2, "fish" = 3, "catfish" = 4,
                           USE.NAMES = FALSE)
    f2 = function() match(X, c("cat", "dog", "fish", "catfish"))
    f3 = function() {
      levels(X) = list("cat" = 1, "dog" = 2, "fish" = 3, "catfish" = 4)
      as.numeric(X)
    }
    library(fastmatch)
    f4 = function() fmatch(X, c("cat", "dog", "fish", "catfish"))
    
    library(microbenchmark)
    microbenchmark(f1(), f2(), f3(), f4(), times = 10)
    #Unit: milliseconds
    # expr         min          lq      median         uq       max neval
    # f1() 1745.111666 1816.675337 1961.809102 2107.98236 2896.0291    10
    # f2()   22.043657   22.786647   23.987263   31.45057  111.9600    10
    # f3()   32.704779   32.919150   38.865853   47.67281  134.2988    10
    # f4()    8.814958    8.823309    9.856188   19.66435  104.2827    10
    sum(f1() != f2())
    #[1] 0
    sum(f2() != f3())
    #[1] 0
    sum(f3() != f4())
    #[1] 0
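
    To tie this back to the table in the question, here is a minimal sketch that adds the rank column to a data frame. The data frame name `df` is an assumption (the question does not show the actual object). The codes follow order of first appearance, which is what the desired output in the question shows (dog -> 1, cat -> 2, fish -> 3).

    ```r
    # 'df' is a hypothetical stand-in for the OP's table.
    df <- data.frame(category = c("dog", "cat", "fish", "dog", "dog"),
                     stringsAsFactors = FALSE)

    # unique() keeps values in order of first appearance, so match() returns
    # 1 for the first category seen, 2 for the second, and so on.
    df$rank <- match(df$category, unique(df$category))

    df
    #   category rank
    # 1      dog    1
    # 2      cat    2
    # 3     fish    3
    # 4      dog    1
    # 5      dog    1
    ```

    For a manually chosen order instead, pass that order as the second argument, e.g. `match(df$category, c("cat", "dog", "fish"))`.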