
R - Obtain the connection between the numeric values and the level labels in a factor


I'm struggling to find the connection between the numeric (integer) values stored in an R factor object and its level labels. I know how to define the levels and the labels. But suppose I get an unfamiliar data set that contains several factors (here: sex and color):

test <- data.frame(
                   factor(c(1,2,1,1,2,2,1),
                          levels= c(1,2),
                          labels = c("female", "male")
                          ),
                   factor(c(3,2,2,1,4,4,5),
                          levels= c(1,2,3,4,5),
                          labels= c("red", "green", "blue", "yellow", "brown")
                          )
                  )

names(test) <- c("sex", "color")
test

      sex  color
 1 female   blue
 2   male  green
 3 female  green
 4 female    red
 5   male yellow
 6   male yellow
 7 female  brown

I can obtain the level labels by using attributes(), and I can obtain the numeric values, e.g. by using test$sex <- as.numeric(test$sex).
But how do I know that 1 equals female and 2 equals male? The same goes (even more so) for the colors. How do I establish the connection?

Thanks


Solution

  • As others have said, the integer codes are simply positions in the vector of levels: the first level is stored as 1, the second as 2, and so on. Personally, I find this easiest to visualize in a reference table.

    test <- data.frame(
      sex = factor(c(1,2,1,1,2,2,1),
                   levels= c(1,2),
                   labels = c("female", "male")
      ),
      color = factor(c(3,2,2,1,4,4,5),
                    levels= c(1,2,3,4,5),
                    labels= c("red", "green", "blue", "yellow", "brown")
      )
    )
    
    # Make a reference table
    data.frame(level = seq_along(levels(test$color)),
               label = levels(test$color))
    
      level  label
    1     1    red
    2     2  green
    3     3   blue
    4     4 yellow
    5     5  brown
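
    The same mapping can also be checked directly on the data. The following is a minimal sketch, assuming the color factor from the question; the variable names codes and labels_back are only illustrative. Indexing levels() by the integer codes should give back the labels:

    ```r
    # Rebuild the color factor from the question.
    color <- factor(c(3, 2, 2, 1, 4, 4, 5),
                    levels = c(1, 2, 3, 4, 5),
                    labels = c("red", "green", "blue", "yellow", "brown"))

    # The underlying integer codes, one per observation.
    codes <- as.integer(color)

    # Index the levels by code to translate each code back to its label.
    labels_back <- levels(color)[codes]

    # TRUE if the round trip reproduces the factor's labels.
    identical(labels_back, as.character(color))
    ```

    In other words, code k always refers to levels(color)[k], whatever values the factor was originally built from.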
    

    If you want to get the references for all of the factors in a data frame, you can wrap the code in a function that applies it to every column and keeps only the results for factors:

    factor_reference <- function(data)
    {
      Ref <- 
        lapply(data,
               function(x)
               {
                 if (is.factor(x)) data.frame(level = seq_along(levels(x)),
                                              label = levels(x))
                 else NULL
               }
        )
    
      Ref[!vapply(Ref, is.null, logical(1))]
    }
    
    factor_reference(test)
    $sex
      level  label
    1     1 female
    2     2   male
    
    $color
      level  label
    1     1    red
    2     2  green
    3     3   blue
    4     4 yellow
    5     5  brown
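
    As a possible variation (not part of the original answer), the non-factor columns can be dropped up front with base R's Filter() instead of removing NULL entries afterwards. This is only a sketch: the function name factor_reference2 and the extra id column are hypothetical, added to show that non-factor columns are skipped.

    ```r
    # Example data from the question, plus a hypothetical non-factor column `id`.
    test <- data.frame(
      id = 1:7,
      sex = factor(c(1, 2, 1, 1, 2, 2, 1),
                   levels = c(1, 2),
                   labels = c("female", "male")),
      color = factor(c(3, 2, 2, 1, 4, 4, 5),
                     levels = c(1, 2, 3, 4, 5),
                     labels = c("red", "green", "blue", "yellow", "brown"))
    )

    # Keep only the factor columns, then build one reference table per column.
    factor_reference2 <- function(data) {
      lapply(Filter(is.factor, data), function(x) {
        data.frame(level = seq_along(levels(x)), label = levels(x))
      })
    }

    ref <- factor_reference2(test)

    # Look up the label stored under integer code 4 in `color`.
    as.character(ref$color$label[ref$color$level == 4])
    ```

    The result is a named list with one reference table per factor column, so a single code can be looked up by column name as in the last line.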