How to automate adding factors to variables in large data frame in R


I have a large data frame in R with over 200 mostly character variables that I would like to convert to factors. I have prepared all levels and labels in a separate data frame. For a given variable Var1, the corresponding levels and labels are named Var1_v and Var1_b; for example, for the variable Gender they are named Gender_v and Gender_b.

Here is an example of my data:

df <- data.frame (Gender = c("2","2","1","2"),
                  AgeG = c("3","1","4","2"))

fct <- data.frame (Gender_v  = c("1", "2"),
                  Gender_b = c("Male", "Female"),
                  AgeG_v = c("1","2","3","4"),
                  AgeG_b = c("<25","25-60","65-80",">80"))

df$Gender <- factor(df$Gender, levels = fct$Gender_v, labels = fct$Gender_b, exclude = NULL)
df$AgeG <- factor(df$AgeG, levels = fct$AgeG_v, labels = fct$AgeG_b, exclude = NULL)

Is there a way to automate the process, so that the factors (levels and labels) are applied to the corresponding variables without my having to do every single one individually? I think it can be done with a function, probably using pmap.

My goal is to minimize the effort needed for this process. Is there a better way to prepare the labels and levels as well?

Help is much appreciated.


Solution

  • A data frame is not really an appropriate data structure for storing the factor level definitions: there’s no reason to expect all factors to have the same number of levels. Instead, I’d use a plain list and store the level information more compactly as named vectors, along these lines:

    df <- data.frame(
      Gender = c("2", "2", "1", "2"),
      AgeG = c("3", "1", "4", "2")
    )
    
    value_labels <- list(
      Gender = c("Male" = 1, "Female" = 2),
      AgeG = c("<25" = 1, "25-60" = 2, "65-80" = 3, ">80" = 4)
    )
    
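If the definitions already exist as paired `<name>_v` / `<name>_b` columns, as in the question, the list does not have to be typed by hand. The following is a minimal sketch under that naming assumption; `build_value_labels` and its suffix arguments are hypothetical names, not part of any package. Because `data.frame()` recycles shorter columns, duplicated level values are dropped.

```r
# Levels and labels stored as in the question: paired <name>_v / <name>_b columns.
# data.frame() recycles the two Gender values to fill four rows.
fct <- data.frame(
  Gender_v = c("1", "2"),
  Gender_b = c("Male", "Female"),
  AgeG_v = c("1", "2", "3", "4"),
  AgeG_b = c("<25", "25-60", "65-80", ">80")
)

# Hypothetical helper: turn the paired columns into a named list of
# named vectors (names = labels, values = levels).
build_value_labels <- function(fct, level_suffix = "_v", label_suffix = "_b") {
  level_cols <- grep(paste0(level_suffix, "$"), names(fct), value = TRUE)
  vars <- sub(paste0(level_suffix, "$"), "", level_cols)
  out <- lapply(vars, function(var) {
    lv <- as.character(fct[[paste0(var, level_suffix)]])
    lb <- as.character(fct[[paste0(var, label_suffix)]])
    keep <- !duplicated(lv) & !is.na(lv)  # drop recycled or padded rows
    setNames(lv[keep], lb[keep])
  })
  setNames(out, vars)
}

value_labels <- build_value_labels(fct)
str(value_labels)
```

The values here are character rather than numeric, which matches the character columns in `df` directly.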

    Then you can make a function that uses that data structure to make factors in a data frame:

    make_factors <- function(data, value_labels) {
      for (var in names(value_labels)) {
        if (var %in% colnames(data)) {
          vl <- value_labels[[var]]
          data[[var]] <- factor(
            data[[var]],
            levels = unname(vl),
            labels = names(vl)
          )
        }
      }
      data
    }
    
    make_factors(df, value_labels)
    #>   Gender  AgeG
    #> 1 Female 65-80
    #> 2 Female   <25
    #> 3   Male   >80
    #> 4 Female 25-60
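
The question mentions `pmap`; the same pairing of each column with its definition can also be written without an explicit loop using base R's `Map()`. This is a sketch of an alternative formulation rather than a replacement for the function above, and `make_factors_map` is a hypothetical name.

```r
df <- data.frame(
  Gender = c("2", "2", "1", "2"),
  AgeG = c("3", "1", "4", "2")
)

value_labels <- list(
  Gender = c("Male" = 1, "Female" = 2),
  AgeG = c("<25" = 1, "25-60" = 2, "65-80" = 3, ">80" = 4)
)

# Hypothetical loop-free variant: only variables present in both the data
# and the definitions are converted; all other columns are left untouched.
make_factors_map <- function(data, value_labels) {
  vars <- intersect(names(value_labels), colnames(data))
  data[vars] <- Map(
    function(column, vl) {
      factor(column, levels = unname(vl), labels = names(vl))
    },
    data[vars],
    value_labels[vars]
  )
  data
}

result <- make_factors_map(df, value_labels)
levels(result$AgeG)
```

Either version scales to 200 variables, since the work is driven by the names in `value_labels` rather than by one line of code per column.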