
Convert character vector to factor and set levels using apply


I have written a function that creates a new data.frame with character vectors. I would like to convert those character vectors to factors, taking the levels from the original data.frame object, which has the same variable names and labels.

I have written a for loop that works, but I want to use apply since it is faster. I need this because the user specifies which variables go into the new data.frame, and how many of them there are.

Here is the for loop:

for (x in seq_along(group_labs)) {
  # use [[ ]] to extract the column itself; [ ] would return a one-column data.frame
  new[[group_labs[x]]] <- factor(
    new[[group_labs[x]]],
    levels = levels(old[[group_labs[x]]])
  )
}

Here is some fake data to show that this works:

# create fake data
new <- data.frame(
  var1 = c("a", "c", "b", "c", "b", "a"),
  var2 = c("z", "y", "y", "z", "x", "y"),
  var3 = c(1, 3, 4, 2, 1, 4),
  stringsAsFactors = FALSE
)

# recreate the fake data as the old data
old <- new

# make var1 and var2 in the old one factors manually
old$var1 <- factor(new$var1, levels = c("a", "b", "c"))
old$var2 <- factor(new$var2, levels = c("x", "y", "z"))

# set the reference variables
group_labs = c("var1", "var2")

# run for loop that automatically converts the variables to factors
# ([[ ]] extracts the column vector, so factor() and levels() see the data)
for (x in seq_along(group_labs)) {
  new[[group_labs[x]]] <- factor(
    new[[group_labs[x]]],
    levels = levels(old[[group_labs[x]]])
  )
}

# check it worked: both columns should now be factors
class(new$var1)
class(new$var2)

Is there a way to use apply instead of a for loop?

Thanks in advance!


Solution

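  • The lapply approach in your answer can be improved further. First, put the column names and their levels together in a named list. Next, loop over the names rather than over the elements themselves. Finally, include a check with a warning so that no column is added unnoticed: a factor is created, with the levels stored in the list element, only if the name exists as a column in new; otherwise the function warns and returns NULL for that name. The function uses the native pipe placeholder _, which needs R 4.2 or later, and fct_data is defined under Usage below.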

    > factorizer <- \(df, lst) {
    +   lapply(names(lst) |> setNames(nm=_), \(x) {
    +     if (!x %in% names(df)) {
    +       warning(sprintf('%s not found', sQuote(x)), call.=FALSE)
    +       NULL
    +     } else {
    +       factor(df[[x]], lst[[x]])
    +     }
    +   })
    + }
    >
    > factorizer(new, fct_data)
    $var1
    [1] a c b c b a
    Levels: c b a
    
    $var2
    [1] z y y z x y
    Levels: x y z
    
    $foo
    NULL
    
    Warning message:
    ‘foo’ not found
    

    The foo=1:3 entry, which represents a typo or an otherwise non-existent column, does not affect the result. Although the function returns an element “foo”, that element is NULL, so nothing is assigned to the original data.

    Usage

    > fct_data <- list(var1=c("c", "b", "a"), var2=c('x', 'y', 'z'), foo=1:3)
    > 
    > new[names(fct_data)] <- factorizer(new, fct_data)
    Warning message:
    ‘foo’ not found 
    

    The result:

    > str(new)
    'data.frame':   6 obs. of  3 variables:
     $ var1: Factor w/ 3 levels "c","b","a": 3 1 2 1 2 3
     $ var2: Factor w/ 3 levels "x","y","z": 3 2 2 3 1 2
     $ var3: num  1 3 4 2 1 4
    

    Data:

    > dput(new)
    structure(list(var1 = c("a", "c", "b", "c", "b", "a"), var2 = c("z", 
    "y", "y", "z", "x", "y"), var3 = c(1, 3, 4, 2, 1, 4)), class = "data.frame", row.names = c(NA, 
    -6L))
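
To tie this back to the question, the list of levels does not have to be typed by hand; it can be read from old. The following is a minimal sketch under that assumption, not a replacement for factorizer above. The names keep, dropped and lvls are hypothetical helpers introduced here, and Map is used in place of lapply so that each column is paired with its own levels.

```r
# Fake data from the question
new <- data.frame(
  var1 = c("a", "c", "b", "c", "b", "a"),
  var2 = c("z", "y", "y", "z", "x", "y"),
  var3 = c(1, 3, 4, 2, 1, 4),
  stringsAsFactors = FALSE
)
old <- new
old$var1 <- factor(old$var1, levels = c("a", "b", "c"))
old$var2 <- factor(old$var2, levels = c("x", "y", "z"))

# Variables chosen by the user
group_labs <- c("var1", "var2")

# Keep only names present in both data frames; warn about the rest
# (keep and dropped are hypothetical helper names)
keep <- group_labs[group_labs %in% names(new) & group_labs %in% names(old)]
dropped <- setdiff(group_labs, keep)
if (length(dropped) > 0) {
  warning("not found: ", paste(sQuote(dropped), collapse = ", "), call. = FALSE)
}

# Named list of levels, one element per column, read from old
lvls <- lapply(old[keep], levels)

# Pair each column of new with its levels and convert it to a factor
new[keep] <- Map(function(col, lev) factor(col, levels = lev), new[keep], lvls)

str(new)
```

Filtering the names before the loop means no NULL elements are ever produced, so the assignment only touches columns that already exist in new.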