How to make new dataframe columns from factor levels (& troubleshoot mutate error)


My searches on SO & elsewhere are coming up with interesting solutions to problems that have similar search terms but not my issue. Thought I found a solution, but the error is leaving me quite puzzled. I'm trying to learn tidyverse approaches better, but I appreciate any solution strategies.

Aim: Create new vector columns in a dataframe where each new vector is named from the factor level of an existing dataframe vector. The code solution should be dynamic so that it can be applied to factors with any number of levels.

Test data

df <- data.frame(x=c(1:5), y=letters[1:5], stringsAsFactors=TRUE) # stringsAsFactors=TRUE is needed in R >= 4.0 for y to be a factor

Which produces as expected

> str(df)
'data.frame':   5 obs. of  2 variables:
 $ x: int  1 2 3 4 5
 $ y: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
> df
  x y
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e

and when finished should look like

> df
  x y  a  b  c  d  e
1 1 a NA NA NA NA NA
2 2 b NA NA NA NA NA
3 3 c NA NA NA NA NA
4 4 d NA NA NA NA NA
5 5 e NA NA NA NA NA

Tidy for loop approach

library(tidyverse)

for (i in 1:length(levels(df$y))) {
  df <- mutate(df, levels(df$y)[i] = NA)
}

but that gives me the following error:

> for (i in 1:length(levels(df$y))) {
+   df <- mutate(df, levels(df$y)[i] = NA)
Error: unexpected '=' in:
"for (i in 1:length(levels(df$y))) {
  df <- mutate(df, levels(df$y)[i] ="
> }
Error: unexpected '}' in "}"

Troubleshooting, I removed the loop and simplified the mutate to see if it works in general, which it will with or without the quotation marks (note, I reran the test data to start fresh).

> levels(df$y)[1]
[1] "a"

df <- mutate(df, a = NA)
df <- mutate(df, "a" = NA) # works the same as the previous line
> df
  x y  a
1 1 a NA
2 2 b NA
3 3 c NA
4 4 d NA
5 5 e NA

Substituting the levels function back in, but without the loop returns the mutate error (note, I reran the test data to start fresh):

> df <- mutate(df, levels(df$y)[1] = NA)
Error: unexpected '=' in "df <- mutate(df, levels(df$y)[1] ="

I continue to get the same error if I try to use .data=df to specify the dataset, or if I wrap as.character(), paste(), or paste0() around the levels function, which I picked up from various other solutions online. Restructuring the code with the %>% pipe does not help either.

What about the equal sign is unexpected with my levels code substitution (and potential newb mistakes)? Any assistance is greatly appreciated!


Solution

  • Posting the solutions from the comments I received so that others can use them, and so I can mark this question as solved. Please upvote @arg0naut91 and @Gregor for their solutions & guided help.

    Test data

    df <- data.frame(x=c(1:5), y=letters[1:5], stringsAsFactors=TRUE) # stringsAsFactors=TRUE is needed in R >= 4.0 for y to be a factor
    

    Solution 1: base R

    @arg0naut91 provided an elegant base R solution:

    df[, levels(df$y)] <- NA
    df
      x y  a  b  c  d  e
    1 1 a NA NA NA NA NA
    2 2 b NA NA NA NA NA
    3 3 c NA NA NA NA NA
    4 4 d NA NA NA NA NA
    5 5 e NA NA NA NA NA
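
To reuse the base R approach on other data frames, it could be wrapped in a small function. This is only a sketch: the name add_level_columns and its arguments are made up for illustration, and it assumes none of the factor levels clash with existing column names (a clash would overwrite that column with NA).

```r
# Hypothetical helper: add one NA-filled column per level of a column.
add_level_columns <- function(data, col) {
  # factor() makes this work for character columns as well as factors
  lv <- levels(factor(data[[col]]))
  data[, lv] <- NA
  data
}

df <- data.frame(x = 1:5, y = letters[1:5], stringsAsFactors = TRUE)
df <- add_level_columns(df, "y")
names(df)
```

Because the levels are looked up inside the function, the same call works for a factor with any number of levels.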
    

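    Solution 2: using !! and :=

    @Gregor's guidance & useful links showed that some functions, including pretty much all of the tidyverse, do not evaluate their arguments the way we might expect. In an ordinary function call, whatever sits to the left of = must be a literal name or string, so R rejects levels(df$y)[1] = NA while parsing the line, before mutate() ever runs. The := operator accepts an expression on its left-hand side, and !! tells mutate() to use the value stored in that expression as the column name.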

    First test with a single new column:

    library(dplyr) # provides mutate()

    df <- data.frame(x=c(1:5), y=letters[1:5], stringsAsFactors=TRUE) # refresh test data
    
    varlevel <- levels(df$y)[1] # where level 1=a
    df <- mutate(df, !!varlevel := NA)
    rm(varlevel) # cleanup
    df
      x y  a
    1 1 a NA
    2 2 b NA
    3 3 c NA
    4 4 d NA
    5 5 e NA
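
To see that the original error is a syntax problem rather than anything mutate() does, the two calls can be passed to base R's parse() as plain text. This is a small sketch; the helper name parses is made up for illustration, and nothing here is evaluated, so dplyr does not need to be loaded.

```r
bad  <- 'mutate(df, levels(df$y)[1] = NA)'
good <- 'mutate(df, !!varlevel := NA)'

# Hypothetical helper: TRUE if the text is syntactically valid R, FALSE otherwise.
parses <- function(code) {
  tryCatch({ parse(text = code); TRUE }, error = function(e) FALSE)
}

parses(bad)   # FALSE: '=' needs a plain name or string on its left
parses(good)  # TRUE: ':=' is an operator, so any expression may sit on its left
```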
    

    Then put it into the for loop to capture each factor level as a new column:

    df <- data.frame(x=c(1:5), y=letters[1:5], stringsAsFactors=TRUE) # refresh test data
    
    for (i in seq_along(levels(df$y))) {
      varlevel <- levels(df$y)[i] # name of the column to create
      df <- mutate(df, !!varlevel := NA)
      rm(varlevel) # cleanup
    }
    df
      x y  a  b  c  d  e
    1 1 a NA NA NA NA NA
    2 2 b NA NA NA NA NA
    3 3 c NA NA NA NA NA
    4 4 d NA NA NA NA NA
    5 5 e NA NA NA NA NA
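
For completeness, the loop can probably be avoided in the tidyverse version as well by building a named list of NA values and splicing it into mutate() with !!!. This is a sketch rather than something taken from the comments; lv and new_cols are names I made up for illustration.

```r
library(dplyr)

df <- data.frame(x = 1:5, y = letters[1:5], stringsAsFactors = TRUE)

lv <- levels(df$y)

# One list element per level, each holding a single NA and named after the level
new_cols <- setNames(rep(list(NA), length(lv)), lv)

# !!! splices the list so mutate() sees a = NA, b = NA, and so on
df <- mutate(df, !!!new_cols)
names(df)
```

Each single NA is recycled to the number of rows, so the result should match the loop output above.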