Collapsing multiple factor levels of a (messy) character variable in R


I am struggling to collapse the many factor levels of one specific variable into only three levels in RStudio.

My starting point is a data.table with 250 variables and roughly 4,000 rows. For one factor variable I want to collapse its 75 levels into 3 levels. Moreover, 4 of the 75 levels should be ignored (or set to NA beforehand) since they contain controversial information. This factor variable is based on survey answers that also include individual free-text answers. Sometimes even the language differs. So, it's a bit messy.

I tried to collapse these 75 levels (or 71 levels if the respective observations are set to NA beforehand) into 3 in two different ways. However, R always shows a + instead of a > in the console and I can't run any other commands. Of course I can stop this by hitting Esc, but that does not get me the desired result.

This made-up example shows what I tried:

1) using the levels and list functions

levels(dt$x) <- list("No"=c("I don't allow anything", "..."), 
"Yes"= c("Number of visitors ,annual sales, sales growth, number of customers", "Net sales", "..."), 
"Maybe"=c("The CEO's approval is needed.", "To be discussed"))

2) using the forcats package

dt$x %>%
fct_collapse(No= c("I don't allow anything", "..."), 
Yes= c("Number of visitors ,annual sales, sales growth", "number of customers", "Net sales", "..."), 
Maybe=c("The CEO's approval is needed.", "To be discussed"))

I assume the problem arises due to how the original variable is structured. Does anyone have an idea how I could address that?

A big thank you upfront!

Best, Ilka


Solution

  • A friend of mine actually provided the answer. It's nothing to do with the data structure.

    This does the job:

    dt$x <- fct_collapse(dt$x,
      No = c(
        "I don't allow anything",
        "..."),
      Yes = c(
        "Number of visitors ,annual sales, sales growth",
        "number of customers",
        "Net sales",
        "..."),
      Maybe = c(
        "The CEO's approval is needed.",
        "To be discussed")
    )
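
    To make this easier to try out, here is a minimal sketch on a toy factor instead of the real `dt$x`. It assumes the forcats package is installed. The answer strings and the level "Ask my lawyer" are invented for illustration. "Ask my lawyer" stands in for one of the four levels that should be ignored, and `fct_recode()` with `NULL` is just one way to turn such a level into NA.

    ```r
    library(forcats)  # assumption: forcats is installed

    # Toy stand-in for dt$x; these answers are invented for illustration.
    x <- factor(c("I don't allow anything", "Net sales", "To be discussed",
                  "number of customers", "Ask my lawyer"))

    # "Ask my lawyer" plays the role of a level to ignore (hypothetical).
    # In fct_recode(), naming a level NULL turns its observations into NA.
    x <- fct_recode(x, NULL = "Ask my lawyer")

    # fct_collapse() returns a new factor, so the result has to be assigned.
    x <- fct_collapse(x,
      No    = c("I don't allow anything"),
      Yes   = c("Net sales", "number of customers"),
      Maybe = c("To be discussed")
    )

    levels(x)                  # the three collapsed levels
    table(x, useNA = "ifany")  # counts per level, plus the NA from the ignored level
    ```

    Note the assignment back to `x`: without it, the collapsed factor is only printed and the original variable stays unchanged.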
    

    I still don't know why the first option I posted above doesn't work, though (it worked perfectly well with another variable).
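
    A possible explanation, which nothing in this post confirms: the `+` prompt means R is still waiting for the expression to be completed, typically because of an unclosed quote or parenthesis. With 75 free-text levels typed out by hand, one missing closing quote would be enough. The sketch below uses base R's `parse()` to check a snippet before running it. `is_complete` is a hypothetical helper name and both strings are made up.

    ```r
    # Returns TRUE if `code` parses as R code, and FALSE if parse() fails.
    # A failure covers both incomplete input (the `+` prompt case) and
    # other syntax errors.
    is_complete <- function(code) {
      tryCatch({
        parse(text = code)
        TRUE
      }, error = function(e) FALSE)
    }

    # An apostrophe inside a double-quoted string is harmless.
    ok <- 'c("I don\'t allow anything", "Net sales")'

    # The closing quote after `Net sales` is missing, so the string never ends.
    bad <- 'c("I don\'t allow anything", "Net sales)'

    is_complete(ok)
    is_complete(bad)
    ```

    Running the real `levels(dt$x) <- list(...)` call through such a check, or pasting it into a script editor with syntax highlighting, should show whether a stray quote is the culprit.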