r dplyr recode

What is the easiest way to group/recode multiple categories into few categories?


I have a column with almost 100 string categories that I would like to group/recode into fewer categories, and I am trying to figure out the easiest way to do so. I thought about converting it to a factor or to numeric to make it easier to work with. The categories are not in any particular order, and I can't find a good way to recode them. Here is an example:

Suppose I have 15 string categories:

cat1 <- LETTERS[seq(1,15)]
df <- as.data.frame(cat1)

I converted it to numeric:

df$cat2 <- as.numeric(as.factor(df$cat1))

This is what I tried to do:

df <- df %>% mutate(cat3 = case_when(cat2 == c(1:5,7,9) ~ 1,
                                     cat2 == c(6,8,10,13) ~ 2,
                                     cat2 == (11:12,14:15) ~ 3))

I also tried:

df$cat3[df$cat2 == c(1:5, 7,9)] <- 1

I tried other approaches, but none of them worked. Suppose I want to group the values into the following new categories:

(1:5, 7,9) (6,8,10,13) (11:12,14:15)

What is the best way to do it?


Solution

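  • Your case_when call needs a small tweak: use %in% instead of ==. The == operator compares the two vectors element by element, recycling the shorter one, while %in% checks whether each value of cat2 belongs to the set: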

    library(dplyr)

    # Rows that match none of the conditions get NA
    df %>% mutate(cat3 = case_when(cat2 %in% c(1:5, 7, 9) ~ 1,
                                   cat2 %in% c(6, 8, 10, 13) ~ 2,
                                   cat2 %in% c(11:12, 14:15) ~ 3))
    

    You can also use the single-vector version, case_match (available from dplyr 1.1.0):

    df %>% mutate(cat3 = case_match(cat2, 
                                    c(1:5, 7, 9) ~ 1,
                                    c(6,8,10,13) ~ 2,
                                    c(11:12,14:15) ~ 3))
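
To see why == fails here, and how the same fix applies to the base R attempt from the question, here is a minimal sketch using only base R. It rebuilds the question's example data; the names eq_match and in_match are illustrative and not part of the original post.

```r
# Rebuild the example data from the question
df <- data.frame(cat1 = LETTERS[1:15])
df$cat2 <- as.numeric(as.factor(df$cat1))

# `==` recycles the 7-element vector along the 15 rows and compares
# position by position (R warns that the lengths do not divide evenly)
eq_match <- suppressWarnings(df$cat2 == c(1:5, 7, 9))

# `%in%` asks, for each row, whether its value appears anywhere in the set
in_match <- df$cat2 %in% c(1:5, 7, 9)

# Base R version of the recode, with %in% in place of ==
df$cat3 <- NA_real_
df$cat3[df$cat2 %in% c(1:5, 7, 9)] <- 1
df$cat3[df$cat2 %in% c(6, 8, 10, 13)] <- 2
df$cat3[df$cat2 %in% c(11:12, 14:15)] <- 3
```

With this data, eq_match is TRUE only for the first five rows, while in_match also picks up rows 7 and 9. Since %in% works on character vectors too, the same pattern can be applied to cat1 directly (for example df$cat1 %in% c("A", "B")), which would make the numeric conversion unnecessary.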