Problem collapsing levels of a factor in R


I have a messy factor variable with more levels than it should have. The responses come from an open-ended survey question, and many participants made typos or gave the same answer in different ways.

This is a sample df that represents my problem:


df <- data.frame(ID=seq(1:10),
                 Nationality=c("espanol", "spaniol", "ESPANOL",
                               "spanish", "colombia", "Colombian",
                               "British", "brit", "ESPanol", "UK")
                               )

The output I would like is this:

> df
   ID Nationality
1   1     Spanish
2   2     Spanish
3   3     Spanish
4   4     Spanish
5   5   Colombian
6   6   Colombian
7   7     British
8   8     British
9   9     Spanish
10 10     British

This is what I have tried in order to reduce these 10 artificial levels of the factor to just the 3 it should have (Spanish, Colombian, British):

library(forcats) 
                              
levels(df$Nationality) <- fct_collapse(df$Nationality, Spanish = c("espanol", "spaniol", "ESPANOL",
                                                                  "spanish", "ESPanol"),
                                                       Colombian = c("colombia", "Colombian"),
                                                       British = c("British", "brit", "UK")
                                        )

This does reduce my "Nationality" factor to 3 levels, but the output looks like this, which is nothing like the desired output above:

> df
   ID Nationality
1   1   Colombian
2   2     British
3   3     British
4   4     Spanish
5   5     Spanish
6   6     Spanish
7   7     Spanish
8   8     Spanish
9   9   Colombian
10 10     British

In the bigger dataset I am working with, it does not work either, and the output is even worse: all cases become "Spanish", and I have no clue why this is happening.

Thanks in advance for any help! Best, Lucas


Solution

  • Have you tried making Nationality a factor first, and then assigning the result of fct_collapse() to the column itself rather than to levels()?

    library(dplyr)    # provides %>% and mutate()
    library(forcats)  # provides fct_collapse()

    df <- data.frame(ID = 1:10,
                     Nationality = c("espanol", "spaniol", "ESPANOL",
                                     "spanish", "colombia", "Colombian",
                                     "British", "brit", "ESPanol", "UK"))

    # Convert to a factor first, then collapse the levels and store the result in the column
    df2 <- df %>%
      mutate(Nationality = factor(Nationality)) %>%
      mutate(Nationality = fct_collapse(Nationality,
                                        Spanish = c("espanol", "spaniol", "ESPANOL", "spanish", "ESPanol"),
                                        Colombian = c("colombia", "Colombian"),
                                        British = c("British", "brit", "UK")))
    
    
    
    # More concise: convert to a factor and collapse in a single step
    df2 <- df %>%
      mutate(across(Nationality, ~ fct_collapse(factor(.),
        Spanish = c("espanol", "spaniol", "ESPANOL", "spanish", "ESPanol"),
        Colombian = c("colombia", "Colombian"),
        British = c("British", "brit", "UK")
      )))
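
As for why the attempt in the question scrambles the rows: fct_collapse() returns a complete recoded factor with one value per row, not a vector of level names. Assigning that to levels(df$Nationality) appears to relabel the ten original (alphabetically sorted) levels with the ten row values, which would explain the mixed-up output. A minimal sketch without dplyr, assuming the column is already a factor (stringsAsFactors = TRUE is set here only to mimic the default before R 4.0):

```r
library(forcats)

df <- data.frame(
  ID = 1:10,
  Nationality = c("espanol", "spaniol", "ESPANOL",
                  "spanish", "colombia", "Colombian",
                  "British", "brit", "ESPanol", "UK"),
  stringsAsFactors = TRUE  # assumption: the column is a factor, as under R < 4.0
)

# Assign the collapsed factor to the column itself, not to levels()
df$Nationality <- fct_collapse(
  df$Nationality,
  Spanish   = c("espanol", "spaniol", "ESPANOL", "spanish", "ESPanol"),
  Colombian = c("colombia", "Colombian"),
  British   = c("British", "brit", "UK")
)

# One value per row, in the original row order
collapsed <- as.character(df$Nationality)
print(df)
```

If the same thing happens in your larger dataset, checking levels(df$Nationality) before and after the call should show whether the labels were reassigned by position.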