I have a messy factor variable with more levels than it should have. The data come from an open-ended survey question, and many participants made typos or gave the same answer in different ways.
This is a sample df that represents my problem:
df <- data.frame(ID=seq(1:10),
                 Nationality=c("espanol", "spaniol", "ESPANOL",
                               "spanish", "colombia", "Colombian",
                               "British", "brit", "ESPanol", "UK")
)
The output I would like is this:
> df
   ID Nationality
1   1     Spanish
2   2     Spanish
3   3     Spanish
4   4     Spanish
5   5   Colombian
6   6   Colombian
7   7     British
8   8     British
9   9     Spanish
10 10     British
This is what I have tried in order to reduce these 10 artificial levels of the factor to just 3 (Spanish, Colombian, British), as it should be:
library(forcats)
levels(df$Nationality) <- fct_collapse(df$Nationality,
  Spanish = c("espanol", "spaniol", "ESPANOL", "spanish", "ESPanol"),
  Colombian = c("colombia", "Colombian"),
  British = c("British", "brit", "UK")
)
This does reduce my Nationality factor to 3 levels, but the output looks like this, which is nothing like the desired output above:
> df
   ID Nationality
1   1   Colombian
2   2     British
3   3     British
4   4     Spanish
5   5     Spanish
6   6     Spanish
7   7     Spanish
8   8     Spanish
9   9   Colombian
10 10     British
In the bigger dataset I am working with, it does not work either, and the output is even worse: all cases become "Spanish", and I have no clue why this is happening.
Thanks in advance for any help! Best, Lucas
Have you tried making Nationality a factor first, and assigning the result of fct_collapse() to the column itself rather than to its levels()?
df <- data.frame(ID=seq(1:10),
                 Nationality=c("espanol", "spaniol", "ESPANOL",
                               "spanish", "colombia", "Colombian",
                               "British", "brit", "ESPanol", "UK")
)
library(dplyr)    # provides %>%, mutate() and across()
library(forcats)
# convert to factor, then collapse the variants into three levels
df2 <- df %>%
  mutate(Nationality = factor(Nationality)) %>%
  mutate(Nationality = fct_collapse(Nationality,
    Spanish = c("espanol", "spaniol", "ESPANOL", "spanish", "ESPanol"),
    Colombian = c("colombia", "Colombian"),
    British = c("British", "brit", "UK")))
# more concise: convert and collapse in a single step
df2 <- df %>%
  mutate(across(Nationality, ~ fct_collapse(factor(.),
    Spanish = c("espanol", "spaniol", "ESPANOL", "spanish", "ESPanol"),
    Colombian = c("colombia", "Colombian"),
    British = c("British", "brit", "UK")
  )))
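As far as I can tell, the scrambled output in the question comes from the left-hand side of the assignment rather than from fct_collapse() itself. fct_collapse() returns a whole factor with one value per row, while `levels(x) <- value` expects one new name per existing level, in level order. Passing the row-wise result there pairs the i-th (alphabetically sorted) level with the i-th row's value, which is unrelated to it. A minimal sketch of the same fix without dplyr, reusing the sample data from the question, assigns the result to the column directly:

```r
library(forcats)

# Sample data from the question
df <- data.frame(
  ID = 1:10,
  Nationality = c("espanol", "spaniol", "ESPANOL",
                  "spanish", "colombia", "Colombian",
                  "British", "brit", "ESPanol", "UK")
)

# Assign to the column itself, not to levels(df$Nationality)
df$Nationality <- fct_collapse(
  factor(df$Nationality),
  Spanish   = c("espanol", "spaniol", "ESPANOL", "spanish", "ESPanol"),
  Colombian = c("colombia", "Colombian"),
  British   = c("British", "brit", "UK")
)

# Count the rows per collapsed level
table(df$Nationality)
```

The explicit factor() call is only there so the sketch behaves the same whether data.frame() created a character or a factor column.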