
Changing factor levels: unknown levels in "f", can't change levels


I have a factor containing many industry names that I need to collapse into major categories. Because respondents could type whatever they wanted, the number of levels is inflated (e.g. financial services, Financial Services, Banking, Finance). Since these variants don't match exactly, each one becomes its own level, so I'm trying to collapse them with forcats:

test <- fct_collapse(PrescreenF$Industry, Finance = c("Banking",
  "Corporate Finance", "Finance", "Financial", "financial services",
  "financial services", "Financial Services", "Financial services"),
  NULL = "H")

I get a warning saying that "Financial services" is unknown. This is extremely frustrating because when I print the vector, I can see that the value exists. I've tried copying and pasting the exact text from the output and retyping it by hand, but it seems there are hidden characters that prevent the level from being matched.

How do I properly collapse these values?

> test$industry
Banking
Corporate Finance 
Finance Financial 
financial services
financial services 
Financial Services 
Financial services

When I try to revalue, say, the last level, "Financial services", it tells me it's an unknown string.

EDIT: output of dput(x$industry)

structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 
4L, 3L, 3L, 3L, 5L, 7L, 8L, 9L, 10L, 11L, 12L, 12L, 13L, 14L, 
15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 
16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 
16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 17L, 18L, 18L, 18L, 
18L, 19L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 25L, 26L, 27L, 28L
), .Label = c("", "{\"ImportId\":\"QID8_TEXT\"}", "Finance", 
"Financial ", "Financial services ", "Please indicate the industry you work in (e.g. technology, healthcare etc):", 
"Cleantech", "Delivery", "e-commerce/fashion", "Food", "Food & Bev", 
"Retail", "Service", "tech", "technology", "Technology", "IT, technology", 
"Software", "Technology ", "Tehcnology", "Consulting", "Digital advertising", 
"Education", "Higher education", "Technology, management consulting", 
"University professor; teaching, research and service", "Information Technology and Services", 
"mobile technology"), class = "factor")

EDIT: Figured it out. Some of the levels had a trailing space. For example, when I printed PrescreenF$Industry, it returned names like "Banking" and "Corporate Finance", but it didn't show that there was a space after the level. Banking was actually "Banking ", with a trailing space that wasn't visible in the printed output. How can I make this visible and make sure it doesn't happen again?

Can I run a length function on a column? If so, how does that work? Something like PrescreenF$Industry("Banking")?


Solution

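  • If "x" is your data frame, trim the whitespace from the values and rebuild the factor:

    library(stringr)
    
    x$industry <- as.character(x$industry)  # convert the factor to plain strings
    x$industry <- str_trim(x$industry)      # strip leading and trailing whitespace
    x$industry <- as.factor(x$industry)     # rebuild the factor from the cleaned strings
    

    Then you can go back to fct_collapse() to simplify your factor levels.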
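
On the follow-up about making the hidden spaces visible and checking lengths, here is a minimal sketch in base R plus forcats. The sample factor is made up for illustration (it is not the asker's data), and `industry`, `has_pad`, `cleaned` and `collapsed` are hypothetical names. It assumes the only problem is leading or trailing whitespace: `levels()` prints each level in quotes so a trailing space shows up, `nchar()` gives the length of each level, and assigning the `trimws()` result back to `levels()` merges any levels that become identical.

```r
library(forcats)

# Hypothetical sample data: some levels carry a trailing space.
industry <- factor(c("Finance", "Financial ", "Financial services ",
                     "Technology", "Technology ", "tech"))

# 1. Make hidden whitespace visible. levels() prints each level in quotes,
#    and nchar() reports the number of characters in each level.
levels(industry)
nchar(levels(industry))

# Flag the levels that have leading or trailing whitespace.
has_pad <- grepl("^\\s+|\\s+$", levels(industry))
levels(industry)[has_pad]

# 2. Trim the levels in place. Assigning duplicated names to levels()
#    merges them, so "Technology " folds into "Technology".
cleaned <- industry
levels(cleaned) <- trimws(levels(cleaned))

# 3. Collapse the now-clean levels into broader categories.
collapsed <- fct_collapse(cleaned,
  Finance    = c("Finance", "Financial", "Financial services"),
  Technology = c("Technology", "tech")
)
levels(collapsed)
```

Trimming the levels directly avoids the round trip through character, but the stringr version above should give the same result. Case differences and typos such as "Tehcnology" are a separate issue and still need to be listed explicitly in fct_collapse().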