I'm looking to remove specific words (for example "CO", "INC", etc.) from a column in my data without removing the same letters from other words in the same column. In other words, I only want to remove these words when they are free-standing.
This is an example of what a few rows and columns of the company_name data look like:
State | Company Name | number of workers |
---|---|---|
x | COLGATE-PALMOLIVE CO. | 10 |
y | OLD COPPER CO INC | 77 |
z | NIKE INC -CL B | 5 |
r | COMMERCIAL METALS | 23 |
w | CARNIVAL CORPORATION & PLC | 89 |
I used the following code to remove the words:
remove <- company_name %>%
mutate(Company_Name = str_remove_all(Company_Name, "-CL B|INC|CORP|CO|CO."))
What I got in return was not what I was expecting. For example, for the company "CARNIVAL CORPORATION & PLC" I got "CARNIVAL ORATION & PLC" back, where "CORP" was removed from the beginning of "CORPORATION".
What I would like to achieve is for the words to be removed only if they are full words on their own. I also tried including spaces before and after the words in the code, as follows:
remove <- company_name %>%
mutate(Company_Name = str_remove_all(Company_Name, " -CL B | INC | CORP | CO | CO. "))
But I still don't get the results I'm looking for.
I think something like this will work:
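library(dplyr)
library(stringr)
# Keep "-CL B" as one entry so a free-standing "B" elsewhere is left alone
strings_to_remove <- c("-CL B", "INC", "CORP", "CO")
# Each word must follow the string start or whitespace, and be followed by whitespace or the string end
regex <- paste(paste0("(^|\\s+)", strings_to_remove, "\\.?", "(?=\\s+|$)"), collapse = "|")
remove <- company_name %>%
  mutate(Company_Name = str_remove_all(Company_Name, regex))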
where
"(^|\\s+)"
matches the beginning of the string (^) or one or more whitespace characters before the word to remove,
"\\.?"
matches an optional period, and
"(?=\\s+|$)"
is a lookahead that requires whitespace or the end of the string to follow the word, without including it in the match.
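To see why the trailing check is written as a lookahead rather than a plain group, here is a small sketch comparing the two on one of the names from the question. The variable names `consuming` and `lookahead` are just for illustration, and only "CO" and "INC" are used to keep the patterns short.

```r
library(stringr)

x <- "OLD COPPER CO INC"

# Consuming version: the trailing whitespace becomes part of the match,
# so it is no longer available as the leading whitespace of the next word.
consuming <- "(^|\\s+)(CO|INC)\\.?(\\s+|$)"

# Lookahead version: the trailing whitespace is only checked, not consumed.
lookahead <- "(^|\\s+)(CO|INC)\\.?(?=\\s+|$)"

str_remove_all(x, consuming)  # "OLD COPPERINC"
str_remove_all(x, lookahead)  # "OLD COPPER"
```

In the consuming version, " CO " takes the space in front of "INC", so "INC" no longer has whitespace before it and is left behind. This is also one reason the " CO | INC " attempt with literal spaces can miss adjacent words.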
(This answer assumes that you also want to remove "INC." and "CORP." though this wasn't specified in your question).
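For a quick check, the following is a minimal sketch that applies this approach to the sample rows from the question. The data frame is rebuilt by hand here, the column name `workers` and the result name `cleaned` are made up for the example, and "-CL B" is kept as a single entry on the assumption that a lone "B" should not be stripped from other names. `str_squish()` is added only to tidy any leftover whitespace.

```r
library(dplyr)
library(stringr)

# Sample rows reconstructed from the table in the question
company_name <- data.frame(
  State = c("x", "y", "z", "r", "w"),
  Company_Name = c(
    "COLGATE-PALMOLIVE CO.",
    "OLD COPPER CO INC",
    "NIKE INC -CL B",
    "COMMERCIAL METALS",
    "CARNIVAL CORPORATION & PLC"
  ),
  workers = c(10, 77, 5, 23, 89),
  stringsAsFactors = FALSE
)

# Assumption: "-CL B" is treated as one token rather than "-CL" and "B"
strings_to_remove <- c("-CL B", "INC", "CORP", "CO")

# One alternative per word: preceded by start/whitespace, optional ".",
# followed (lookahead) by whitespace or the end of the string
regex <- paste(
  paste0("(^|\\s+)", strings_to_remove, "\\.?", "(?=\\s+|$)"),
  collapse = "|"
)

cleaned <- company_name %>%
  mutate(Company_Name = str_squish(str_remove_all(Company_Name, regex)))

cleaned$Company_Name
# "COLGATE-PALMOLIVE" "OLD COPPER" "NIKE" "COMMERCIAL METALS"
# "CARNIVAL CORPORATION & PLC"
```

If any of the words to remove contained regex metacharacters (such as "." or "+"), they would need escaping before being pasted into the pattern; none of the entries above do.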