Tags: r, string, data-cleaning

Remove specific words from a column in R


I'm looking to remove specific words (for example "co", "INC", etc.) from a column in my data without removing the same letters from other words in the same column. In other words, I only want to remove these words when they are free-standing.

This is an example of what a few rows and columns of the company_name data look like:

State  Company Name                 number of workers
x      COLGATE-PALMOLIVE CO.        10
y      OLD COPPER CO INC            77
z      NIKE INC -CL B               5
r      COMMERCIAL METALS            23
w      CARNIVAL CORPORATION & PLC   89

I used the following code to remove the words:

remove <- company_name %>% 
  mutate(Company_Name = str_remove_all(Company_Name, "-CL B|INC|CORP|CO|CO."))

What I got in return was not what I was expecting. For example, for the company "CARNIVAL CORPORATION & PLC" I got "CARNIVAL ORATION & PLC" back, where "CORP" was removed from the beginning of "CORPORATION".

What I would like to achieve is for the words to be removed only if they are full words on their own. I also tried including spaces before and after the words in the pattern, as follows:

remove <- company_name %>% 
  mutate(Company_Name = str_remove_all(Company_Name, " -CL B | INC | CORP | CO  | CO. "))

But I still don't get the results I'm looking for.


Solution

  • I think something like this will work:

    library(dplyr)
    library(stringr)

    # Tokens to strip when they appear as standalone words
    strings_to_remove <- c("-CL", "B", "INC", "CORP", "CO")
    regex <- paste(paste0("(^|\\s+)", strings_to_remove, "\\.?", "(?=\\s+|$)"), collapse = "|")
    remove <- company_name %>%
      mutate(Company_Name = str_remove_all(Company_Name, regex))
    

    where "(^|\\s+)" matches either the start of the string (^) or one or more whitespace characters before the word to remove,

    "\\.?" matches an optional trailing period, and

    "(?=\\s+|$)" is a lookahead that requires whitespace or the end of the string to follow the word, without consuming it.

    (This answer assumes that you also want to remove "INC." and "CORP." though this wasn't specified in your question).
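
To check the pattern without the full data frame, here is a minimal sketch that applies it to the five sample names from the question. It is an illustration, not part of the original answer: the vector name company_names is made up for this example, and base R's gsub() with perl = TRUE stands in for str_remove_all() so that no packages are needed (perl = TRUE is required for the lookahead).

```r
# Sample names copied from the question (company_names is a hypothetical name).
company_names <- c(
  "COLGATE-PALMOLIVE CO.",
  "OLD COPPER CO INC",
  "NIKE INC -CL B",
  "COMMERCIAL METALS",
  "CARNIVAL CORPORATION & PLC"
)

# Same pattern construction as in the answer above.
strings_to_remove <- c("-CL", "B", "INC", "CORP", "CO")
regex <- paste(
  paste0("(^|\\s+)", strings_to_remove, "\\.?", "(?=\\s+|$)"),
  collapse = "|"
)

# Assumption: gsub() replaces str_remove_all() here;
# perl = TRUE enables the (?=...) lookahead syntax.
cleaned <- gsub(regex, "", company_names, perl = TRUE)
print(cleaned)
```

On these names the result should be "COLGATE-PALMOLIVE", "OLD COPPER", "NIKE", "COMMERCIAL METALS" and "CARNIVAL CORPORATION & PLC": the standalone tokens are dropped, while "COPPER", "COMMERCIAL" and "CORPORATION" stay intact because the lookahead fails when a letter follows "CO" or "CORP".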