Search code examples
rstringdata-cleaning

Remove all words after a specific word from a column


I'm looking to remove all words and characters that appear after specific words (including that word) in a column.

This is what my data looks like.

Number.of.Workers company_name
5195 82 valley ho hotels aka kings inn
5196 82 aluminum precision products
5197 79 levity of brea dba brea improv
5198 79 crunch
5199 71 comedy club of los angeles dba hollywood improv
5200 65 andre-boudin bakeries inc dba boudin

A specific example what I like to accomplish is that I like to remove any word after "aka" and "dba" and the words "aka" and "dba" from my data.

structure(list(Number.of.Workers = c("82", "82", "79", "79", 
"71", "65", "62", "58", "56", "53", "49"), company_name = c("valley ho hotels aka kings inn", 
"aluminum precision products", "levity of brea  dba brea improv", 
"crunch", "comedy club of los angeles  dba hollywood improv", 
"andre-boudin bakeries inc   dba boudin", "comedy club of san jose  dba san jose improv", 
"comedy club of brea  dba ontario improv", "sprout bost ", "culver west lp - playa provisions", 
"faa concord h dba concord honda")), row.names = 5195:5205, class = "data.frame")

.


Solution

  • You can use sub() as follows:

    df$company_name = sub("\\s+(aka|dba|\\(formerly.*[)])\\s+.*$", "", df$company_name)
    

    output:

         Number.of.Workers                      company_name
    5195                82                  valley ho hotels
    5196                82       aluminum precision products
    5197                79                    levity of brea
    5198                79                            crunch
    5199                71        comedy club of los angeles
    5200                65         andre-boudin bakeries inc
    5201                62           comedy club of san jose
    5202                58               comedy club of brea
    5203                56                      sprout bost 
    5204                53 culver west lp - playa provisions
    5205                49                     faa concord h
    

    Note: thanks to @Chris Ruehlemann for pointing out sub vs gsub. The difference is that the former replaces the first occurrence, while the latter replaces all.