I'm looking to remove all words and characters that appear after specific words (including that word) in a column.
This is what my data looks like.
Number.of.Workers | company_name | |
---|---|---|
5195 | 82 | valley ho hotels aka kings inn |
5196 | 82 | aluminum precision products |
5197 | 79 | levity of brea dba brea improv |
5198 | 79 | crunch |
5199 | 71 | comedy club of los angeles dba hollywood improv |
5200 | 65 | andre-boudin bakeries inc dba boudin |
A specific example what I like to accomplish is that I like to remove any word after "aka" and "dba" and the words "aka" and "dba" from my data.
structure(list(Number.of.Workers = c("82", "82", "79", "79",
"71", "65", "62", "58", "56", "53", "49"), company_name = c("valley ho hotels aka kings inn",
"aluminum precision products", "levity of brea dba brea improv",
"crunch", "comedy club of los angeles dba hollywood improv",
"andre-boudin bakeries inc dba boudin", "comedy club of san jose dba san jose improv",
"comedy club of brea dba ontario improv", "sprout bost ", "culver west lp - playa provisions",
"faa concord h dba concord honda")), row.names = 5195:5205, class = "data.frame")
.
You can use sub()
as follows:
df$company_name = sub("\\s+(aka|dba|\\(formerly.*[)])\\s+.*$", "", df$company_name)
output:
Number.of.Workers company_name
5195 82 valley ho hotels
5196 82 aluminum precision products
5197 79 levity of brea
5198 79 crunch
5199 71 comedy club of los angeles
5200 65 andre-boudin bakeries inc
5201 62 comedy club of san jose
5202 58 comedy club of brea
5203 56 sprout bost
5204 53 culver west lp - playa provisions
5205 49 faa concord h
Note: thanks to @Chris Ruehlemann for pointing out sub
vs gsub
. The difference is that the former replaces the first occurrence, while the latter replaces all.