I have a data frame with a column of company names. I want to create a new column that holds a canonicalized version of each name (perhaps using regex to strip suffixes like "corporation", "inc", and "llc" and prefixes like "the").
name <- c("Microsoft", "Apple, Inc.", "Youtube, LLC", "Huffington Post")
companies <- data.frame(name)
I want companies$canonicalized_name to return
"microsoft", "apple", "youtube", "huffington post"
How can I write this regex pattern in R?
I don't know which rules should apply to normalize your data, but if you just want to delete the comma and everything after it, then convert the string to lower case (as in your example), you can do this with dplyr and stringr:
library(dplyr)  # provides mutate() and the %>% pipe
name <- c("Microsoft", "Apple, Inc.", "Youtube, LLC", "Huffington Post")
companies <- data.frame(name) %>%
  # ",.*" matches the first comma and everything after it
  mutate(canonicalized_name = tolower(stringr::str_replace(name, ",.*", "")))
#              name canonicalized_name
# 1       Microsoft          microsoft
# 2     Apple, Inc.              apple
# 3    Youtube, LLC            youtube
# 4 Huffington Post    huffington post
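If you also need to handle suffixes that are not preceded by a comma, or a leading "the", something along these lines might work. This is only a sketch in base R: the helper name canonicalize_name and the list of suffixes are my own assumptions, not anything from your data, so extend the alternation to fit the names you actually have.

```r
# Hypothetical helper: lower-case the name, drop a leading "the",
# and drop one trailing legal suffix (the suffix list is an assumption).
canonicalize_name <- function(x) {
  x <- tolower(trimws(x))
  # remove a leading "the " (only at the very start of the string)
  x <- sub("^the\\s+", "", x, perl = TRUE)
  # remove a trailing suffix preceded by commas/whitespace,
  # optionally followed by a period, e.g. ", inc." or " llc"
  x <- sub("[,\\s]+(incorporated|corporation|corp|inc|llc|ltd|co)\\.?$",
           "", x, perl = TRUE)
  trimws(x)
}

name <- c("Microsoft", "Apple, Inc.", "Youtube, LLC",
          "Huffington Post", "The Acme Corporation")
companies <- data.frame(name)
companies$canonicalized_name <- canonicalize_name(companies$name)
companies$canonicalized_name
# "microsoft" "apple" "youtube" "huffington post" "acme"
```

Because the suffix must be preceded by a comma or whitespace and anchored at the end with `$`, a name like "Costco" is left alone even though it ends in "co". Only one suffix is stripped per name, so something like "Foo Co., Ltd." would need the substitution applied repeatedly.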