Tags: r, regex, fuzzy-comparison

How do I use regex in R to create a new column of canonicalized company names?


I have a data frame with a column of company names. I want to create a new column that holds a fuzzy/canonicalized version of each name (perhaps using regex to strip suffixes like "corporation", "inc", and "llc" and prefixes like "the").

name <- c("Microsoft", "Apple, Inc.", "Youtube, LLC", "Huffington Post")
companies <- data.frame(name)

I want companies$canonicalized_name to return

"microsoft", "apple", "youtube", "huffington post"

How can I write this regex pattern in R?


Solution

  • I don't know which rules should apply when normalizing your data, but if you just want to delete everything after the first comma and then convert the string to lower case (as in your example), you can do that with stringr::str_replace() and tolower():

    library(dplyr)
    library(stringr)

    name <- c("Microsoft", "Apple, Inc.", "Youtube, LLC", "Huffington Post")
    companies <- data.frame(name) %>%
      # Drop the first comma and everything after it, then lower-case the result
      dplyr::mutate(canonicalized_name = tolower(stringr::str_replace(name, ",.*", "")))
    
    companies
    #              name canonicalized_name
    # 1       Microsoft          microsoft
    # 2     Apple, Inc.              apple
    # 3    Youtube, LLC            youtube
    # 4 Huffington Post    huffington post
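
    The comma rule above only works when every suffix is preceded by a comma. If you would rather strip specific suffixes and a leading "the", as the question suggests, a base R sketch could look like the one below. The helper name canonicalize_name and the suffix list are assumptions made for illustration, not a complete set of legal suffixes, so extend the alternation to fit your data.

```r
# Hypothetical helper (not part of the original answer): lower-cases names,
# drops a leading "the", and strips an illustrative, non-exhaustive list of
# trailing legal suffixes.
canonicalize_name <- function(x) {
  x <- tolower(trimws(x))
  # Remove a leading "the" followed by whitespace.
  x <- sub("^the[[:space:]]+", "", x)
  # Remove a trailing suffix that is preceded by a comma and/or whitespace,
  # with an optional final period (e.g. ", inc." or " llc").
  x <- sub("[,[:space:]]+(incorporated|corporation|corp|inc|llc|ltd)\\.?$", "", x)
  trimws(x)
}

name <- c("Microsoft", "Apple, Inc.", "Youtube, LLC", "Huffington Post")
companies <- data.frame(name, stringsAsFactors = FALSE)
companies$canonicalized_name <- canonicalize_name(companies$name)
```

    Because the suffix pattern requires a comma or whitespace before the suffix and is anchored at the end of the string, a name that merely ends in those letters (for example "Zinc") is left alone.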