Tags: r, text-mining, tm

R dictionary: create a many-to-one mapping


Consider the following MWE from a text-mining exercise using the R package tm. Toyota has several SUV models in the US: models <- c("highlander", "land cruiser", "rav4", "sequoia", "4runner"). The general media refers to these not as "toyota rav4" but simply as "rav4" (the corpus has already been converted to lower case). To get a single column for Toyota SUVs in a DocumentTermMatrix, I need to convert all of these model names into one generic term, "toyota_suv".

What I am doing now is repeating mycorpus <- tm_map(mycorpus, gsub, pattern = "rav4", replacement = "toyota_suv") once for each element of models. A hack would be to set up model_names <- rep("toyota_suv", length(models)) and get on with life. How can I set up a dictionary with a many-to-one mapping, so that all models are replaced with "toyota_suv" in one expression? Many thanks.


Solution

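  • You can use a vectorized substitution function. The stringi package offers one in its stri_replace_all family of functions. Here, I'm using stri_replace_all_fixed, which matches each pattern literally; adjust case sensitivity and other options as needed. Setting vectorize_all = FALSE makes stringi apply every pattern to every document, with the single replacement recycled across all patterns, instead of pairing patterns with documents element by element.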

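    library(tm)
    library(stringi)

    toyota_suvs <- c("highlander", "land cruiser", "rav4", "sequoia", "4runner")

    # content_transformer() keeps the result a valid corpus; recent tm versions
    # need it when a custom function is applied to a VCorpus.
    # vectorize_all = FALSE replaces every pattern in every document.
    toyCorp <- tm_map(toyCorp, content_transformer(stri_replace_all_fixed),
        pattern = toyota_suvs, replacement = "toyota_suv",
        vectorize_all = FALSE)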
    

    Data (run this first, so that toyCorp exists before the tm_map call):

    toyExample <- c("you don't know about the rav4, John Snow",
        "the highlander is a great car",
        "I want a land cruiser")
    
    toyCorp <- Corpus(VectorSource(toyExample))
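
If you want something closer to a literal "dictionary", here is a minimal base-R sketch of the same many-to-one idea. It swaps stringi for gsub with a single alternation pattern per generic term. The names dictionary and apply_dictionary are made up for illustration and are not part of tm or stringi; the word-boundary matching is also an assumption about what you want.

```r
# Dictionary: each name is the generic term, each value is the vector of
# specific terms that should collapse into it (many-to-one).
dictionary <- list(
  toyota_suv = c("highlander", "land cruiser", "rav4", "sequoia", "4runner")
)

# Hypothetical helper: replace every dictionary term found in the character
# vector `x` with the generic name it maps to.
apply_dictionary <- function(x, dictionary) {
  for (generic in names(dictionary)) {
    terms <- dictionary[[generic]]
    # Put longer terms first so they win over any shorter term they contain.
    terms <- terms[order(nchar(terms), decreasing = TRUE)]
    # \Q...\E quotes each term literally; \b restricts matches to whole words.
    pattern <- paste0("\\b(", paste0("\\Q", terms, "\\E", collapse = "|"), ")\\b")
    x <- gsub(pattern, generic, x, perl = TRUE)
  }
  x
}

toyExample <- c("you don't know about the rav4, John Snow",
                "the highlander is a great car",
                "I want a land cruiser")

apply_dictionary(toyExample, dictionary)
```

Further generic terms can be added as extra named entries in the list. To use the helper on a corpus, something like tm_map(toyCorp, content_transformer(apply_dictionary), dictionary = dictionary) should work, since tm_map passes the extra argument through to the function.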