Consider the following MWE from a text mining exercise, using the R package tm:
Toyota has several SUV models in the US:
models <- c("highlander", "land cruiser", "rav4", "sequoia", "4runner")
The general media refers to these not as "toyota rav4" but simply as "rav4" (the corpus has already been transformed to lower case). To get a single column of Toyota SUVs in a DocumentTermMatrix, I need to convert all these model names into one generic "toyota_suv". What I am doing now is repeating
mycorpus <- tm_map(mycorpus, gsub, pattern = "rav4", replacement = "toyota_suv")
once for each element of models. A hack would be to set up
model_names <- rep("toyota_suv", length(models))
and get on with life. How can I set up a dictionary with a many-to-one mapping, so that all models are replaced with "toyota_suv" in one expression? Many thanks.
You can use a vectorized substitution function. The stringi package offers one in its stri_replace_all family of functions. Here, I'm using stri_replace_all_fixed, but adjust case sensitivity and other options as needed.
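library(tm)
library(stringi)
toyota_suvs <- c("highlander", "land cruiser", "rav4", "sequoia", "4runner")
# vectorize_all = FALSE applies every pattern to every document in turn.
# content_transformer() wraps the plain string function so that tm_map
# keeps the documents as corpus documents (needed for a VCorpus in tm >= 0.6).
tm_map(toyCorp, content_transformer(stri_replace_all_fixed),
       pattern = toyota_suvs, replacement = "toyota_suv",
       vectorize_all = FALSE)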
data:
toyExample <- c("you don't know about the rav4, John Snow",
                "the highlander is a great car",
                "I want a land cruiser")
toyCorp <- Corpus(VectorSource(toyExample))
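If more than one group of terms needs collapsing, the same call can be driven from a small dictionary. The following is only a sketch on a plain character vector: the `dictionary` list, the `honda_suv` entry, and the `texts` vector are made up for illustration and are not part of the question. It relies on `stri_replace_all_fixed` pairing each pattern with the replacement at the same position when `vectorize_all = FALSE` and both vectors have equal length.

```r
library(stringi)

# Hypothetical many-to-one dictionary: each name is the generic term,
# each value holds the variants that should collapse into it.
dictionary <- list(
  toyota_suv = c("highlander", "land cruiser", "rav4", "sequoia", "4runner"),
  honda_suv  = c("cr-v", "pilot")  # made-up second group, for illustration only
)

# Flatten into two parallel vectors: one pattern per variant,
# and the generic term repeated once for each of its variants.
patterns     <- unlist(dictionary, use.names = FALSE)
replacements <- rep(names(dictionary), lengths(dictionary))

# Illustrative lower-cased documents
texts <- c("you don't know about the rav4, john snow",
           "the highlander is a great car",
           "i want a land cruiser",
           "the pilot seats eight")

# Each pattern is replaced by the replacement at the same position
replaced <- stri_replace_all_fixed(texts, patterns, replacements,
                                   vectorize_all = FALSE)
print(replaced)
```

Fixed matching has no notion of word boundaries, so a short variant such as "pilot" would also match inside "autopilot". If that matters for your corpus, `stri_replace_all_regex` with `\\b` around each escaped pattern is one option.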