Combine two words in a corpus using R


I'm trying to combine two words into one using the content_transformer function from the tm package in R.

For example, I have location data, and to create word clouds I need to combine names such as "san jose", "san diego", and "san francisco"; otherwise "san" comes up as the most frequent word.

So far I have only managed to create a function for a single replacement, for example:

combineUK <- content_transformer(function(x, pattern)
  gsub(pattern, "UK", x, ignore.case = TRUE))

However, creating a separate function for each town is unrealistic.

I was wondering whether there's any way to use the paste() function within content_transformer?

Perhaps I'm missing something obvious.


Solution

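  • Since you did not provide a full reproducible example (one that can be copied, pasted, and run), I can't tell exactly what you have and what you want. However, consider this example, which joins "san" to the word that follows it and draws a word cloud before and after: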

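    library(tm)
    library(wordcloud)
    par(mfrow = c(2, 1), cex = 0.5)  # two stacked plots: before and after
    txt <- c("hello san jose dudes", "welcome to san diego", "Did you like san francisco")
    corp <- Corpus(VectorSource(txt))
    wordcloud(corp, min.freq = 1)  # "san" shows up as its own word
    # Join "san" and the word after it; \\b and \\s+ keep words such as "santa" intact
    corp <- tm_map(corp, content_transformer(function(x) gsub("\\b(san)\\s+(\\w+)", "\\1\\2", x, ignore.case = TRUE)))
    wordcloud(corp, min.freq = 1)  # "san" is now joined to the following word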
    

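If the place names do not all share a prefix such as "san", one option is to keep them in a character vector and build each pattern with paste(), as the question suggests. The following is only a sketch in base R: join_phrases and the places vector are hypothetical names introduced for illustration, and it assumes the phrases contain plain words without regex metacharacters.

```r
# Hypothetical helper (not part of tm): join each listed multi-word phrase
# into one token by removing the whitespace between its words.
join_phrases <- function(x, phrases) {
  for (p in phrases) {
    words <- strsplit(p, "\\s+")[[1]]
    # paste() builds the regex: whole words, any whitespace run between them
    pattern <- paste0("\\b", paste(words, collapse = "\\s+"), "\\b")
    # paste() also builds the joined token, e.g. "sanjose"
    replacement <- paste(words, collapse = "")
    x <- gsub(pattern, replacement, x, ignore.case = TRUE, perl = TRUE)
  }
  x
}

# Illustrative data only
places <- c("san jose", "san diego", "san francisco")
txt <- c("hello san jose dudes", "welcome to San Diego", "Did you like san francisco")
joined <- join_phrases(txt, places)
print(joined)
```

Because matching ignores case, every match is replaced by the lower-case joined form of the phrase as written in places. With tm loaded, the same helper could presumably be applied to a corpus as tm_map(corp, content_transformer(join_phrases), phrases = places), relying on tm_map passing extra arguments through to the transformer.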