I can easily remove stop words using the tm package, but is there an easy way to remove specific phrases? I'd like to remove the phrase "good morning" without removing cases where "good" is not followed by "morning".
Example:
library(tm)
x <- "Good morning. Great question...I'd say we had a good time."
doc.vec <- VectorSource(x)
doc.corpus <- Corpus(doc.vec)
doc.corpus <- tm_map(doc.corpus, stripWhitespace)
doc.corpus <- tm_map(doc.corpus, removePunctuation)
doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
# Removing "good" as a single word also drops it from "good time", which is not what I want
doc.corpus <- tm_map(doc.corpus, removeWords, c(stopwords("english"), "good"))
dtm <- DocumentTermMatrix(doc.corpus, control=list())
inspect(dtm)
Just add the phrase "good morning" to the vector of words passed to removeWords, in place of "good":
doc.corpus <- tm_map(doc.corpus, removeWords, c(stopwords("english"), "good morning"))
If you inspect the dtm, you will see that only one "good" is left and there is no "morning".
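As far as I can tell, this works because removeWords collapses its word list into a single regular expression wrapped in word boundaries, so a multi-word entry is matched as a whole phrase. Here is a minimal base-R sketch of that idea. It is not tm's actual code, and remove_phrase is a hypothetical helper name used only for illustration. It assumes the text has already been lowercased and stripped of punctuation, and that the phrases contain no regex metacharacters.

```r
# Hypothetical helper (not part of tm): delete whole-phrase matches only.
# \\b marks a word boundary, so "good" on its own is left untouched.
remove_phrase <- function(text, phrases) {
  pattern <- sprintf("\\b(%s)\\b", paste(phrases, collapse = "|"))
  gsub(pattern, "", text, perl = TRUE)
}

# Illustrative input: already lowercased, punctuation removed.
x <- "good morning great question id say we had a good time"
cleaned <- remove_phrase(x, "good morning")
cleaned
```

The "good" in "good time" survives because the pattern only matches "good" when it is directly followed by " morning". The removal leaves extra whitespace behind, so in the tm pipeline you would probably want to run stripWhitespace after removeWords rather than before it.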