I am trying to combine a few words so that they count as one.
In this example I want "val" and "valuatin" to be counted as "valuation".
The code I have been using to try and do this is below:
#load in package
library(tm)
replaceWords <- function(x, from, keep){
  regex_pat <- paste(from, collapse = "|")
  gsub(regex_pat, keep, x)
}
oldwords <- c("val", "valuati")
newword <- c("valuation")
TextDoc2 <- tm_map(TextDoc, replaceWords, from=oldwords, keep=newword)
However, this does not work as expected. Any time "val" appears inside a word, it is replaced with "valuation". For example, "equivalent" becomes "equivaluation". How do I get around this and achieve my desired result?
Try this function -
replaceWords <- function(x, from, keep){
  regex_pat <- sprintf('\\b(%s)\\b', paste(from, collapse = '|'))
  gsub(regex_pat, keep, x)
}
The pattern "val" matches inside "equivalent". Adding word boundaries (\\b) stops that from happening.
grepl('val', 'equivalent')
#[1] TRUE
grepl('\\bval\\b', 'equivalent')
#[1] FALSE
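To see the effect on whole strings rather than single words, here is a minimal sketch that applies the same function to a plain character vector using base R only. The `sample_text` vector is a made-up illustration, not data from the question.

```r
# Same word-boundary approach as above, repeated here so the example runs on its own
replaceWords <- function(x, from, keep) {
  # Build a pattern such as \b(val|valuati)\b so only whole words match
  regex_pat <- sprintf("\\b(%s)\\b", paste(from, collapse = "|"))
  gsub(regex_pat, keep, x)
}

oldwords <- c("val", "valuati")
newword  <- "valuation"

# Hypothetical sample text (an assumption for illustration only)
sample_text <- c("the val is high",
                 "equivalent terms",
                 "valuati report",
                 "valuation done")

result <- replaceWords(sample_text, from = oldwords, keep = newword)
print(result)
```

Only the standalone words are replaced; "equivalent" and the already complete "valuation" are left alone. With a tm corpus, a custom function like this is usually wrapped in `content_transformer()` before being passed to `tm_map()`, for example `tm_map(TextDoc, content_transformer(replaceWords), from = oldwords, keep = newword)`, assuming `TextDoc` is a corpus as in the question.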