I want to stem each word. For example, 'hardworking employees' should be converted to 'hardwork employee', not 'hardworking employee'. In other words, it should stem both words separately. I know the example itself does not make much sense, but in reality I have medical terms for which this kind of stemming is meaningful.
I have a function that splits the text on the delimiter ',' and then performs stemming. I want it modified so that stemming is performed on every word within each ','-delimited phrase.
dt <- read.table(header = TRUE, stringsAsFactors = FALSE,
                 text = "Word Synonyms
employee 'hardworking employees, intelligent employees, employment, employee'
lover 'loved ones, loving boy, lover'
")
library(SnowballC)
library(parallel)

stem_text3 <- function(text, language = "english", mc.cores = 3) {
  stem_string <- function(str, language) {
    str <- strsplit(x = str, split = "\\,")
    str <- wordStem(unlist(str), language = language)
    paste(str, collapse = ",")
  }
  # stem each text block in turn, in parallel
  x <- mclapply(X = text, FUN = stem_string, language, mc.cores = mc.cores)
  # return stemmed text blocks
  unlist(x)
}
df000 <- data.frame(stringsAsFactors = FALSE)
for (i in 1:nrow(dt)) {
  sent <- dt[i, "Synonyms"]
  k <- data.frame(r_synonyms = stem_text3(sent, language = "english"),
                  stringsAsFactors = FALSE)
  df000 <- rbind(df000, k)
}
It's tricky because SnowballC::wordStem() stems each element of a character vector, so your character vectors need to be split and recombined before and after stemming. I'd dispense with the loops and use apply operations to vectorize it (and you could swap the sapply() for mclapply() if you still want the parallelism).
library("stringi")
dt[["Synonyms"]] <-
sapply(stri_split_fixed(dt[["Synonyms"]], ","), function(x) {
x <- lapply(stri_split_fixed(stri_trim_both(x), " "), function(y) {
paste(SnowballC::wordStem(y), collapse = " ")
})
paste(x, collapse = ", ")
})
dt
## Word Synonyms
## 1 employee hardwork employe, intellig employe, employ, employe
## 2 lover love on, love boi, lover
Notes:
First, these are probably not the stems you were expecting, but that's how the Porter stemmer works as implemented in SnowballC.
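To see why, you can stem the individual words directly; these are the stems the (default) Porter algorithm in SnowballC returns for the words from your example:

```r
library(SnowballC)

# stem each word individually with the default Porter stemmer
wordStem(c("hardworking", "employees", "employment", "loved", "ones", "boy"))
## [1] "hardwork" "employe"  "employ"   "love"     "on"       "boi"
```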
Second, there are better ways to structure this problem overall, but I can't really suggest one without knowing your objective in asking this question. For instance, to replace a set of phrases (using wildcards as a substitute for stemming), in quanteda you could do the following:
library("quanteda")
thedict <- dictionary(list(
employee = c("hardwork* employ*", "intellig* employ*", "employment", "employee*"),
lover = c("lov* ones", "lov* boy", "lover*")
))
tokens("Some employees are hardworking employees in useful employment.
They support loved osuch as their wives and lovers.") %>%
tokens_lookup(dictionary = thedict, exclusive = FALSE, capkeys = FALSE)
## tokens from 1 document.
## text1 :
## [1] "Some" "employee" "are" "employee" "in" "useful" "employee"
## [8] "." "They" "support" "loved" "osuch" "as" "their"
## [15] "wives" "and" "lover" "."