The corpus package accepts a custom stemming function. Given a term as input, the function should return the stem of that term as output.
From "Stemming Words" I took the following example, which uses the hunspell
dictionary to do the stemming.
First I define the sentences on which to test this function:
sentences <- c("The color blue neutralizes orange yellow reflections.",
               "Zod stabbed me with blue Kryptonite.",
               "Because blue is your favourite colour.",
               "Red is wrong, blue is right.",
               "You and I are going to yellowstone.",
               "Van Gogh looked for some yellow at sunset.",
               "You ruined my beautiful green dress.",
               "You do not agree.",
               "There's nothing wrong with green.")
The custom stemming function is:
stem_hunspell <- function(term) {
  # look up the term in the dictionary
  stems <- hunspell::hunspell_stem(term)[[1]]
  if (length(stems) == 0) { # if there are no stems, use the original term
    stem <- term
  } else { # if there are multiple stems, use the last one
    stem <- stems[[length(stems)]]
  }
  stem
}
This code
library(corpus)
sentences <- text_tokens(sentences, stemmer = stem_hunspell)
produces:
> sentences
[[1]]
[1] "the" "color" "blue" "neutralize" "orange" "yellow"
[7] "reflection" "."
[[2]]
[1] "zod" "stabbed" "me" "with" "blue" "kryptonite"
[7] "."
[[3]]
[1] "because" "blue" "i" "your" "favourite" "colour"
[7] "."
[[4]]
[1] "re" "i" "wrong" "," "blue" "i" "right" "."
[[5]]
[1] "you" "and" "i" "are" "go"
[6] "to" "yellowstone" "."
[[6]]
[1] "van" "gogh" "look" "for" "some" "yellow" "at" "sunset" "."
[[7]]
[1] "you" "ruin" "my" "beautiful" "green" "dress"
[7] "."
[[8]]
[1] "you" "do" "not" "agree" "."
[[9]]
[1] "there" "nothing" "wrong" "with" "green" "."
After stemming I would like to apply other operations to the text, e.g. removing stop words. However, when I applied the tm function:
removeWords(sentences,stopwords)
to my sentences, I obtained the following error:
Error in UseMethod("removeWords", x) :
no applicable method for 'removeWords' applied to an object of class "list"
If I use
unlist(sentences)
I don't get the desired result, because I end up with a single character vector
of 65 elements. The desired result should be (e.g. for the first sentence):
"the color blue neutralize orange yellow reflection."
If you want to remove stopwords from each sentence, you could use lapply:
library(tm)
lapply(sentences, removeWords, stopwords())
#[[1]]
#[1] "" "color" "blue" "neutralize" "orange" "yellow" "reflection" "."
#[[2]]
#[1] "zod" "stabbed" "" "" "blue" "kryptonite" "."
#...
#...
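Note that removeWords leaves an empty string where each stop word used to be. If those placeholders are unwanted, they can be filtered out afterwards. A minimal base R sketch, assuming the result of the call above has been stored in a variable (the name no_stop and the hard-coded values, copied from the output above, are only for illustration):

```r
# Result of the lapply() call for the first two sentences,
# copied from the output shown above (illustrative data)
no_stop <- list(
  c("", "color", "blue", "neutralize", "orange", "yellow", "reflection", "."),
  c("zod", "stabbed", "", "", "blue", "kryptonite", ".")
)

# Keep only the non-empty tokens in each sentence
no_stop <- lapply(no_stop, function(x) x[nzchar(x)])
no_stop
```

nzchar() returns TRUE for every non-empty string, so the subsetting drops exactly the "" entries and leaves punctuation tokens in place.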
However, from your expected output it looks like you want to paste the text together.
lapply(sentences, paste0, collapse = " ")
#[[1]]
#[1] "the color blue neutralize orange yellow reflection ."
#[[2]]
#[1] "zod stabbed me with blue kryptonite ."
#....
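The pasted strings still have a space before the final period, while the expected output in the question does not. One possible way to close that gap, sketched in base R with tokens copied from the question's output (the variable names tokens and joined are illustrative, and the regular expression only handles "." and ","):

```r
# Stemmed tokens for two of the sentences, copied from the question (illustrative data)
tokens <- list(
  c("the", "color", "blue", "neutralize", "orange", "yellow", "reflection", "."),
  c("re", "i", "wrong", ",", "blue", "i", "right", ".")
)

# Collapse each token vector into one string; vapply returns a character
# vector rather than a list
joined <- vapply(tokens, paste, character(1), collapse = " ")

# Remove the space that paste() leaves in front of "." and ","
joined <- gsub(" ([.,])", "\\1", joined)
joined
```

Using vapply instead of lapply gives a plain character vector with one element per sentence, which is usually easier to pass on to other text functions.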