I took the following custom stemming function from Stemming Words:
stem_hunspell <- function(term) {
  # look up the term in the dictionary
  stems <- hunspell::hunspell_stem(term)[[1]]
  if (length(stems) == 0) { # if there are no stems, use the original term
    stem <- term
  } else { # if there are multiple stems, use the last one
    stem <- stems[[length(stems)]]
  }
  stem
}
It uses the hunspell dictionary to do the stemming (package corpus).
I tried this function on the following sentences.
sentences<-c("We're taking proactive steps to tackle ...",
"A number of measures we are taking to support ...",
"We caught him committing an indecent act.")
And then I performed the following operations:
library(qdap)
library(tm)
sentences <- iconv(sentences, "latin1", "ASCII", sub="")
sentences <- gsub('http\\S+\\s*', '', sentences)
sentences <- bracketX(sentences,bracket='all')
sentences <- gsub("[[:punct:]]", "",sentences)
sentences <- removeNumbers(sentences)
sentences <- tolower(sentences)
# Stemming
library(corpus)
stem_hunspell <- function(term) {
  # look up the term in the dictionary
  stems <- hunspell::hunspell_stem(term)[[1]]
  if (length(stems) == 0) { # if there are no stems, use the original term
    stem <- term
  } else { # if there are multiple stems, use the last one
    stem <- stems[[length(stems)]]
  }
  stem
}
sentences <- text_tokens(sentences, stemmer = stem_hunspell)
sentences <- lapply(sentences, removeWords, stopwords('en'))
sentences <- lapply(sentences, stripWhitespace)
I cannot explain the results:
[[1]]
[1] "" "taking" "active" "step" "" "tackle"
[[2]]
[1] "" "numb" "" "measure" "" "" "taking" ""
[9] "support"
[[3]]
[1] "" "caught" "" "committing" "" "decent"
[7] "act"
For example, why do "commit" and "take" appear in their ing-form? Why did "number" become "numb"?
I think the answer is mostly that this is just the way hunspell stems. We can check this with a simpler example:
hunspell::hunspell_stem("taking")
#> [[1]]
#> [1] "taking"
hunspell::hunspell_stem("committing")
#> [[1]]
#> [1] "committing"
The ing-form is the only option offered by hunspell. To me this doesn't make much sense either, and my suggestion would be to use a different stemmer. And while we're at it, I think you would also benefit from switching to quanteda instead of tm:
library(quanteda)
sentences <- c("We're taking proactive steps to tackle ...",
"A number of measures we are taking to support ...",
"We caught him committing an indecent act.")
tokens(sentences, remove_numbers = TRUE) %>%
tokens_tolower() %>%
tokens_wordstem()
#> Tokens consisting of 3 documents.
#> text1 :
#> [1] "we'r" "take" "proactiv" "step" "to" "tackl" "."
#> [8] "." "."
#>
#> text2 :
#> [1] "a" "number" "of" "measur" "we" "are" "take"
#> [8] "to" "support" "." "." "."
#>
#> text3 :
#> [1] "we" "caught" "him" "commit" "an" "indec" "act" "."
The workflow is a lot cleaner in my opinion, and the results make a bit more sense to me. quanteda uses the SnowballC package to do the stemming here, which you could integrate into your tm workflow if you wanted. tokens objects hold the text in the same order as the input object but tokenised (i.e., split into words).
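If you would rather stay outside quanteda, a minimal sketch of calling SnowballC directly could look like the following. The whitespace split is only a stand-in for whatever tokeniser you already use, and the example sentences are shortened versions of yours, so treat both as assumptions rather than a recommended pipeline.

```r
library(SnowballC)

# Shortened, already lower-cased example sentences (illustrative only)
sentences <- c("we are taking proactive steps",
               "we caught him committing an indecent act")

# Split each sentence on whitespace (a deliberately simple tokeniser)
words <- strsplit(sentences, "\\s+")

# Stem every word with the Snowball English stemmer
stemmed <- lapply(words, wordStem, language = "english")

# Glue the stems back together, one string per sentence
stemmed_sentences <- vapply(stemmed, paste, character(1), collapse = " ")
stemmed_sentences
```

Because the result is a plain character vector again, it can be passed on to the tm functions you already use, such as removeWords.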
If you still wanted to use hunspell, you could do so with the following function, which fixes some of the problems you seem to have. It takes the first stem hunspell offers instead of the last one ("number" is now correct):
stem_hunspell <- function(toks) {
  # look up the term in the dictionary
  stems <- vapply(hunspell::hunspell_stem(types(toks)), "[", 1, FUN.VALUE = character(1))
  # if there are no stems, use the original term
  stems[nchar(stems) == 0] <- types(toks)[nchar(stems) == 0]
  tokens_replace(toks, types(toks), stems, valuetype = "fixed")
}
tokens(sentences, remove_numbers = TRUE) %>%
tokens_tolower() %>%
stem_hunspell()
#> Tokens consisting of 3 documents.
#> text1 :
#> [1] "we're" "taking" "active" "step" "to" "tackle" "." "."
#> [9] "."
#>
#> text2 :
#> [1] "a" "number" "of" "measure" "we" "are" "taking"
#> [8] "to" "support" "." "." "."
#>
#> text3 :
#> [1] "we" "caught" "him" "committing" "an"
#> [6] "decent" "act" "."
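To see how much the result depends on which candidate stem is picked, one could list the first and the last stem hunspell returns for a few words side by side. This is only a sketch: pick_stem is a hypothetical helper written for this comparison, not part of hunspell, and what it prints depends on the dictionary installed on your machine.

```r
library(hunspell)

words <- c("number", "taking", "committing")

# hunspell_stem() returns a list with one character vector of
# candidate stems per input word
candidates <- hunspell_stem(words)

# Hypothetical helper: choose a stem by position and fall back to
# the word itself when hunspell offers no stem at all
pick_stem <- function(word, stems, pick = c("first", "last")) {
  pick <- match.arg(pick)
  if (length(stems) == 0) return(word)
  if (pick == "first") stems[[1]] else stems[[length(stems)]]
}

# One row per word: the stem you get with each strategy
comparison <- data.frame(
  word  = words,
  first = mapply(pick_stem, words, candidates, MoreArgs = list(pick = "first")),
  last  = mapply(pick_stem, words, candidates, MoreArgs = list(pick = "last")),
  row.names = NULL
)
comparison
```

Rows where the two columns differ are the words for which hunspell offers more than one stem, which is where the "use the last one" rule in the original function and the "use the first one" rule above diverge.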