I'm working with the quanteda package in R at the moment, and I'd like to compute ngrams over a set of stemmed words to get a quick-and-dirty estimate of which content words tend to appear near each other. If I try:
twitter.files <- textfile(files)
twitter.docs <- corpus(twitter.files)
twitter.semantic <- twitter.docs %>%
    dfm(removeTwitter = TRUE, ignoredFeatures = stopwords("english"),
        ngrams = 2, skip = 0:3, stem = TRUE) %>%
    trim(minCount = 50, minDoc = 2)
It only stems the final word in the bigrams. However, if I try to stem first:
twitter.files <- textfile(files)
twitter.docs <- corpus(twitter.files)
stemmed_no_stops <- twitter.docs %>%
    toLower %>%
    tokenize(removePunct = TRUE, removeTwitter = TRUE) %>%
    removeFeatures(stopwords("english")) %>%
    wordstem
twitter.semantic <- stemmed_no_stops %>%
    skipgrams(n = 2, skip = 0:2) %>%
    dfm %>%
    trim(minCount = 25, minDoc = 2)
Then quanteda doesn't know how to work with the stemmed list; I get the error:
assignment of an object of class “NULL” is not valid for @‘ngrams’
in an object of class “dfmSparse”; is(value, "integer") is not TRUE
Is there an intermediate step I can take to use a dfm on the stemmed words, or a way to tell dfm
to stem first and form the ngrams second?
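For reference, the symptom is easy to reproduce on a single joined bigram. A minimal sketch, assuming SnowballC (the stemmer quanteda wraps); the exact stems may differ slightly by stemmer:

library(SnowballC)
# The stemmer sees the joined bigram as one word, so only the
# word-final suffix is altered; the first word is left unstemmed
wordStem("taxes_paying", language = "english")        # something like "taxes_pay"
wordStem(c("taxes", "paying"), language = "english")  # "tax" "pay"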
I tried reproducing your example with the inaugural texts included in the package data. Rewritten in the current quanteda syntax (the tokens_* functions and dfm_trim), your second approach works for me:
twitter.docs <- corpus(data_corpus_inaugural[1:5])
stemmed_no_stops <- twitter.docs %>%
    tokens(remove_punct = TRUE, remove_twitter = TRUE) %>%
    tokens_tolower() %>%
    tokens_remove(stopwords("english")) %>%
    tokens_wordstem()
lapply(stemmed_no_stops, head)
## $`1789-Washington`
## [1] "fellow-citizen" "senat" "hous" "repres" "among"
## [6] "vicissitud"
##
## $`1793-Washington`
## [1] "fellow" "citizen" "call" "upon" "voic" "countri"
##
## $`1797-Adams`
## [1] "first" "perceiv" "earli" "time" "middl" "cours"
##
## $`1801-Jefferson`
## [1] "friend" "fellow" "citizen" "call" "upon" "undertak"
##
## $`1805-Jefferson`
## [1] "proceed" "fellow" "citizen" "qualif" "constitut" "requir"
twitter.semantic <- stemmed_no_stops %>%
    tokens_skipgrams(n = 2, skip = 0:2) %>%
    dfm() %>%
    dfm_trim(min_count = 5, min_doc = 2)
twitter.semantic[1:5, 1:4]
# Document-feature matrix of: 5 documents, 4 features.
# 5 x 4 sparse Matrix of class "dfmSparse"
# features
# docs fellow_citizen let_u unit_state foreign_nation
# 1789-Washington 2 0 2 0
# 1793-Washington 1 0 0 0
# 1797-Adams 0 0 3 5
# 1801-Jefferson 5 5 0 0
# 1805-Jefferson 8 2 1 1
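From there, topfeatures() gives a quick ranked view of the most common stemmed skipgrams across the corpus, which is close to the quick-and-dirty estimate you described (a sketch; the n = 10 cutoff is arbitrary):

topfeatures(twitter.semantic, n = 10)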