How can I find and count words that are NOT in a given dictionary?
The example below counts every time specific dictionary words (clouds and storms) appear in the text.
library("quanteda")
txt <- "Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents."
mydict <- dictionary(list(all_terms = c("clouds", "storms")))
dfmat <- tokens(txt) %>%
  tokens_select(mydict) %>%
  dfm()
dfmat
The output:
docs clouds storms
text1 1 1
How can I instead generate a count of all words that are NOT in the dictionary (clouds/storms)? Ideally with stopwords excluded.
E.g., desired output:
docs Forty-four Americans ...
text1 1 1
The help file for tokens_select (run ?tokens_select) shows that its third argument is selection. The default value is "keep", but what you want is "remove". Since this is a common thing to do, there is also a dedicated tokens_remove command, which I use below to remove stopwords.
dfmat <- tokens(txt) %>%
  tokens_select(mydict, selection = "remove") %>%
  tokens_remove(stopwords::stopwords(language = "en")) %>%
  dfm()
dfmat
#> Document-feature matrix of: 1 document, 38 features (0.00% sparse) and 0 docvars.
#>        features
#> docs    forty-four americans now taken presidential oath . words spoken rising
#>   text1          1         1   1     2            1    2 4     1      1      1
#> [ reached max_nfeat ... 28 more features ]
I think this is what you are trying to do.
Created on 2021-12-28 by the reprex package (v2.0.1)
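Note that the output above still counts the full stop as a feature. If that is unwanted, one possible variation is to drop punctuation when tokenizing and to pass the dictionary straight to tokens_remove. The sketch below was not part of the reprex; it assumes a shortened stand-in sentence instead of the full text and calls each step separately rather than piping.

```r
library(quanteda)

# Shortened stand-in for the speech text in the question (an assumption
# made to keep the example small)
txt <- "Yet, every so often the oath is taken amidst gathering clouds and raging storms."
mydict <- dictionary(list(all_terms = c("clouds", "storms")))

# Drop punctuation at tokenization time
toks <- tokens(txt, remove_punct = TRUE)

# Remove the dictionary matches, then the English stopwords
toks <- tokens_remove(toks, mydict)
toks <- tokens_remove(toks, stopwords("en"))

# Build the document-feature matrix and list the surviving features
dfmat <- dfm(toks)
featnames(dfmat)
```

Here tokens_remove(toks, mydict) should behave the same as tokens_select(toks, mydict, selection = "remove"), so which one to use is a matter of taste.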