I created a dfm using tokens() from the quanteda package (its size is roughly 40 × 2000). I now want to remove all tokens that appear in fewer than 15% of the documents. I am not very experienced in R and I don't know how to proceed. Is there a way to use the docfreq variable from textstat_frequency(), or do I have to use tokens_select() and a series of if() statements?
Yes, you want dfm_trim(), which allows you to specify a document frequency threshold in terms of either counts (of documents) or proportions of documents.
(Note: Once in a dfm, the word dimension elements are no longer tokens, but rather "features" in quanteda terminology.)
Using a built-in example, the code below shows how to use dfm_trim() with a minimum document frequency threshold of 0.15 and a document frequency type of "prop", which treats the threshold you supply as a proportion. You can see from the change in the number of features (from 9,360 to 1,304) that there has been significant trimming.
library("quanteda")
## Package version: 2.0.1
dfmat <- dfm(data_corpus_inaugural)
print(dfmat, max_ndoc = 0, max_nfeat = 0)
## Document-feature matrix of: 58 documents, 9,360 features (91.8% sparse) and 4 docvars.
dfm_trim(dfmat, min_docfreq = 0.15, docfreq_type = "prop") %>%
  print(max_ndoc = 0, max_nfeat = 0)
## Document-feature matrix of: 58 documents, 1,304 features (65.3% sparse) and 4 docvars.
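
To make the proportion threshold concrete, here is a minimal base R sketch of the rule on a toy count matrix. It is only an illustration of the idea, not quanteda's implementation: the matrix, the feature names, and the 0.5 cut-off are made up, and it assumes the minimum is inclusive (a feature exactly at the threshold is kept).

```r
# Toy document-feature counts: 4 documents (rows) x 3 features (columns).
# Values and names are hypothetical, chosen only for illustration.
counts <- matrix(
  c(1, 0, 2,
    3, 0, 0,
    1, 1, 0,
    2, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("the", "rare", "some"))
)

# Document frequency: number of documents in which each feature occurs at least once.
doc_freq <- colSums(counts > 0)

# The same quantity expressed as a proportion of all documents.
doc_prop <- doc_freq / nrow(counts)

# Keep only features whose document proportion reaches the threshold
# (assumed inclusive here); drop = FALSE keeps the result a matrix.
min_prop <- 0.5
trimmed <- counts[, doc_prop >= min_prop, drop = FALSE]

print(doc_freq)
print(colnames(trimmed))
```

Here "rare" occurs in only one of four documents and is dropped, while "the" and "some" remain. For the 40-document dfm in the question, a proportion of 0.15 corresponds to 0.15 × 40 = 6 documents, so the same cut could presumably also be written as a count threshold with docfreq_type = "count".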