r, token, quanteda

How do I delete rare terms from a dfm?


I created a dfm using tokens() from the quanteda package (its size is roughly 40 x 2000). I now want to remove all tokens that appear in fewer than 15% of the documents. I am not very experienced in R and I don't know how to proceed. Is there a way to use the docfreq variable from textstat_frequency(), or do I have to use tokens_select() and a series of if() statements?


Solution

  • Yes, you want dfm_trim(), which lets you specify a document frequency threshold either as a count of documents or as a proportion of documents.

    (Note: Once in a dfm, the word dimension elements are no longer tokens, but rather "features" in quanteda terminology.)

    Using a built-in example, the code below shows how to use dfm_trim() with a minimum document frequency threshold of 0.15 and a document frequency type of "prop", which treats the threshold you supply as a proportion. The number of features drops from 9,360 to 1,304, so the trimming is substantial.

    library("quanteda")
    ## Package version: 2.0.1
    
    dfmat <- dfm(data_corpus_inaugural)
    print(dfmat, max_ndoc = 0, max_nfeat = 0)
    ## Document-feature matrix of: 58 documents, 9,360 features (91.8% sparse) and 4 docvars.
    
    dfm_trim(dfmat, min_docfreq = 0.15, docfreq_type = "prop") %>%
      print(max_ndoc = 0, max_nfeat = 0)
    ## Document-feature matrix of: 58 documents, 1,304 features (65.3% sparse) and 4 docvars.
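To connect this back to the docfreq idea in the question, here is a minimal sketch on a made-up four-document corpus (the texts and the 0.4 threshold are illustrative assumptions, not from the original post). It builds the dfm from tokens(), which I believe is also the form newer quanteda releases expect, and shows that trimming by proportion, trimming by count, and filtering on docfreq() by hand should all keep the same features.

```r
library("quanteda")

# Hypothetical toy corpus, chosen only so the document frequencies are easy to see
txt <- c(d1 = "apple banana cherry",
         d2 = "apple banana",
         d3 = "apple date",
         d4 = "apple elderberry")

# Tokenize first, then build the document-feature matrix
toks <- tokens(txt)
dfmat <- dfm(toks)

# Share of documents in which each feature occurs
prop <- docfreq(dfmat) / ndoc(dfmat)
print(prop)

# Keep features that occur in at least 40% of the documents
dfmat_prop <- dfm_trim(dfmat, min_docfreq = 0.4, docfreq_type = "prop")

# The same cut expressed as a document count (at least 2 of 4 documents)
dfmat_count <- dfm_trim(dfmat, min_docfreq = 2, docfreq_type = "count")

# Manual equivalent using docfreq() directly, for comparison only
kept_manually <- names(prop)[prop >= 0.4]

print(featnames(dfmat_prop))
print(featnames(dfmat_count))
print(kept_manually)
```

For the case in the question, replacing 0.4 with 0.15 gives the 15% cut-off; no tokens_select() call or chain of if() statements is needed.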