r quanteda

Drop documents from corpus in Quanteda if two conditions are met


corpus_subset specifies the documents that should be kept, but how do I specify the documents to drop? Assume, for example, that I want to drop every document in which the term "terrorism" appears, but only if the document dates from before 2001.

dfm_terror <- dfm_select(dfm(tokens(data_corpus_inaugural)), pattern = "terrorism", valuetype = "fixed")
docvars(data_corpus_inaugural, "Terrorism") <- dfm_terror

documents_to_remove <- corpus_subset(data_corpus_inaugural, Terrorism >= 1 & Year < 2001)

Solution

  • As you correctly describe, corpus_subset keeps the documents that match the condition, so Terrorism >= 1 & Year < 2001 returns the document below.

                Year President FirstName Terrorism
    1981-Reagan 1981    Reagan    Ronald         1
    

    To get the reverse, negate the condition with !. This selects every document except the one listed above.

    corpus_subset(data_corpus_inaugural, !(Terrorism >= 1 & Year < 2001))
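The negation can be checked on a small corpus where the expected result is obvious. The following is a minimal sketch, not part of the original answer: the corpus `toy`, its texts, and its `Year` values are made up for illustration, and it assumes a recent quanteda release in which a dfm is built from `tokens()`.

```r
library(quanteda)

# Hypothetical toy corpus: three short documents with a Year docvar
toy <- corpus(
  c(d1 = "Terrorism is a threat.",
    d2 = "Peace and prosperity.",
    d3 = "We will defeat terrorism."),
  docvars = data.frame(Year = c(1981, 1985, 2005))
)

# Count occurrences of "terrorism" per document and store them as a docvar
toy_dfm <- dfm_select(dfm(tokens(toy)), pattern = "terrorism", valuetype = "fixed")
docvars(toy, "Terrorism") <- unname(rowSums(toy_dfm))

# The condition describes the documents to drop
dropped <- corpus_subset(toy, Terrorism >= 1 & Year < 2001)

# Negating the same condition keeps everything else
kept <- corpus_subset(toy, !(Terrorism >= 1 & Year < 2001))

docnames(dropped)  # "d1"
docnames(kept)     # "d2" "d3"
```

Only d1 both mentions the term and predates 2001, so it is the only document dropped. d3 mentions the term but is from 2005, so it is kept. By De Morgan's laws the negated condition could also be written as Terrorism < 1 | Year >= 2001, but wrapping the original condition in !( ) keeps the intent easier to read.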