Tags: r, tm, corpus, term-document-matrix

How to select only a subset of corpus terms for TermDocumentMatrix creation in tm


I have a huge corpus, and I'm interested only in the occurrences of a handful of terms that I know up front. Is there a way to create a term-document matrix from the corpus with the tm package so that only the terms I specify are used and included?

I know I can subset the resulting TermDocumentMatrix, but I want to avoid building the full term-document matrix in the first place because of memory constraints.


Solution

  • You can modify a corpus to keep only the terms you want by building a custom transformation function. See the tm package vignette and the help page for content_transformer for more information:

    library(tm)
    
    # Create a corpus from the text listed below
    corp = VCorpus(VectorSource(doc))
    
    # Custom function to keep only the text matching "pattern" and remove everything else.
    # Note: regmatches() returns a list, so the document content becomes a list of matches.
    (f <- content_transformer(function(x, pattern) 
      regmatches(x, gregexpr(pattern, x, perl=TRUE, ignore.case=TRUE))))
    

    (FYI, the regmatches() call in the function above is adapted from another SO answer.)

    # The pattern we'll search for (a regular expression, so it also matches inside longer words)
    keep = "sleep|dream|die"
    
    # Run the transformation function using the pattern above
    tm_map(corp, f, keep)[[1]]
    

    Here's the result of running the transformation function:

    <<PlainTextDocument (metadata: 7)>>
      c("die", "sleep", "sleep", "die", "sleep", "sleep", "Dream")
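
    The filtered content above is a list of matches rather than plain text. As a variation that is not part of the original answer, the sketch below pastes the matches back into a single string so the filtered corpus can be passed to TermDocumentMatrix, and it adds \\b word boundaries so that "sleeper" or "died" are not counted. The sample text and the names txt, keep_terms, pattern and filtered are assumptions made for illustration.

    ```r
    library(tm)

    # Hypothetical sample text, not the corpus used in the answer
    txt <- c("To die, to sleep, perchance to Dream",
             "The sleeper died in a dream")
    corp <- VCorpus(VectorSource(txt))

    # Keep whole-word matches only and paste them into one string,
    # so each document's content stays a plain character vector
    keep_terms <- content_transformer(function(x, pattern) {
      hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE, ignore.case = TRUE))
      paste(unlist(hits), collapse = " ")
    })

    # \\b marks a word boundary, so "sleeper" and "died" do not match
    pattern <- "\\b(sleep|dream|die)\\b"
    filtered <- tm_map(corp, keep_terms, pattern)

    # The matrix is built from the filtered text only
    tdm <- TermDocumentMatrix(filtered)
    m <- as.matrix(tdm)
    m
    ```

    Every document is still scanned once by the regular expression, but the term-document matrix is built only from the surviving words, so its vocabulary is limited to the terms in the pattern.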
    

    Here's the original text I used to create the corpus:

    doc = "To be, or not to be, that is the question—
    Whether 'tis Nobler in the mind to suffer
    The Slings and Arrows of outrageous Fortune,
    Or to take Arms against a Sea of troubles,
    And by opposing, end them? To die, to sleep—
    No more; and by a sleep, to say we end
    The Heart-ache, and the thousand Natural shocks
    That Flesh is heir to? 'Tis a consummation
    Devoutly to be wished. To die, to sleep,
    To sleep, perchance to Dream; Aye, there's the rub"
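
    A possible alternative, not taken from the answer above, is the dictionary entry of the control list accepted by TermDocumentMatrix, which tabulates only against the terms you supply. The sketch below uses a small hypothetical text without punctuation; the names txt and terms_of_interest are assumptions made for illustration.

    ```r
    library(tm)

    # Hypothetical sample text, used here only for illustration
    txt <- c("to die to sleep", "to sleep perchance to dream")
    corp <- VCorpus(VectorSource(txt))

    # Terms known up front; the matrix should contain only these rows
    terms_of_interest <- c("sleep", "dream", "die")

    # dictionary restricts the tabulated terms to the supplied vector
    tdm <- TermDocumentMatrix(corp,
                              control = list(dictionary = terms_of_interest))
    m <- as.matrix(tdm)
    m
    ```

    For text with punctuation, such as the soliloquy above, you would probably also want removePunctuation = TRUE in the control list so that tokens like "die," are matched. Each document is still tokenized in full, but the resulting matrix has one row per dictionary term instead of one row per distinct word in the corpus.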