r, text-mining, tm

R text mining with tm: Does a document contain words that are rare?


Using the tm package in R, how can I score a document in terms of its uniqueness? I want to somehow separate documents that contain very unusual words from documents that contain commonly used words.

I know how to find the most and least frequently used words with e.g. findFreqTerms, but how do I score a document with regard to its uniqueness?

I am struggling to come up with a good solution.


Solution

  • A good starting point for assessing which words are used only in some documents is tf-idf (term frequency-inverse document frequency) weighting (see the tidytext package vignette). It assigns a score to each (word, document) combination, so once you have calculated it you can summarize along the document margin to get a sense of how many relatively unique terms each document uses. Something as simple as rowMeans on a document-term matrix (documents in rows), or colMeans on a term-document matrix (documents in columns), may be enough.

    To separate documents, a weighting scheme like tf-idf may be better than just finding the rarest tokens overall: a word used once in most documents is scored quite differently from a word used several times in just a few documents.

    The R packages tm, tidytext, and quanteda all have functions to calculate this: weightTfIdf, bind_tf_idf, and dfm_tfidf, respectively.
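
As a rough illustration of the idea above, here is a minimal sketch using tm. The three toy documents and the variable names (docs, corpus, dtm, uniqueness) are made up for the example, and summing each document's tf-idf weights is only one possible way to summarize them, not an established uniqueness measure.

```r
library(tm)

# Toy corpus (hypothetical data): two documents share common words,
# the third uses only words that appear nowhere else.
docs <- c(
  "the cat sat on the mat",
  "the dog sat on the log",
  "quantum chromodynamics describes gluon confinement"
)
corpus <- VCorpus(VectorSource(docs))

# Document-term matrix with tf-idf weights: documents in rows, terms in columns.
# weightTfIdf normalizes term frequency by document length by default.
dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))

# One score per document: the sum of its tf-idf weights. With normalized tf
# this is the average idf of the document's tokens, so documents made of
# rarely shared words score higher. rowMeans() gives the same ranking.
uniqueness <- rowSums(as.matrix(dtm))

# Documents ordered from most to least "unique"
print(sort(uniqueness, decreasing = TRUE))
```

Note that as.matrix() builds a dense matrix, which is fine for a small corpus but may use a lot of memory on a large one. Removing stopwords first (for example with stopwords = TRUE in the control list) would likely make the scores less sensitive to very common function words.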