Using the tm package in R, how can I score a document in terms of its uniqueness? I want to somehow separate documents with very unique words from documents that contain often-used words.
I know how to find the most and least frequently used words with e.g. findFreqTerms, but how do I score a document with regard to its uniqueness?
I am struggling to come up with a good solution.
A good starting point for assessing which words are used only in some documents is the so-called tf-idf weighting (see the tidytext package vignette). This assigns a score to each (word, document) combination, so once you have that calculated you can try summarizing along the 'document' margin, maybe literally just with colMeans (or rowMeans if your documents are rows, as in tm's DocumentTermMatrix), to get a sense of how many relatively unique terms each document uses.
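As a rough sketch of that idea with tm: the toy corpus and the names docs and uniqueness below are made up for illustration, and averaging the tf-idf scores per document is only one possible summary, not an established uniqueness measure.

```r
library(tm)

# Hypothetical toy corpus: two near-identical documents and one unusual one.
docs <- c(
  common1 = "the cat sat on the mat with the dog",
  common2 = "the dog sat on the mat with the cat",
  unusual = "quantum chromodynamics describes gluon confinement"
)

corpus <- VCorpus(VectorSource(docs))

# Build a document-term matrix weighted by tf-idf instead of raw counts.
dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))

# Documents are rows in a DocumentTermMatrix, so summarise with rowMeans();
# with a TermDocumentMatrix the documents are columns and colMeans() applies.
uniqueness <- rowMeans(as.matrix(dtm))
names(uniqueness) <- names(docs)

# Higher average tf-idf suggests more document-specific vocabulary.
sort(uniqueness, decreasing = TRUE)
```

With this toy input the "unusual" document should come out on top, because none of its terms appear in the other two documents. Note that tm drops words shorter than three characters by default, so very short tokens do not contribute to the score.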
To separate documents, a weighting scheme like tf-idf may be better than just finding the rarest overall tokens: a word used once in most documents is treated quite differently from a word used several times in just a few documents, even when both have the same total count.
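To make that contrast concrete, here is a small worked example in base R. The counts are invented, and it assumes raw counts as term frequency and a natural-log idf; the packages use slightly different defaults (tm, for instance, normalizes term frequency and uses log base 2), so treat the exact numbers as illustrative only.

```r
# Hypothetical term counts over 10 documents (rows = terms, columns = documents).
# "alpha" occurs once in 8 documents; "beta" occurs 4 times in 2 documents.
counts <- rbind(
  alpha = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0),
  beta  = c(4, 4, 0, 0, 0, 0, 0, 0, 0, 0)
)
colnames(counts) <- paste0("doc", 1:10)

# Both terms are equally frequent overall.
rowSums(counts)

# Inverse document frequency: log(number of documents / documents containing the term).
n_docs   <- ncol(counts)
doc_freq <- rowSums(counts > 0)
idf      <- log(n_docs / doc_freq)

# idf has one entry per row, so it recycles down each column and scales each term's row.
tfidf <- counts * idf

round(tfidf[, 1:3], 3)
```

Although both terms have the same total count, "beta" gets a much larger score in the two documents that contain it, which is the behaviour that helps separate documents with distinctive vocabulary.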
The R packages tm, tidytext, and quanteda all have functions to calculate this (weightTfIdf, bind_tf_idf, and dfm_tfidf, respectively).