R text mining with TM: Does a document contain words that are rare

Using the tm package in R, how can I score a document in terms of its uniqueness? I want to separate documents that contain very unusual words from documents that contain only commonly used words.

I know how to find the most and least frequently used words with e.g. findFreqTerms, but how do I score a document with regard to its uniqueness?

I am struggling to come up with a good solution.

Solution

A good starting point for finding words that are used in only some documents is tf-idf weighting (see the tidytext package vignette). It assigns a score to each (word, document) pair, so once you have it calculated you can summarize along the document margin, perhaps with something as simple as colMeans on a term-document matrix (or rowMeans on a document-term matrix, where documents are rows), to get a sense of how many relatively unique terms each document uses.
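As a minimal sketch of that idea with tm, the snippet below builds a tf-idf weighted document-term matrix and averages each document's row. The three toy documents are made up for illustration, and the row mean is only one possible summary, not an established uniqueness measure.

```r
library(tm)

# Toy corpus (made-up text, for illustration only)
docs <- c(
  "the cat sat on the mat",
  "the dog sat on the log",
  "quantum chromodynamics describes gluon interactions"
)
corpus <- VCorpus(VectorSource(docs))

# Documents are rows in a DocumentTermMatrix; weightTfIdf normalizes
# term counts by document length by default (normalize = TRUE)
dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))

# Summarize along the document margin: mean tf-idf over the vocabulary
m <- as.matrix(dtm)
score <- rowMeans(m)

# Documents ordered from most to least "unique" under this score
order(score, decreasing = TRUE)
```

With these toy texts the third document shares no words with the others, so it should come out on top. On real data you would probably want to lowercase, strip punctuation, and remove stopwords before weighting.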

To separate documents, a weighting scheme like tf-idf may be better than just finding the rarest overall tokens: a rare word used once in most documents is treated quite differently than a word used several times in just a few documents.
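The contrast can be seen in a small base R sketch. The count matrix is hypothetical: the terms "spread" and "bursty" both occur four times overall, but "spread" appears once in four documents while "bursty" appears four times in a single document. The tf-idf formula used here (length-normalized tf times log2 idf) is one common variant, chosen for illustration.

```r
# Hypothetical term counts: rows are documents, columns are terms
counts <- matrix(
  c(1, 0, 5,
    1, 0, 5,
    1, 0, 5,
    1, 4, 5,
    0, 0, 5),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("d", 1:5), c("spread", "bursty", "filler"))
)

# Overall frequency cannot tell "spread" and "bursty" apart
colSums(counts)

# tf: share of each term within its document
tf <- counts / rowSums(counts)

# idf: larger for terms that occur in fewer documents
idf <- log2(nrow(counts) / colSums(counts > 0))

# tf-idf: scale each term's column by its idf
tfidf <- sweep(tf, 2, idf, "*")

# Per-document summary; d4 stands out because of "bursty"
doc_score <- rowSums(tfidf)
```

Ranking terms by total count alone would treat "spread" and "bursty" as equally rare, whereas the tf-idf scores single out d4, the only document that uses "bursty".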

The R packages tm (weightTfIdf), tidytext (bind_tf_idf), and quanteda (dfm_tfidf) all have functions to calculate this.