I want to analyse a document for items such as letters, bigrams, and words, and compare how frequent they are in my document with how frequent they are across a large corpus of documents.
The idea is that words such as "if", "and", and "the" are common in all documents, but some words will be much more common in this document than is typical for the corpus.
This must be a pretty standard technique. What is it called? Doing it the obvious way, I always had a problem with novel words (words that appear in my document but not in the corpus) being rated as infinitely significant. How is this dealt with?
Most likely you've already come across tf-idf, or some other metric from the Okapi BM25 family.
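For instance, here is a minimal tf-idf sketch in plain Python; the `tf_idf` function name and the toy corpus are illustrative assumptions, not a reference implementation. Note how the `+ 1` in the idf denominator keeps terms that never occur in the corpus finite, which is one quick answer to your infinity problem:

```python
import math
from collections import Counter

def tf_idf(document_tokens, corpus_documents):
    """Score each term in document_tokens by tf-idf against corpus_documents
    (a list of token lists). Higher scores mean the term is unusually
    frequent in this document relative to the corpus."""
    tf = Counter(document_tokens)
    n_docs = len(corpus_documents)
    scores = {}
    for term, count in tf.items():
        # Document frequency: how many corpus documents contain the term.
        df = sum(1 for doc in corpus_documents if term in doc)
        # The "+ 1" smooths novel terms (df == 0) so idf stays finite,
        # and the (1 + n_docs) numerator keeps idf non-negative.
        idf = math.log((1 + n_docs) / (1 + df))
        scores[term] = (count / len(document_tokens)) * idf
    return scores

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"]]
# "zebra" is novel but gets a finite score; "the" scores near zero.
print(tf_idf(["the", "zebra", "ran"], corpus))
```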
You can also check the Natural Language Toolkit (NLTK) for some ready-made solutions.
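For example, the raw counting side (words and bigrams) is a few lines with NLTK's `FreqDist`; the sample text here is made up, and the `punkt` tokenizer data needs a one-time download:

```python
import nltk
from nltk import FreqDist, bigrams
from nltk.tokenize import word_tokenize

# nltk.download("punkt")  # one-time download of tokenizer data

text = "the cat sat on the mat and the cat slept"
tokens = word_tokenize(text.lower())

word_freq = FreqDist(tokens)             # unigram counts
bigram_freq = FreqDist(bigrams(tokens))  # bigram counts

print(word_freq.most_common(3))
print(bigram_freq.most_common(3))
```

You would build one `FreqDist` for your document and one for the corpus, then compare relative frequencies item by item.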
UPDATE: As for novel words, smoothing should be applied: Good-Turing, Laplace, etc.
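As a rough illustration of the Laplace (add-one) case, here is a hand-rolled sketch, not NLTK's own implementation; the choice of reserving one extra vocabulary slot for unseen words is an assumption of this example:

```python
from collections import Counter

def laplace_probability(term, corpus_tokens, vocabulary_size):
    """Add-one smoothed probability: every term, including unseen ones,
    gets a nonzero count, so ratios against the corpus stay finite."""
    counts = Counter(corpus_tokens)
    return (counts[term] + 1) / (len(corpus_tokens) + vocabulary_size)

corpus = ["the", "cat", "sat", "on", "the", "mat"]
vocab = len(set(corpus)) + 1  # +1 slot for unseen words (an assumption)

print(laplace_probability("the", corpus, vocab))    # seen word
print(laplace_probability("zebra", corpus, vocab))  # unseen word, still > 0
```

With the corpus probability never zero, the document-to-corpus frequency ratio for a novel word is large but finite instead of infinite.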