nlp stanford-nlp opennlp corpus

Resource that provides the number of documents in which a term occurs


I am looking for resources that provide the number of documents in which a term occurs. For example, there are about 25 billion documents containing the term "the" in the indexed internet.


Solution

  • I don't know of any document frequency lists for large corpora such as the web, but some term frequency lists are available. For example, there are the frequency lists from the web corpora compiled by the Web-As-Corpus Kool Yinitiative, which include the 2-billion-word ukWaC English web corpus. Alternatively, there are the n-grams from the Google Books Corpus.

    It has been shown that such term frequency counts can be used to reliably approximate document frequency counts.
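
    To make the distinction between the two counts concrete, here is a minimal Python sketch that computes both on a toy corpus. The function name, the whitespace tokenisation, and the three sample documents are assumptions made for illustration only; they do not come from any of the resources mentioned above.

```python
from collections import Counter


def term_and_document_frequencies(documents):
    """Return (tf, df) Counters for an iterable of document strings.

    tf[term] = total number of occurrences of the term in the corpus
    df[term] = number of documents that contain the term at least once
    Tokenisation is a naive lowercase whitespace split (an assumption).
    """
    tf = Counter()
    df = Counter()
    for text in documents:
        tokens = text.lower().split()
        tf.update(tokens)       # every occurrence counts
        df.update(set(tokens))  # each document counts once per term
    return tf, df


# Hypothetical toy corpus, for illustration only.
toy_corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang",
]

tf, df = term_and_document_frequencies(toy_corpus)
for term in ("the", "cat", "bird"):
    print(term, "tf =", tf[term], "df =", df[term])
```

    A term's document frequency can never exceed its term frequency, and the two are closest for words that rarely repeat within a single document. How well one approximates the other on a given corpus is something you would need to check on a sample of your own data.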