
Corpus vs Vocabulary vs Document in NLP


I am looking for an explanation in very simple terms, because I have read a lot of blogs and they only confused me further:

  1. read but couldn't understand
  2. read but couldn't understand

Suppose I have five rows in my DataFrame:

1. This is Foo, how can I help you
2. It might rain today
3. I love football
4. Crazy, Stupid & Love
5. I shot the sheriff

Given this data, can anyone help me understand which part should be called a Document, which a Vocabulary, and which a Corpus?


Solution

  • In NLP, the concept of a document is somewhat loose: a document is simply the unit of text you choose to work with, so it can be an entire document, a passage, a single sentence, and so on. In your example, each row is one document: "This is Foo, how can I help you" is a document, "It might rain today" is another document, and so on.

    A corpus is a collection of documents. In your example, the corpus is composed of 5 documents.

    The vocabulary is the set of all distinct words contained in the corpus, that is, every word that appears in at least one document, counted once. After lowercasing, your vocabulary is [&, can, crazy, foo, football, help, how, i, is, it, love, might, rain, sheriff, shot, stupid, the, this, today, you]
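
The three terms above can be sketched in a few lines of plain Python. This is only an illustration, not the one correct way to do it: the `tokenize` helper and its rule (lowercase everything, keep runs of letters or digits, and keep "&" as its own token) are assumptions chosen so that the result matches the vocabulary listed above. A real tokenizer, for example one from NLTK, may split the text differently.

```python
import re

# The corpus: a collection of documents (here, one document per DataFrame row).
corpus = [
    "This is Foo, how can I help you",
    "It might rain today",
    "I love football",
    "Crazy, Stupid & Love",
    "I shot the sheriff",
]


def tokenize(document):
    """Hypothetical tokenizer: lowercase, keep alphanumeric runs and '&'."""
    return re.findall(r"[a-z0-9]+|&", document.lower())


# The vocabulary: every distinct token that occurs anywhere in the corpus.
vocabulary = sorted({token for document in corpus for token in tokenize(document)})

print(len(corpus), "documents")
print(len(vocabulary), "vocabulary entries")
print(vocabulary)
```

Note that "I" and "love" each occur in more than one document but appear only once in the vocabulary, because the vocabulary records distinct words rather than occurrences.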