I am looking for an explanation in very simple terms, because I have read a lot of blog posts that only confused me further.
Suppose I have five rows in my DataFrame:
1. This is Foo, how can I help you
2. It might rain today
3. I love football
4. Crazy, Stupid & Love
5. I shot the sheriff
Given these rows, can anyone help me understand which part should be called a document, which the vocabulary, and which the corpus?
In NLP, the notion of a document is somewhat flexible: a document is whatever unit of text you choose to analyze, so it can be an entire document, a passage, a single sentence, and so on.
In your example, "This is Foo, how can I help you" is a document, "It might rain today" is another document, and so on.
A corpus is a collection of documents. In your example, the corpus is composed of 5 documents.
The vocabulary is the set of all distinct words contained in the corpus, that is, every unique word that appears in any of the documents.
Your vocabulary (lowercased, with the commas removed) is:
[&, can, crazy, foo, football, help, how, i, is, it, love, might, rain, sheriff, shot, stupid, the, this, today, you]
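
If it helps to see the three terms side by side, here is a minimal sketch in plain Python. The list `corpus` stands in for the DataFrame column, and the `tokenize` helper is a hypothetical, deliberately naive tokenizer (lowercase, keep runs of letters and a standalone "&"); real NLP libraries tokenize differently, so treat this as an illustration rather than the standard way to do it.

```python
import re

# The corpus: the whole collection. Each string in it is one document.
corpus = [
    "This is Foo, how can I help you",
    "It might rain today",
    "I love football",
    "Crazy, Stupid & Love",
    "I shot the sheriff",
]

def tokenize(document):
    # Assumed tokenization rule: lowercase the text, then keep
    # runs of letters and a standalone "&"; commas are dropped.
    return re.findall(r"[a-z]+|&", document.lower())

# The vocabulary: every distinct token found in any document, sorted.
vocabulary = sorted({token for document in corpus for token in tokenize(document)})

print(len(corpus))      # number of documents
print(len(vocabulary))  # number of distinct tokens
print(vocabulary)
```

With this tokenization rule, the printed vocabulary matches the list above: 5 documents and 20 distinct tokens. A different rule (for example, dropping "&" or removing stop words) would give a different vocabulary for the same corpus.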