I am trying to build a Doc2Vec model. I have a list of sentences with their labels, created using Gensim’s LabeledSentence class. In the examples I have seen, build_vocab() is called on the labeled sentences before the model is trained.
Can someone explain what build_vocab() does, and what happens if I don't use it?
Please check out the following pictures:
The build_vocab() step is how the model discovers the set of all possible words and doc-tags and, in the case of words, finds which words occur at least min_count times.
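To make the discovery step concrete, here is a rough pure-Python sketch of the idea. It is not Gensim's actual implementation: the function name scan_vocab and the (words, tags) tuple layout are made up for illustration, and the real build_vocab() does more work (for example, preparing the model's internal weights).

```python
from collections import Counter

def scan_vocab(documents, min_count=2):
    """Hypothetical helper: count words and collect doc-tags in one pass."""
    word_counts = Counter()
    tags = set()
    for words, doc_tags in documents:
        word_counts.update(words)   # tally every word occurrence
        tags.update(doc_tags)       # remember every tag seen
    # Keep only words that occur at least min_count times.
    kept_words = {w for w, c in word_counts.items() if c >= min_count}
    return kept_words, tags

# Toy corpus: each item is (list of words, list of tags).
docs = [
    (["the", "cat", "sat"], ["doc_0"]),
    (["the", "dog", "sat"], ["doc_1"]),
    (["the", "bird", "flew"], ["doc_2"]),
]

kept_words, known_tags = scan_vocab(docs, min_count=2)
print(sorted(kept_words))   # ['sat', 'the']
print(sorted(known_tags))   # ['doc_0', 'doc_1', 'doc_2']
```

Words that fall below the threshold ("cat", "dog", "bird", "flew") are simply unknown to the model afterwards, which is why training cannot start before this pass has run.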
You have to use it: an attempt to train a model that hasn't gone through that discovery step will error.
(If you use the form of model-instantiation where you supply your corpus when creating the object, both build_vocab() and train() will be called automatically for you.)
Separately, regarding your mention of LabeledSentence:
To stay up to date with preferred terminology/types, you should be using the TaggedDocument class. The individual training items are better described as documents than sentences, and within the model their whole-text doc-vector keys are called tags, not labels. (In some cases, they might also be the sort of labels used by classifiers, but not always, and most typically the tags are unique per-document IDs. So the term 'tag' is preferred in the code to discourage conflating these keys-for-doc-vectors with other things that might be 'labels'.)
(The LabeledSentence class was an older name, and is now simply an alias to TaggedDocument. So using it as a name will work, but is mismatched with the rest of the code.)