nlp · data-science · gensim · text-classification · doc2vec

Understanding the role of the function build_vocab in Doc2Vec


I have recently started studying the Doc2Vec model. I understand its mechanism and how it works, and I'm trying to implement it using the gensim framework. I have transformed my training data into TaggedDocument objects, but I have one question: what is the role of this line: model_dbow.build_vocab([x for x in tqdm(train_tagged.values)])? Is it to create random vectors that represent the text? Thank you for your help.


Solution

  • The Doc2Vec model needs to know several things about the training corpus before it is fully allocated & initialized.

    First & foremost, the model needs to know the words present & their frequencies – a working vocabulary – so that it can determine the words that will remain after the min_count floor is applied, and allocate/initialize word-vectors & internal model structures for the relevant words. The word-frequencies will also be used to influence the random sampling of negative-word-examples (for the default negative-sampling mode) and the downsampling of very-frequent words (per the sample parameter).
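
    The vocabulary scan and the min_count floor can be seen directly. The sketch below uses a hypothetical three-document toy corpus (the data and tag names are invented for illustration); words that appear fewer than min_count times are dropped from the surviving vocabulary:

    ```python
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # hypothetical toy corpus, for illustration only
    corpus = [
        TaggedDocument(words=['the', 'cat', 'sat'], tags=['doc0']),
        TaggedDocument(words=['the', 'dog', 'sat'], tags=['doc1']),
        TaggedDocument(words=['the', 'cat', 'ran'], tags=['doc2']),
    ]

    model = Doc2Vec(vector_size=20, min_count=2, epochs=5)
    model.build_vocab(corpus)

    # 'the' (3x), 'cat' (2x), 'sat' (2x) survive the min_count=2 floor;
    # 'dog' and 'ran' each appear only once and are discarded
    print('the' in model.wv.key_to_index)   # True
    print('dog' in model.wv.key_to_index)   # False
    ```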

    Additionally, the model needs to know the rough size of the overall training set in order to gradually decrement the internal alpha learning-rate over the course of each epoch, and give meaningful progress-estimates in logging output.
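
    The document count gathered during the vocabulary scan is what you then pass to train() so it can pace the learning-rate decay. A minimal sketch (same hypothetical toy corpus as above):

    ```python
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # hypothetical toy corpus, for illustration only
    corpus = [
        TaggedDocument(words=['the', 'cat', 'sat'], tags=['doc0']),
        TaggedDocument(words=['the', 'dog', 'sat'], tags=['doc1']),
        TaggedDocument(words=['the', 'cat', 'ran'], tags=['doc2']),
    ]

    model = Doc2Vec(vector_size=20, min_count=1, epochs=5)
    model.build_vocab(corpus)

    # build_vocab() also counted the documents, so train() knows the
    # corpus size and can decay alpha smoothly across each epoch
    print(model.corpus_count)  # 3
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    ```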

    At the end of build_vocab(), all memory/objects needed for the model have been created. Per the needs of the underlying algorithm, all vectors will have been initialized to low-magnitude random vectors to ready the model for training. (It essentially won't use any more memory, internally, through training.)
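
    You can verify that the vectors already exist immediately after build_vocab(), before any train() call. Again a hedged sketch on a hypothetical toy corpus:

    ```python
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # hypothetical toy corpus, for illustration only
    corpus = [
        TaggedDocument(words=['the', 'cat', 'sat'], tags=['doc0']),
        TaggedDocument(words=['the', 'dog', 'sat'], tags=['doc1']),
        TaggedDocument(words=['the', 'cat', 'ran'], tags=['doc2']),
    ]

    model = Doc2Vec(vector_size=20, min_count=1, epochs=5)
    model.build_vocab(corpus)

    # no training yet, but word-vectors and doc-vectors are already
    # allocated and randomly initialized
    print(model.wv['the'].shape)  # (20,)
    print(len(model.dv))          # 3, one vector per document tag
    ```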

    Also, after build_vocab(), the vocabulary is frozen: any words presented during training (or later inference) that aren't already in the model will be ignored.
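
    The frozen-vocabulary behavior is easy to demonstrate at inference time: unknown words are silently skipped rather than raising an error. A minimal sketch with a hypothetical toy corpus and an out-of-vocabulary word:

    ```python
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # hypothetical toy corpus, for illustration only
    corpus = [
        TaggedDocument(words=['the', 'cat', 'sat'], tags=['doc0']),
        TaggedDocument(words=['the', 'dog', 'sat'], tags=['doc1']),
        TaggedDocument(words=['the', 'cat', 'ran'], tags=['doc2']),
    ]

    model = Doc2Vec(vector_size=20, min_count=1, epochs=5)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

    # 'zebra' was never in the vocabulary: infer_vector() ignores it
    # and infers from the known words only
    vec = model.infer_vector(['the', 'cat', 'zebra'])
    print(vec.shape)  # (20,)
    ```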