I have recently started studying Doc2Vec model.
I have understood its mechanism and how it works.
I'm trying to implement it using gensim framework.
I have transformed my training data into `TaggedDocument` objects.
But I have one question: what is the role of this line?

`model_dbow.build_vocab([x for x in tqdm(train_tagged.values)])`
Is it to create random vectors that represent the text?

Thank you for your help.
The `Doc2Vec` model needs to know several things about the training corpus before it is fully allocated & initialized.

First & foremost, the model needs to know the words present & their frequencies – a working vocabulary – so that it can determine which words will remain after the `min_count` floor is applied, and allocate/initialize word-vectors & internal model structures for the relevant words. The word frequencies will also be used to influence the random sampling of negative-word examples (for the default negative-sampling mode) and the downsampling of very-frequent words (per the `sample` parameter).
Additionally, the model needs to know the rough size of the overall training set, in order to gradually decrement the internal `alpha` learning-rate over the course of each epoch, and to give meaningful progress estimates in logging output.
At the end of `build_vocab()`, all memory/objects needed for the model have been created. Per the needs of the underlying algorithm, all vectors will have been initialized to low-magnitude random vectors, readying the model for training. (The model essentially won't use any more memory, internally, through training.)
Also, after `build_vocab()`, the vocabulary is frozen: any words presented during training (or later inference) that aren't already in the model will be ignored.