When reading gensim's Doc2Vec documentation, I got a bit confused about some options. For example, the constructor of Doc2Vec has a parameter `iter`:
iter (int) – Number of iterations (epochs) over the corpus.
Why does the `train()` method then also have a similar parameter called `epochs`?
epochs (int) – Number of iterations (epochs) over the corpus.
What is the difference between both? There's one more paragraph on it in the docs:
To avoid common mistakes around the model’s ability to do multiple training passes itself, an explicit epochs argument MUST be provided. In the common and recommended case, where train() is only called once, the model’s cached iter value should be supplied as epochs value.
But I do not really understand why the constructor needs an `iter` parameter and what exactly should be provided for it.
EDIT:
I just saw that it is also possible to specify the corpus directly in the constructor rather than calling `train()` separately. So I think in that case `iter` would be used, and otherwise `epochs`. Is that correct?
If so, what is the difference between specifying the corpus in the constructor and calling `train()` manually? Why would one choose one over the other?
EDIT 2:
Although not mentioned in the docs, `iter` is now deprecated as a parameter of Doc2Vec. It was renamed to `epochs` to be consistent with the parameter of `train()`. Training seems to work with that, although I struggle with MemoryErrors.
The parameter in the constructor was originally called `iter`, and when doing everything via a single constructor call – supplying the corpus in the constructor – that value would just be used as the number of training passes.

When the parameters to `train()` were expanded and made mandatory to avoid common errors, the term `epochs` was chosen as more descriptive, and distinct from the `iter` value.
When you specify the corpus in the constructor, `build_vocab()` and `train()` will be called for you automatically, as part of the construction. For most simple uses, that's fine.

But, if you let those automatic calls happen, you lose the chance to time the steps separately, to tamper with the results of the vocabulary step before starting training, or to call `train()` multiple times (which is usually a bad idea unless you're sure you know what you're doing).