I am using gensim to train a word2vec model. I am passing one sentence at a time from my corpus to gensim.models.Word2Vec() to gradually train the model on the whole corpus. But I am confused about what the value of the iter parameter should be, as I'm not sure whether it iterates over the passed sentence n times or over the whole corpus.
I have checked the gensim documentation. It defines the parameter as follows:
iter (int, optional) – Number of iterations (epochs) over the corpus.
But I am still confused, because I am not passing the whole corpus, only a single sentence on each iteration.
The line in my code that trains the model looks like this:
model = gensim.models.Word2Vec(data, min_count=2, window=arg.window_size, size=arg.dim_size, workers=4, sg=0, hs=0, negative=10, ns_exponent=0.75, alpha=0.025, iter=1)
Here "data" represents a single sentence passed at a time from a generator.
Suppose I have a corpus of 2 sentences: "X is a variable. Y is a variable too." The model receives data = "X is a variable." first and data = "Y is a variable too." in the 2nd iteration. To clarify, my question is whether iter=50 will train my model by iterating through "X is a variable." 50 times and "Y is a variable too." 50 times, or by iterating through "X is a variable. Y is a variable too." (my whole corpus) 50 times.
Word2Vec is a class. Calling it as model = Word2Vec(...) returns one new model instance.
If you supply data to that instantiation call, it expects a full training corpus, with all examples, as the data (the sentences parameter). It will iterate over that data once to learn the vocabulary, then again the number of times specified in the epochs argument for training. (This argument was previously called iter, which still works.)
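For example, here is a minimal sketch of that single-call pattern, assuming gensim 4.x parameter names (vector_size and epochs; in gensim 3.x these were size and iter):

    from gensim.models import Word2Vec

    # The whole corpus: a re-iterable sequence of tokenized sentences.
    corpus = [
        ["x", "is", "a", "variable"],
        ["y", "is", "a", "variable", "too"],
    ]

    # One call does everything: one pass to build the vocabulary, then
    # `epochs` training passes over the *entire* corpus (not per sentence).
    model = Word2Vec(corpus, min_count=1, vector_size=100, epochs=50)

    print(model.wv["variable"])  # trained vector for "variable"

This answers the question directly: iter/epochs counts passes over the whole corpus, not repetitions of a single sentence.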
So:

- You should not call Word2Vec(...) multiple times with single texts. You should call it once, with a re-iterable sequence of all your texts as the data (see the sketch after this list).
- Your corpus will be iterated over epochs + 1 times as part of the model's initialization & training, via that single call to Word2Vec(...).
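To make "re-iterable" concrete, here is a hedged sketch (the file name my_corpus.txt and the class name MyCorpus are placeholders): a plain generator is exhausted after one pass, so gensim could not do its vocabulary scan plus multiple training passes over it, whereas a class whose __iter__ restarts from the source can be iterated any number of times:

    from gensim.models import Word2Vec

    class MyCorpus:
        """Re-iterable: every call to __iter__ re-reads the file from the start."""
        def __init__(self, path):
            self.path = path

        def __iter__(self):
            with open(self.path, encoding="utf-8") as f:
                for line in f:
                    yield line.split()  # one pre-tokenized sentence per line

    # Works: gensim iterates once for the vocabulary scan,
    # then `epochs` more times for training (gensim 4.x names).
    model = Word2Vec(MyCorpus("my_corpus.txt"), min_count=2, epochs=50)

    # By contrast, a bare generator like
    #     (line.split() for line in open("my_corpus.txt"))
    # would be empty after its first traversal.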
You should enable logging at the INFO level to get a better idea of what's happening when you try different approaches.
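For instance, the standard-library logging module is enough to surface gensim's progress messages:

    import logging

    # Show gensim's INFO-level reports (vocabulary scan, training progress, etc.)
    logging.basicConfig(
        format="%(asctime)s : %(levelname)s : %(message)s",
        level=logging.INFO,
    )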
You should also look at working examples, like the word2vec.ipynb notebook bundled with gensim inside its docs/notebooks directory, to understand usual usage patterns. (This is best viewed, and interactively run, from your local installation, but it can also be browsed online at https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb.)
Note that you can avoid supplying any data to the Word2Vec(...) instantiation call, but then you need to call model.build_vocab(full_corpus) and then model.train(full_corpus, epochs=desired_iterations) to complete the model's initialization & training. (While you can then continue calling train() with fragments of training data, that's an advanced & highly error-prone approach. Calling it just once, with one combined full training set, will easily and automatically do the right thing with the learning-rate decay and the number of training iterations.)
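A sketch of that two-step path, again with gensim 4.x names (note that train() also needs an explicit total_examples, which the vocabulary scan records as model.corpus_count):

    from gensim.models import Word2Vec

    corpus = [
        ["x", "is", "a", "variable"],
        ["y", "is", "a", "variable", "too"],
    ]

    # No data at instantiation: the model is configured but untrained.
    model = Word2Vec(min_count=1, vector_size=100)

    # One pass over the full corpus to learn the vocabulary.
    model.build_vocab(corpus)

    # Then the training passes; total_examples comes from the vocab scan.
    model.train(corpus, total_examples=model.corpus_count, epochs=50)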