Tags: python, gensim, word2vec

Does the "iter" parameter of gensim.models.Word2Vec method iterate over the whole corpus or the sentence passed to it at a time?


I am using gensim to train a Word2Vec model. I am passing one sentence at a time from my corpus to gensim.models.Word2Vec() to gradually train the model on my whole corpus. But I am confused about what the value of the iter parameter should be, as I'm not sure whether it iterates over the passed sentence n times or over the whole corpus.

I have checked the gensim documentation. It defines the parameter as follows:

iter (int, optional) – Number of iterations (epochs) over the corpus.

But I am confused, as I am not passing the whole corpus but only a single sentence on each call.

My line in the code that trains the model looks like this:
model = gensim.models.Word2Vec(data, min_count=2, window=arg.window_size, size=arg.dim_size, workers=4, sg=0, hs=0, negative=10, ns_exponent=0.75, alpha=0.025, iter=1)
Here "data" represents a single sentence passed at a time from a generator.

Suppose I have a corpus of two sentences: "X is a variable. Y is a variable too.". The model receives data = "X is a variable." first and data = "Y is a variable too." on the second call. To clarify, my question is: will iter = 50 train my model by iterating over "X is a variable." 50 times and "Y is a variable too." 50 times, or by iterating over "X is a variable. Y is a variable too." (my whole corpus) 50 times?


Solution

  • Word2Vec is a class. Calling it as model = Word2Vec(...) returns one new model instance.

    If you supply data to that instantiation call, it expects a full training corpus, with all examples, as the data (sentences parameter). It will iterate over that data once to learn the vocabulary, then again the number of times specified in the epochs argument for training. (This argument was previously called iter, which still works.)

    So:

    • You shouldn't be calling Word2Vec(...) multiple times with single texts. You should call it once, with a re-iterable sequence of all your texts as the data.
    • That full supplied corpus will be iterated over epochs + 1 times as part of the model's initialization & training, via the single call to Word2Vec(...).
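    You can see the "epochs + 1" behavior with a small re-iterable corpus that counts its own passes. This is a sketch: the CountingCorpus class and its sentences are illustrative, and the gensim call is shown commented out using the 3.x parameter names from the question (in gensim 4.x, size is vector_size and iter is epochs):

    ```python
    class CountingCorpus:
        """A re-iterable corpus: each call to __iter__ starts a fresh
        pass over *all* sentences and records that a full pass began."""
        def __init__(self, sentences):
            self.sentences = sentences
            self.passes = 0

        def __iter__(self):
            self.passes += 1
            for sentence in self.sentences:
                yield sentence.split()

    corpus = CountingCorpus(["X is a variable .", "Y is a variable too ."])

    # Passing this corpus once to Word2Vec with iter=50 would leave
    # corpus.passes at 51: one vocabulary-discovery pass plus 50
    # training epochs over the *whole* corpus, e.g.:
    # model = gensim.models.Word2Vec(corpus, min_count=1, size=100, iter=50)

    for _ in corpus:  # simulate one full pass (like the vocabulary scan)
        pass
    assert corpus.passes == 1
    ```
    
    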

    You should enable logging at the INFO level to get a better idea of what's happening when you try different approaches.
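    Enabling that logging is two lines (the format string is just one common choice):

    ```python
    import logging

    # Gensim reports vocabulary scans and each training epoch at INFO
    # level, so you can count exactly how many times the corpus is read.
    logging.basicConfig(
        format="%(asctime)s : %(levelname)s : %(message)s",
        level=logging.INFO,
    )
    ```
    
    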

    You should also look at working examples, like the word2vec.ipynb notebook bundled with gensim inside its docs/notebooks directory, to understand usual usage patterns. (This is best viewed, and interactively run, from your local installation – but can also be browsed online at https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb.)

    Note that you can avoid supplying any data to the Word2Vec(...) instantiation call, but then you need to call model.build_vocab(full_corpus) and then model.train(full_corpus, epochs=desired_iterations) to complete the model initialization & training. (While you can then continue calling train() with fragments of training data, that's an advanced & highly error-prone approach. Only calling it just once, with one combined full training set, will easily and automatically do the right thing with the training learning-rate decay and number of training iterations.)
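    A sketch of that two-step approach, with the gensim calls shown commented out (gensim 3.x parameter names assumed; in 4.x, size is vector_size):

    ```python
    # The full corpus: a plain list of token lists is the simplest
    # re-iterable sequence gensim accepts.
    full_corpus = [
        ["X", "is", "a", "variable", "."],
        ["Y", "is", "a", "variable", "too", "."],
    ]

    # model = gensim.models.Word2Vec(min_count=1, size=100)  # no data yet
    # model.build_vocab(full_corpus)            # one pass: learn the vocab
    # model.train(full_corpus,
    #             total_examples=model.corpus_count,
    #             epochs=50)                    # then 50 training passes
    ```

    Because train() here receives the epochs argument and the corpus size explicitly, the learning-rate decay is managed over the full run, which is what goes wrong when train() is called repeatedly with fragments.
    
    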