Inner workings of Gensim Word2Vec


I have a couple of questions about Gensim's Word2Vec model.

The first: what happens if I set it to train for 0 epochs? Does it just create the random vectors and call it done? So they would be random every time, correct?

The second concerns the wv object, about which the documentation page says:

This object essentially contains the mapping between words and embeddings.
After training, it can be used directly to query those embeddings in various ways.  
See the module level docstring for examples.

But that is not clear to me. To explain: I have my own word vectors, which I have substituted into the model as follows:

   word2vecObject.wv['word'] = my_own

I then call the train method with those replacement word vectors in place. But I would like to know which part I am replacing: is it the input-to-hidden weight layer or the hidden-to-output one? I want to check whether this can be called pre-training or not. Any help? Thank you.


Solution

  • I've not tried the nonsense parameter epochs=0, but it might behave as you expect. (Have you tried it and seen otherwise?)

    However, if your real goal is to be able to tamper with the model after initialization, but before training, the usual way to do that is to not supply any corpus when constructing the model instance, and instead manually do the two followup steps, .build_vocab() & .train(), in your own code - inserting extra steps between the two. (For even finer-grained control, you can examine the source of .build_vocab() & its helper methods, and simply ensure you do all those necessary things, with your own extra steps interleaved.)

    The "word vectors" in the .wv property of type KeyedVectors are essentially the "input projection layer" of the model: the data which converts a single word into a vector_size-dimensional dense embedding. (You can think of the keys – word token strings – as being somewhat like a one-hot word-encoding.)

    So, assigning into that structure only changes that "input projection vector", which is the "word vector" usually collected from the model. If you need to tamper with the hidden-to-output weights, you need to look at the model's .syn1neg property (used in negative-sampling mode) or .syn1 property (used in hierarchical-softmax, HS, mode).