Search code examples
nlpword2vecword-embedding

Finding both target and center word2vec matrices


I've read and heard(In the CS224 of Stanford) that the Word2Vec algorithm actually trains two matrices(that is, two sets of vectors.) These two are the U and the V set, one for words being a target and one for words being the context. The final output is the average of these two. I have two questions in mind. one is that:

  • Why do we get an average of two vectors? Why it makes sense? Don't we lose some information?

  • The second question is, using pre-trained word2vec models, how can I get access to both matrices? Is there any downloadable word2vec with both sets of vectors? I don't have enough resources to train a new one.

Thanks


Solution

  • That relayed description isn't quite right. The word-vectors traditionally retrieved from a word2vec model come from a "projection matrix" which converts individual words to a right-sized input-vector for the shallow neural network.

    (You could think of the projection matrix as turning a one-hot encoding into a dense-embedding for that word, but libraries typically implement this via a dictionary-lookup – eg: "what row of the vectors-matrix should I consult for this word-token?")

    There's another matrix of weights leading to the model's output nodes, whose interpretation varies based on the training mode. In the common default of negative-sampling, there's one node per known word, so you could also interpret this matrix as having a vector per word. (In hierarchical-softmax mode, the known-words aren't encoded as single output nodes, so it's harder to interpret the relationship of this matrix to individual words.)

    However, this second vector per word is rarely made directly available by libraries. Most commonly, the word-vector is considered simply the trained-up input vector, from the projection matrix. For example, the export format from Google's original word2vec.c release only saves-out those vectors, and the large "GoogleNews" vector set they released only has those vectors. (There's no averaging with the other output-side representation.)

    Some work, especially that of Mitra et all of Microsoft Research (in "Dual Embedding Space Models" & associated writeups) has noted those output-side vectors may be of value in some applications as well – but I haven't seen much other work using those vectors. (And, even in that work, they're not averaged with the traditional vectors, but consulted as a separate option for some purposes.)

    You'd have to look at the code of whichever libraries you're using to see if you can fetch these from their full post-training model representation. In the Python gensim library, this second matrix in the negative-sampling case is a model property named syn1neg, following the naming of the original word2vec.c.