I have a two-fold question about the Skip-Gram model in Word2Vec:
The first part is about structure: as far as I understand it, the Skip-Gram model is based on one neural network with one input weight matrix W, one hidden layer of size N, and C output weight matrices W' each used to produce one of the C output vectors. Is this correct?
The second part is about the output vectors: as far as I understand it, each output vector is of size V and is a result of a Softmax function. Each output vector node corresponds to the index of a word in the vocabulary, and the value of each node is the probability that the corresponding word occurs at that context location (for a given input word). The target output vectors are not, however, one-hot encoded, even if the training instances are. Is this correct?
The way I imagine it is something along the following lines (made-up example):
Assume the vocabulary ['quick', 'fox', 'jumped', 'lazy', 'dog'] and a context of C=1, and suppose that for the input word 'jumped' the two output vectors look like this:
[0.2 0.6 0.01 0.1 0.09]
[0.2 0.2 0.01 0.16 0.43]
I would interpret this as 'fox' being the most likely word to show up before 'jumped' (p=0.6), and 'dog' being the most likely to show up after it (p=0.43).
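In code, that reading amounts to taking the highest-probability entry of each vector. This is only a minimal sketch of my interpretation; the names `before`, `after` and `most_likely` are made up for illustration:

```python
vocab = ['quick', 'fox', 'jumped', 'lazy', 'dog']

# The two made-up output vectors for the input word 'jumped'.
before = [0.2, 0.6, 0.01, 0.1, 0.09]   # context position before 'jumped'
after = [0.2, 0.2, 0.01, 0.16, 0.43]   # context position after 'jumped'

def most_likely(probs, vocab):
    """Return the (word, probability) pair with the highest probability."""
    idx = max(range(len(probs)), key=lambda i: probs[i])
    return vocab[idx], probs[idx]

print(most_likely(before, vocab))  # ('fox', 0.6)
print(most_likely(after, vocab))   # ('dog', 0.43)
```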
Do I have this right? Or am I completely off?
Your understanding of both parts seems to be correct, according to this paper :
The paper explains word2vec in detail while keeping it very simple; it's worth a read for a thorough understanding of the neural net architecture used in word2vec.
Referring to your example, with C=1 and a vocabulary of ['quick', 'fox', 'jumped', 'lazy', 'dog']: if the output from the skip-gram model is [0.2 0.6 0.01 0.1 0.09] and the correct target word is 'fox', then the error is calculated as:
[0 1 0 0 0] - [0.2 0.6 0.01 0.1 0.09] = [-0.2 0.4 -0.01 -0.1 -0.09]
and the weight matrices are updated to minimize this error.
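
The error calculation above, and one possible update step, can be sketched in plain Python. This is an illustrative sketch only: it assumes a full softmax with cross-entropy loss (no negative sampling or hierarchical softmax), and the hidden vector `h`, the zero-initialised output matrix `W_out`, and the learning rate `lr` are hypothetical values chosen for the demonstration, not taken from the example.

```python
import math

vocab = ['quick', 'fox', 'jumped', 'lazy', 'dog']

def one_hot(word):
    """One-hot target vector over the vocabulary."""
    return [1.0 if w == word else 0.0 for w in vocab]

def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# 1) The error from the example: target minus predicted probabilities.
target = one_hot('fox')
output = [0.2, 0.6, 0.01, 0.1, 0.09]
error = [t - y for t, y in zip(target, output)]
print([round(e, 2) for e in error])  # [-0.2, 0.4, -0.01, -0.1, -0.09]

# 2) One update step on a toy output matrix (all values below are assumed).
h = [0.5, -0.3, 0.8]                      # hypothetical hidden vector, N = 3
W_out = [[0.0] * len(vocab) for _ in h]   # N x V, zero-initialised for clarity
lr = 0.1                                  # hypothetical learning rate

def predict(W):
    """Scores are h . W[:, j]; softmax turns them into probabilities."""
    scores = [sum(h[i] * W[i][j] for i in range(len(h)))
              for j in range(len(vocab))]
    return softmax(scores)

p_before = predict(W_out)
err = [t - y for t, y in zip(target, p_before)]
for i in range(len(h)):
    for j in range(len(vocab)):
        # Moving along (target - output) lowers the cross-entropy loss.
        W_out[i][j] += lr * err[j] * h[i]
p_after = predict(W_out)

fox = vocab.index('fox')
print(p_after[fox] > p_before[fox])  # True
```

After the step, the probability assigned to 'fox' rises, which is what "updated to minimize this error" means in practice. In a full model the same error is also backpropagated to update the input weight matrix; that part is omitted here for brevity.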