
How does Word2Vec (CBOW and Skip-gram) create the training data?


Here is my interpretation of how CBOW and Skip-gram create the training data (the crossed-out text indicates that the training example was already added to the training dataset):

CBOW: [images: "CBOW training data" and "CBOW neural network"]

Skip-gram: [images: "Skip-gram training data" and "Skip-gram neural network"]

The above interpretation cannot be correct. If it were correct, it would mean that we end up with the exact same training dataset for both CBOW and Skip-gram. Since the neural network architecture is the same for both, this would result in the exact same word vectors, which is clearly not the case in reality.

Although there are several resources that describe Word2Vec (CBOW and Skip-gram) briefly, I found it challenging to locate detailed explanations that provide a clear view of what is actually fed to the neural network.


Solution

  • When you need details, it's best to go to the source-code of long-used, well-debugged implementations, rather than descriptions/diagrams from sources of unclear competence.

    I don't know where your slides come from, but the "CBOW training data" examples of training micro-examples seem somewhat confused. For example, the line that reads…

    UVT is great! → (UVT, is) and (great, is)

    …would more realistically be shown as…

    UVT is great! → (mean(UVT, great), is)

    That is, in CBOW, all words from the context/input window get combined – usually by average, though early implementations also offered a 'sum' option – before being used to predict the center/output word.

    So, a more-accurate description of the training-examples created by CBOW from the text ['UVT', 'is', 'great'] would be:

    UVT is great! → (mean(is,), UVT)

    UVT is great! → (mean(UVT, great), is)

    UVT is great! → (mean(is,), great)

    That is, in CBOW, one pass over a 3-word text generates exactly, and only, 3 micro-examples for the internal neural-network: one for each target word.

    But also, a pass over the next 3-word text (['UniBuc', 'is', 'great']) will create exactly 3 more micro-examples for the internal neural-network, including the repeated (mean(is,), great) example, not just the 2 examples shown on that slide (see the pair-generation sketch at the end of this answer).

    That is: usual word2vec implementations do not de-duplicate micro-examples the way the crossing-out in your clips implies; the repetition of frequent context->target inputs has an important weighting influence. (Though, the sample parameter also controls a probabilistic dropping of very-frequent words, to prevent their influence from overpowering the larger number of less-frequent words; that downsampling is also sketched at the end of this answer.)

    Another reason the examples in the slides you've clipped are suboptimal is that, with a mere 3-word text and a mere 1-word window, there's not much contrast to show the effects of window size, extent-of-text, etc.

    Reading actual implementation source code, or top answers here, is likely to provide a better understanding than that source.
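
    To make the above concrete, here is a minimal, illustrative Python sketch of that micro-example generation. It is not the actual word2vec.c or gensim code: it assumes a fixed window (real implementations randomly shrink the effective window at each position), skips frequent-word downsampling, and keeps every repeated example, exactly as described above.

        # Illustrative sketch only: fixed window, no dynamic window shrinking,
        # no frequent-word downsampling, and repeated examples are kept.

        def cbow_examples(tokens, window=1):
            """One (context-words, target) micro-example per position; the context
            words are later averaged (or summed) into a single input vector."""
            examples = []
            for i, target in enumerate(tokens):
                context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                if context:
                    examples.append((context, target))   # repeats are kept as-is
            return examples

        def skipgram_examples(tokens, window=1):
            """One (center, context-word) micro-example per center/context pairing."""
            examples = []
            for i, center in enumerate(tokens):
                context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                for ctx in context:
                    examples.append((center, ctx))       # repeats are kept as-is
            return examples

        for text in (['UVT', 'is', 'great'], ['UniBuc', 'is', 'great']):
            print(cbow_examples(text))      # first text: [(['is'], 'UVT'), (['UVT', 'great'], 'is'), (['is'], 'great')]
            print(skipgram_examples(text))  # first text: [('UVT', 'is'), ('is', 'UVT'), ('is', 'great'), ('great', 'is')]

    Note that the two modes generate differently-shaped micro-examples from the same text, which is why they do not learn identical vectors even though they train on the same corpus with a similar shallow network.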
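
    And, as a rough sketch of the sample parameter mentioned above (this mirrors the keep-probability used in the original word2vec.c, but is simplified and may differ in detail from any particular implementation):

        import random
        from math import sqrt

        def keep_probability(word_count, total_words, sample=1e-3):
            # Approximate keep-probability for frequent-word downsampling: words
            # whose corpus frequency is sufficiently above 'sample' are kept with
            # probability < 1, so some of their micro-examples are dropped.
            if sample <= 0:
                return 1.0
            freq = word_count / total_words
            p = (sqrt(freq / sample) + 1) * (sample / freq)
            return min(p, 1.0)

        def downsample(tokens, counts, total_words, sample=1e-3):
            # Drop very-frequent words from a text before the windowing above;
            # in this sketch, dropped positions let surviving words' windows
            # reach over them, effectively widening their context.
            return [t for t in tokens
                    if random.random() < keep_probability(counts[t], total_words, sample)]

    With typical small sample values, only words whose corpus frequency is well above sample are ever dropped; rarer words' micro-examples are never discarded.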