While going through the TensorFlow word2vec tutorial, I had a hard time following the tutorial's explanation of the placeholders that hold the inputs to the skip-gram model. The explanation states:
The skip-gram model takes two inputs. One is a batch full of integers representing the source context words, the other is for the target words... Now what we need to do is look up the vector for each of the source words in the batch... Now that we have the embeddings for each word, we'd like to try to predict the target word.
However, since we are using the skip-gram model (as opposed to CBOW), shouldn't we instead have to look up the word vector for each of the target words, and then predict the context words given the target words?
In addition, I'm assuming that the code below first declares a placeholder for the target words (inputs), and then one for the source context words (our labels).
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
Am I misunderstanding the tutorial?
The skip-gram tutorial assumes your dataset is built out of pairs like these:
(quick, the), (quick, brown), (brown, quick), (brown, fox), ...
where each pair is to be read as (input, output) = (center_word, context_word).
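As a rough sketch of how such pairs arise (this is not the tutorial's generate_batch function, which works on integer word ids and samples within the window; the helper name here is made up for illustration), you slide a window over the text and emit one pair per neighbor of each center word:

# Hypothetical helper, for illustration only: build (center, context) pairs
# with a symmetric window of the given size.
def skipgram_pairs(words, window=1):
    pairs = []
    for i, center in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

sentence = "the quick brown fox jumped over the lazy dog".split()
print(skipgram_pairs(sentence, window=1)[1:5])
# [('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'), ('brown', 'fox')]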
In fact, if you average over many such (input, output) pairs, you obtain behavior similar to predicting all of the context words for each example.
This choice is also justified by the use of NCE as the loss function: NCE tries to distinguish a single target word (one of the context words) from a set of randomly sampled noise words.
Conceptually, your inputs and labels placeholders carry the same kind of data (one word id per example), so you might expect both to have shape (batch_size, 1). The inputs, however, are simply (batch_size), because the embedding lookup works on a 1-D tensor of ids and adds the embedding dimension itself, while the loss function (where you provide the labels) needs a matrix as input.
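To make the shapes concrete, here is a minimal sketch of how those pieces fit together, closely following the tutorial's variable names (the sizes are illustrative, not prescribed):

import tensorflow as tf  # TF 1.x API, as in the tutorial

batch_size = 128
vocabulary_size = 50000
embedding_size = 128
num_sampled = 64  # number of negative (noise) words drawn per example

train_inputs = tf.placeholder(tf.int32, shape=[batch_size])     # 1-D: one center-word id per example
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])  # 2-D: nce_loss expects labels as a matrix

embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
# embedding_lookup adds the embedding dimension itself:
# (batch_size,) ids -> (batch_size, embedding_size) vectors
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

nce_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / embedding_size ** 0.5))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

# For each example, NCE contrasts the single true context word in
# train_labels against num_sampled randomly sampled noise words.
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_labels,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=vocabulary_size))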
So the tutorial is not a strict implementation of Mikolov's skip-gram model; it makes some approximations for the sake of code simplicity and readability.