I have a question about the shape of the weights for the softmax layer.
Suppose our vocabulary is 10000 words and our embedding layer reduces the dimensionality to 300.
So the input is a one-hot vector of length 10000 and the embedding layer has 300 neurons. This means the weight matrix from the input layer to the embedding layer has shape 10000 * 300 (number of words in the vocabulary * number of neurons in the embedding layer).
According to this tutorial (https://www.kaggle.com/christofer/word2vec-skipgram-model-with-tensorflow) and many others, the next weight matrix (the one that connects the embedding layer to the softmax classifier) has the same shape (number of words in the vocabulary * number of neurons in the embedding layer, or in our case 10000 * 300). I don't understand why. Shouldn't it be 300 * 10000, because we have to predict a probability for each of the 10000 classes?
Can you explain this to me?
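For reference, with a plain (non-sampled) softmax the intuition in the question is right: a dense output layer that produces one logit per word has a kernel of shape [300, 10000]. A minimal sketch, using only the shapes from the question (the layer and dummy input are illustrative):

```python
import tensorflow as tf

vocab_size, embed_dim = 10000, 300

# Full softmax output layer: one logit per vocabulary word,
# so the kernel has shape [embed_dim, vocab_size] = [300, 10000].
output_layer = tf.keras.layers.Dense(vocab_size)

hidden = tf.random.normal([1, embed_dim])   # stand-in embedding of one center word
logits = output_layer(hidden)               # shape [1, 10000]
print(output_layer.kernel.shape)            # (300, 10000)
```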
It's because of the tf.nn.sampled_softmax_loss function. This function is designed so that the weight matrix must have the shape [vocabulary_size, dim].
From the documentation:
weights: A Tensor of shape [num_classes, dim], or a list of Tensor objects whose concatenation along dimension 0 has shape [num_classes, dim]. The (possibly-sharded) class embeddings.
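Here is a minimal sketch of how those shapes fit together in a skip-gram setup with the numbers from the question (TF 2.x style; batch_size, num_sampled, and the random dummy batch are assumptions for illustration):

```python
import tensorflow as tf

vocab_size = 10000   # number of words in the vocabulary (from the question)
embed_dim = 300      # embedding dimensionality (from the question)
batch_size = 128     # hypothetical batch size
num_sampled = 64     # hypothetical number of sampled (negative) classes

# Embedding (input -> hidden) weights: [vocab_size, embed_dim]
embeddings = tf.Variable(tf.random.uniform([vocab_size, embed_dim], -1.0, 1.0))

# Output ("softmax") weights and biases. Note the weights are
# [vocab_size, embed_dim], NOT [embed_dim, vocab_size], because
# sampled_softmax_loss expects [num_classes, dim].
softmax_weights = tf.Variable(
    tf.random.truncated_normal([vocab_size, embed_dim], stddev=0.1))
softmax_biases = tf.Variable(tf.zeros([vocab_size]))

# Dummy batch: center-word ids and their context-word (target) ids.
center_ids = tf.random.uniform([batch_size], maxval=vocab_size, dtype=tf.int64)
target_ids = tf.random.uniform([batch_size, 1], maxval=vocab_size, dtype=tf.int64)

# Hidden layer: look up embeddings of the center words -> [batch_size, embed_dim]
embed = tf.nn.embedding_lookup(embeddings, center_ids)

# Sampled softmax loss; labels must have shape [batch_size, num_true].
loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(
        weights=softmax_weights,     # [num_classes, dim]
        biases=softmax_biases,       # [num_classes]
        labels=target_ids,           # [batch_size, 1]
        inputs=embed,                # [batch_size, dim]
        num_sampled=num_sampled,
        num_classes=vocab_size))
```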
The way sampled_softmax_loss works is by sampling the weights belonging to a subset of the output nodes and optimizing only those on each iteration (i.e. it does not run the optimization over the weights of all output nodes). This selection is done with embedding_lookup, so storing the weights in the shape [vocab_size, dim] makes them ideal for this purpose.
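To make that concrete, here is a rough sketch of the row-gathering idea: because the output weights are stored as [vocab_size, dim], picking out the true class plus a few sampled classes is just a lookup along dimension 0 (the class ids and the hidden vector below are illustrative, not taken from the source):

```python
import tensorflow as tf

vocab_size, embed_dim = 10000, 300
softmax_weights = tf.Variable(
    tf.random.truncated_normal([vocab_size, embed_dim], stddev=0.1))
softmax_biases = tf.Variable(tf.zeros([vocab_size]))

# True class plus a handful of sampled (negative) classes for one example.
sampled_ids = tf.constant([42, 7, 999, 3150, 8021], dtype=tf.int64)

# Row lookup: a gather along dimension 0 of the [vocab_size, dim] matrix.
sampled_w = tf.nn.embedding_lookup(softmax_weights, sampled_ids)  # [5, embed_dim]
sampled_b = tf.nn.embedding_lookup(softmax_biases, sampled_ids)   # [5]

# Logits are then computed only for these few classes, not all 10000.
hidden = tf.random.normal([1, embed_dim])                             # stand-in hidden vector
logits = tf.matmul(hidden, sampled_w, transpose_b=True) + sampled_b   # [1, 5]
```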