I have a question about the shape of the weights for the softmax layer.
Suppose our vocabulary is 10000 words and our embedding layer reduces the dimensionality to 300.
So the input is a one-hot vector of length 10000 and the embedding layer has 300 neurons. This means the weight matrix from the input layer to the embedding layer has shape 10000 * 300 (number of words in the vocabulary * number of neurons in the embedding layer).
According to this tutorial (https://www.kaggle.com/christofer/word2vec-skipgram-model-with-tensorflow) and many others, the next weight matrix (the one that connects the embedding layer to the softmax classifier) has the same shape (number of words in the vocabulary * number of neurons in the embedding layer, or in our case 10000 * 300). I don't understand why. Shouldn't it be 300 * 10000, because we have to predict a probability for each of the 10000 classes?
Can you explain this to me?
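For reference, with a plain (non-sampled) softmax the intuition in the question is right: a dense output layer that produces one logit per word has a kernel of shape [300, 10000]. A minimal sketch, using only the shapes from the question (the layer and dummy input are illustrative):

```python
import tensorflow as tf

vocab_size, embed_dim = 10000, 300

# Full softmax output layer: one logit per vocabulary word,
# so the kernel has shape [embed_dim, vocab_size] = [300, 10000].
output_layer = tf.keras.layers.Dense(vocab_size)

hidden = tf.random.normal([1, embed_dim])   # stand-in embedding of one center word
logits = output_layer(hidden)               # shape [1, 10000]
print(output_layer.kernel.shape)            # (300, 10000)
```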
It's because of the tf.nn.sampled_softmax_loss function. This function is designed so that the weight matrix must have the shape [vocabulary_size, dim].
From the documentation:
weights: A Tensor of shape [num_classes, dim], or a list of Tensor objects whose concatenation along dimension 0 has shape [num_classes, dim]. The (possibly-sharded) class embeddings.
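Here is a minimal sketch of how those shapes fit together in a skip-gram setup with the numbers from the question (TF 2.x style; batch_size, num_sampled, and the random dummy batch are assumptions for illustration):

```python
import tensorflow as tf

vocab_size = 10000   # number of words in the vocabulary (from the question)
embed_dim = 300      # embedding dimensionality (from the question)
batch_size = 128     # hypothetical batch size
num_sampled = 64     # hypothetical number of sampled (negative) classes

# Embedding (input -> hidden) weights: [vocab_size, embed_dim]
embeddings = tf.Variable(tf.random.uniform([vocab_size, embed_dim], -1.0, 1.0))

# Output ("softmax") weights and biases. Note the weights are
# [vocab_size, embed_dim], NOT [embed_dim, vocab_size], because
# sampled_softmax_loss expects [num_classes, dim].
softmax_weights = tf.Variable(
    tf.random.truncated_normal([vocab_size, embed_dim], stddev=0.1))
softmax_biases = tf.Variable(tf.zeros([vocab_size]))

# Dummy batch: center-word ids and their context-word (target) ids.
center_ids = tf.random.uniform([batch_size], maxval=vocab_size, dtype=tf.int64)
target_ids = tf.random.uniform([batch_size, 1], maxval=vocab_size, dtype=tf.int64)

# Hidden layer: look up embeddings of the center words -> [batch_size, embed_dim]
embed = tf.nn.embedding_lookup(embeddings, center_ids)

# Sampled softmax loss; labels must have shape [batch_size, num_true].
loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(
        weights=softmax_weights,     # [num_classes, dim]
        biases=softmax_biases,       # [num_classes]
        labels=target_ids,           # [batch_size, 1]
        inputs=embed,                # [batch_size, dim]
        num_sampled=num_sampled,
        num_classes=vocab_size))
```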
The way sampled_softmax_loss works is by sampling the weights belonging to a subset of the output nodes and optimizing only those on each iteration (i.e. it does not run the optimization over the weights of all output nodes). This selection is done with embedding_lookup, so storing the weights in the shape [vocab_size, dim] makes them ideal for this purpose.
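To make that concrete, here is a rough sketch of the row-gathering idea: because the output weights are stored as [vocab_size, dim], picking out the true class plus a few sampled classes is just a lookup along dimension 0 (the class ids and the hidden vector below are illustrative, not taken from the source):

```python
import tensorflow as tf

vocab_size, embed_dim = 10000, 300
softmax_weights = tf.Variable(
    tf.random.truncated_normal([vocab_size, embed_dim], stddev=0.1))
softmax_biases = tf.Variable(tf.zeros([vocab_size]))

# True class plus a handful of sampled (negative) classes for one example.
sampled_ids = tf.constant([42, 7, 999, 3150, 8021], dtype=tf.int64)

# Row lookup: a gather along dimension 0 of the [vocab_size, dim] matrix.
sampled_w = tf.nn.embedding_lookup(softmax_weights, sampled_ids)  # [5, embed_dim]
sampled_b = tf.nn.embedding_lookup(softmax_biases, sampled_ids)   # [5]

# Logits are then computed only for these few classes, not all 10000.
hidden = tf.random.normal([1, embed_dim])                             # stand-in hidden vector
logits = tf.matmul(hidden, sampled_w, transpose_b=True) + sampled_b   # [1, 5]
```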