
Why does Tensorflow's sampled_softmax_loss force you to use a bias, when experts recommend no bias be used for Word2Vec?


All the TensorFlow implementations of Word2Vec that I have seen include a bias in the negative sampling softmax function, including the one on the official TensorFlow website:

https://www.tensorflow.org/tutorials/word2vec#vector-representations-of-words

loss = tf.reduce_mean(
  tf.nn.nce_loss(weights=nce_weights,
                 biases=nce_biases,
                 labels=train_labels,
                 inputs=embed,
                 num_sampled=num_sampled,
                 num_classes=vocabulary_size))

This is from Google's free Deep Learning course: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/udacity/5_word2vec.ipynb

loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases, inputs=embed,
                               labels=train_labels, num_sampled=num_sampled, num_classes=vocabulary_size))

However, neither Andrew Ng's nor Richard Socher's lectures include a bias in their negative-sampling softmaxes.

Even in the original work from which this idea comes, Mikolov states that:

biases are not used in the neural network, as no significant improvement of performance was observed - following the Occam's razor, the solution is as simple as it needs to be.

Mikolov, T.: Statistical Language Models Based on Neural Networks, p. 29 http://www.fit.vutbr.cz/~imikolov/rnnlm/thesis.pdf
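
Concretely, using the notation of Mikolov's papers (where $v_{w_I}$ is the input embedding and $v'_{w_O}$ the output embedding; these symbols are notation I am introducing, not from the sources above), the bias-free softmax these sources describe is

$$ p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V} \exp\left({v'_{w}}^{\top} v_{w_I}\right)} $$

whereas TensorFlow's sampled losses compute logits of the form ${v'_w}^{\top} v_{w_I} + b_w$, with one extra learned offset $b_w$ per output word.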

So why do the official TensorFlow implementations include a bias, and why does the sampled_softmax_loss function not seem to offer an option to omit it?


Solution

  • The exercise you link defines softmax_biases to be zeros:

    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
    

    That is: they're not using any actual bias in their word2vec example.

    The sampled_softmax_loss() function is generic and used for many neural networks; its decision to require a biases argument is unrelated to what's best for one particular application (word2vec). It accommodates the word2vec case by allowing, as here, all zeros.
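
  • If you want to guarantee that the bias stays at zero for the whole of training, one option is to pass a plain zero tensor rather than a tf.Variable, so the optimizer has nothing to update. A minimal sketch, assuming TensorFlow 1.x and hyperparameter values chosen purely for illustration:

    import tensorflow as tf

    vocabulary_size = 50000   # illustrative values, not from the exercise
    embedding_size = 128
    num_sampled = 64

    # Input embeddings and output (softmax) weights, as in the exercise.
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    softmax_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size], stddev=0.1))

    # A constant zero tensor, not a tf.Variable: it is not trainable,
    # so the model stays genuinely bias-free throughout training.
    softmax_biases = tf.zeros([vocabulary_size])

    train_inputs = tf.placeholder(tf.int32, shape=[None])
    train_labels = tf.placeholder(tf.int32, shape=[None, 1])
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)

    loss = tf.reduce_mean(
        tf.nn.sampled_softmax_loss(weights=softmax_weights,
                                   biases=softmax_biases,
                                   labels=train_labels,
                                   inputs=embed,
                                   num_sampled=num_sampled,
                                   num_classes=vocabulary_size))

    By contrast, tf.Variable(tf.zeros([vocabulary_size])) only initializes the biases to zero; variables are trainable by default, so an optimizer minimizing this loss is still free to move them away from zero.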