Why does TensorFlow's sampled_softmax_loss force you to use a bias, when experts recommend that no bias be used for Word2Vec?

All the TensorFlow implementations of Word2Vec that I have seen include a bias in the negative-sampling softmax function, including the one on the official TensorFlow website:

loss = tf.reduce_mean(

This one is from Google's free Deep Learning course:

loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases,
                               inputs=embed, labels=train_labels,
                               num_sampled=num_sampled, num_classes=vocabulary_size))

However, neither Andrew Ng nor Richard Socher includes a bias in the negative-sampling softmax in their lectures.

Even in the work where this idea originated, Mikolov states:

biases are not used in the neural network, as no significant improvement of performance was observed - following the Occam's razor, the solution is as simple as it needs to be.

Mikolov, T.: Statistical Language Models Based on Neural Networks, p. 29

So why do the official TensorFlow implementations include a bias, and why does there seem to be no option to omit the bias in the sampled_softmax_loss function?


  • The exercise you link defines softmax_biases to be zeros:

    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))

    That is: they're not using any actual bias in their word2vec example.

    The sampled_softmax_loss() function is generic and is used for many kinds of neural networks. Its requirement of a biases argument has nothing to do with what is best for one particular application (word2vec), and it accommodates the word2vec case by accepting (as here) a vector of all zeros.
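
One caveat: a tf.Variable is trainable by default, so a zero-initialized variable can move away from zero once training starts. If the goal is to have no bias at all, one option is to pass a constant tensor of zeros instead. The following is a minimal sketch, assuming TensorFlow 2.x in eager mode; the sizes, the random inputs, and the name zero_biases are made up for illustration and do not come from the course notebook.

```python
import tensorflow as tf

# Toy sizes, chosen only for illustration.
vocabulary_size = 50
embedding_size = 8
batch_size = 4
num_sampled = 5

# Output-layer weights: one row per vocabulary word.
softmax_weights = tf.Variable(
    tf.random.truncated_normal([vocabulary_size, embedding_size], stddev=0.1))

# A constant tensor of zeros rather than a tf.Variable: it satisfies the
# required `biases` argument, but an optimizer has nothing to update.
zero_biases = tf.zeros([vocabulary_size])

# Stand-ins for the looked-up embeddings and the target word ids.
embed = tf.random.normal([batch_size, embedding_size])
train_labels = tf.constant([[1], [2], [3], [4]], dtype=tf.int64)

loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=zero_biases,
                               labels=train_labels, inputs=embed,
                               num_sampled=num_sampled,
                               num_classes=vocabulary_size))
```

With this setup the only trainable parameters in the output layer are softmax_weights, which matches the bias-free formulation described in the question.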