All the TensorFlow implementations of word2vec that I have seen include a bias in the negative-sampling softmax, including the one on the official TensorFlow website:
https://www.tensorflow.org/tutorials/word2vec#vector-representations-of-words
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_labels,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=vocabulary_size))
This is from Google's free Deep Learning course https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/udacity/5_word2vec.ipynb
loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=softmax_weights,
                               biases=softmax_biases,
                               inputs=embed,
                               labels=train_labels,
                               num_sampled=num_sampled,
                               num_classes=vocabulary_size))
However, in both Andrew Ng's and Richard Socher's lectures, the negative-sampling softmax does not include a bias.
Even in the work where this idea originated, Mikolov states that:
biases are not used in the neural network, as no significant improvement of performance was observed - following the Occam's razor, the solution is as simple as it needs to be.
Mikolov, T.: Statistical Language Models Based on Neural Networks, p. 29 http://www.fit.vutbr.cz/~imikolov/rnnlm/thesis.pdf
So why do the official TensorFlow implementations include a bias, and why does there not seem to be an option to omit the bias in the sampled_softmax_loss function?
The exercise you link defines softmax_biases to be zeros:
softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
That is: they're not using any actual bias in their word2vec example.
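To see why an all-zeros bias vector is equivalent to having no bias at all, here is a minimal numpy sketch (not the actual TensorFlow code) of the logit computation that both nce_loss and sampled_softmax_loss perform internally, logits = inputs @ weights.T + biases:

```python
import numpy as np

# Minimal numpy sketch of the sampled-loss logit computation.
# With an all-zeros bias vector, the logits are identical to the
# bias-free case, so the bias term has no effect on the loss.
rng = np.random.default_rng(0)
vocabulary_size, embedding_size, batch = 50, 8, 4

weights = rng.normal(size=(vocabulary_size, embedding_size))
biases = np.zeros(vocabulary_size)  # as in the linked exercise
embed = rng.normal(size=(batch, embedding_size))

logits_with_zero_bias = embed @ weights.T + biases
logits_without_bias = embed @ weights.T

assert np.allclose(logits_with_zero_bias, logits_without_bias)
```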
The sampled_softmax_loss() function is generic and used for many kinds of neural networks; its requirement of a biases argument reflects that generality rather than what's best for one particular application (word2vec), and it accommodates the word2vec case by allowing (as here) all zeros.
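To illustrate that design point, here is a hedged sketch of a generic sampled-loss helper that requires a biases argument. toy_sampled_logits is a hypothetical stand-in, not the real tf.nn.sampled_softmax_loss implementation; the point is only that a model which wants a bias passes a learned vector, while a word2vec caller satisfies the same signature with zeros:

```python
import numpy as np

def toy_sampled_logits(weights, biases, inputs, sampled_ids):
    """Hypothetical generic helper: score inputs against sampled classes only."""
    w = weights[sampled_ids]   # (num_sampled, dim) rows of the output matrix
    b = biases[sampled_ids]    # (num_sampled,) matching bias entries
    return inputs @ w.T + b    # (batch, num_sampled) logits

rng = np.random.default_rng(1)
vocab, dim = 20, 5
weights = rng.normal(size=(vocab, dim))
inputs = rng.normal(size=(3, dim))
sampled = np.array([2, 7, 11])

# A model that needs a bias passes a learned vector...
with_bias = toy_sampled_logits(weights, rng.normal(size=vocab), inputs, sampled)
# ...while word2vec passes zeros, using the same generic interface.
no_bias = toy_sampled_logits(weights, np.zeros(vocab), inputs, sampled)
assert with_bias.shape == no_bias.shape == (3, 3)
```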