How is it possible to use softmax for word2vec? I mean, softmax outputs the probabilities of all classes, which sum up to 1, e.g. `[0, 0.1, 0.8, 0.1]`. But if my label is, for example, `[0, 1, 0, 1, 0]` (multiple correct classes), then it is impossible for softmax to output that value. Should I use something other than softmax? Or am I missing something?
I suppose you're talking about the Skip-Gram model (i.e., predicting a context word from the center word), because the CBOW model predicts the single center word, so it assumes exactly one correct class.
Strictly speaking, if you were to train word2vec with the SG model and the ordinary softmax loss, the correct label for your example would be `[0, 0.5, 0, 0.5, 0]`. Alternatively, you can feed several examples per center word, with labels `[0, 1, 0, 0, 0]` and `[0, 0, 0, 1, 0]`. It's hard to say which one performs better, but either way the label must be a valid probability distribution per input example.
In practice, however, ordinary softmax is rarely used, because there are too many classes: computing the full distribution is too expensive and simply not needed (almost all probabilities are nearly zero all the time). Instead, researchers use sampled loss functions for training, which approximate the softmax loss but are much more efficient. The following loss functions are particularly popular:

- Negative sampling
- Noise-contrastive estimation (NCE)
- Sampled softmax
These losses are more complicated than softmax, but if you're using tensorflow, all of them are implemented and can be used just as easily.
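For example, here is a minimal sketch of a Skip-Gram loss built on `tf.nn.nce_loss`; the vocabulary size, embedding dimension, number of sampled negatives, and the `skipgram_loss` helper name are arbitrary placeholders, not a reference implementation:

```python
import tensorflow as tf

vocab_size = 10000   # placeholder: size of the vocabulary
embedding_dim = 128  # placeholder: embedding dimensionality
num_sampled = 64     # negative classes sampled per batch

# Trainable parameters: input embeddings plus output (NCE) weights and biases.
embeddings = tf.Variable(tf.random.uniform([vocab_size, embedding_dim], -1.0, 1.0))
nce_weights = tf.Variable(tf.random.truncated_normal([vocab_size, embedding_dim], stddev=0.1))
nce_biases = tf.Variable(tf.zeros([vocab_size]))

def skipgram_loss(center_ids, context_ids):
    """center_ids: int64 [batch], context_ids: int64 [batch, 1] true-class labels."""
    embed = tf.nn.embedding_lookup(embeddings, center_ids)  # [batch, dim]
    # NCE approximates the full softmax by scoring the true class
    # against `num_sampled` randomly drawn negative classes.
    return tf.reduce_mean(
        tf.nn.nce_loss(
            weights=nce_weights,
            biases=nce_biases,
            labels=context_ids,
            inputs=embed,
            num_sampled=num_sampled,
            num_classes=vocab_size))

# Example usage with a dummy batch of (center, context) index pairs:
center_ids = tf.constant([1, 1], dtype=tf.int64)
context_ids = tf.constant([[0], [3]], dtype=tf.int64)
loss = skipgram_loss(center_ids, context_ids)
```

`tf.nn.sampled_softmax_loss` takes the same arguments, so you can swap it in if you prefer the sampled-softmax approximation over NCE.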