Search code examples
machine-learningnlpword2vechierarchicalsoftmax

What's inside inner vertices in Word2Vec Hierarchical Softmax?


I have a question about Hierarchical Softmax. Actually, I do not quite understand what is stored in inner vertices (which are not leaf vertices). I clearly understand the main idea of this algorithm, but each step we calculate dot product of input word embedding with the word embedding of inner vertice. So what vectors are inside these inner vertices? Is it randomly initialized vectors of size that equals to embedding_size and then their coordinates change due to backpropagation step until we stop?


Solution

  • While there are many slightly-different ways to think about it, it may help to consider the values in the (trained, frozen) neural network as associated with edges moreso than vertexes (nodes).

    The network "projection weights" leading from an (abstract) one-hot encoding of each known word into the network are essentially the actual per-word word-vectors. Assuming a common case of 300d vectors, the 300 edges from the single-word node to the inner nodes are that words.

    Then, there's another set of edge-weights from the internal activation to the "output" layer, which is offering the network's training goal of in-context word-projection.

    In the (more common & usual default) negative-sampling approach, each output node corresponds to a single predictable word. That's a very easy output-shape to visualize. And the value of negative-sampling is that you only check the activations of the desired word, and n more randomly chosen negative words, to perform your training updates. That's way less calculation than if you checked the output values at all V (size of vocabulary) output nodes, and still works pretty well, and doesn't get more expensive with larger vocabularies (unless you choose for other reasons to also increase your choice of n negative samples).

    In hierarchical softmax, the interpretation of the output nodes is more complicated. Rather than the activation at a single node indicating the prediction of a single word, a (varying) set of nodes must have the right on/off activations to communicate the variable-length huffman-code of a word.

    So in HS, to backprop the network more towards predicting your desired "positive" word (from one input context), the algorithm considers just those nodes involved in that words nique coding – a smaller set of nodes for the most-ommon words, but a larger set of nodes for rarer words – and nudges each of them more towards the pattern that predicts the desired word. Again, you get the sparse training efficiency of updating only a tiny subset, far smaller than all V nodes, each training-step. But, the cost will vary based on the target word, and grow with the log of V as vocabulary-size grows. Further, as the original/naive assignment of codes is based strictly on word-frequency, quite-dissimilar words may have very-similar codings, perhaps causing more word-to-word interference. (There were hints in the original word2vec.c release of refining the HS word-codings over time to ensure similar words share similar Huffman codings, but I've seen litle followup on that idea, perhaps because of the dominance of negative-sampling.)

    So, in an HS network, the weights from the inner-activations, to the output nodes, are tuned to indicate, by Huffman code, which word is the preferred prediction from a context.

    In the word2vec implementation I'm most familiar with, Python Gensim, these "hidden to output" weights are not even randomly initialized at the beginning, instead left as 0.0 – and I think this was directly copied from the initialization of Google's word2vec.c release. But, as soon as training begins, the explict random initialization of those "input weights" (initial random input word-vectors) means those weights are immediately perturbed in a way that at 1st is nearly all random but becomes more helpful over the SGD training.

    So:

    • those inner weights start 0.0 but quickly start reflecting the influence of the (initially-random) word vectors and training examples
    • they're usually not harvested from the network after training, like the final word-vectors are – but would be kept around if you wanted to continue training later, and perhaps provide a running start to future training runs

    (In the negative-sampling case, some research suggested those hidden-to-output weights could also be interpreted as same embedding_size per-word vectors, with some usefulness: see paper by Mitra et al at Microsoft about "Dual Word Embeddings". But given the varying-length codings of output words in the HS case, extracting/interpretating those output-weights would be trickier.)

    In the implementation I'm most familiar with, the It may help to think instead of the shallow neural network's "projection layer" (effectively the word-vectors themselves, as each virtual "single node" 1-hot word has its embedding_size out-weights) and "hidden layer