nlp · word2vec · one-hot-encoding

How to convert several-hot encoding to a dense vector?


I am doing an NLP experiment. What I have in mind is very similar to Word2Vec. I think my approach must already exist, maybe even as out-of-the-box code, but I don't know where to find it.

Word2Vec's input word vector is one-hot. So the size of each word vector is equal to the size of the vocab.

But my input word vector is a concatenation of several one-hot vectors. Maybe it can be called 'several-hot'. It's much shorter than a single one-hot vector over the vocabulary, but still sparse. I still want to densify it using Word2Vec's scheme.

I have used Gensim's Word2Vec model. It seems to accept only tokens as input. Does that mean it converts tokens to one-hot vectors internally? I would like to know whether there is any Word2Vec code that accepts custom input vectors.


Solution

  • In practice, Word2Vec implementations such as Gensim's never truly instantiate a one-hot representation (sparse or not). Instead, they use the lookup key (word string) to pull up a dense vector, either in-training (where that vector is being adjusted) or post-training (when that vector is being returned for use elsewhere).

    (Abstractly, that dense vector still corresponds to a neural network's internal weights, from the virtual "one-hot" input layer to the hidden layer of smaller dimensionality. But in implementations, it's a dictionary lookup from a word key to a row in a matrix, that row being the traditional "word vector".)

    If you have clusters of N words that you want to use with an existing model, which only has one vector per word, you may just want to look up all N words individually and either add or average them together. That's effectively what the neural network does during training in certain modes (like CBOW), where N context words are the input used to predict one target 'center' word.
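    A minimal sketch of that lookup-and-average approach, assuming Gensim 4.x; the tiny corpus and parameter values are illustrative stand-ins, not recommendations:

    ```python
    import numpy as np
    from gensim.models import Word2Vec

    # Toy corpus; substitute your own tokenized sentences.
    sentences = [
        ["new", "york", "city", "traffic"],
        ["los", "angeles", "city", "traffic"],
        ["new", "york", "subway", "delays"],
    ]
    model = Word2Vec(sentences, vector_size=32, min_count=1, epochs=50, seed=1)

    # Dense vector for a multi-word fragment: look up each word, then average.
    fragment = ["new", "york", "city"]
    fragment_vec = np.mean([model.wv[w] for w in fragment], axis=0)  # shape (32,)

    # Summing is the other option mentioned above; it differs only by a scale factor.
    fragment_sum = np.sum([model.wv[w] for w in fragment], axis=0)
    ```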

    (If instead you are training your own word2vec model, and certain trigrams are known to be relevant entities for which you want to learn new unique vectors, possibly unrelated to the unigram vectors for the same words, that would require some preprocessing of your training data: essentially, promote those trigrams to pseudowords, and let them go through the same iterative training process that true unigrams do in the usual case.)
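    If you go that pseudoword route, the preprocessing can be as simple as joining each known trigram with a delimiter before training. A sketch, where the trigram set, delimiter, and helper name are all assumptions for illustration:

    ```python
    from gensim.models import Word2Vec

    # Hypothetical set of trigrams you have decided deserve their own vectors.
    known_trigrams = {("new", "york", "city")}

    def promote_trigrams(tokens):
        """Replace any known trigram with a single underscore-joined pseudoword."""
        out, i = [], 0
        while i < len(tokens):
            tri = tuple(tokens[i:i + 3])
            if tri in known_trigrams:
                out.append("_".join(tri))  # e.g. "new_york_city"
                i += 3
            else:
                out.append(tokens[i])
                i += 1
        return out

    sentences = [["i", "love", "new", "york", "city", "pizza"],
                 ["new", "york", "city", "never", "sleeps"]]
    model = Word2Vec([promote_trigrams(s) for s in sentences],
                     vector_size=32, min_count=1, epochs=50)
    # model.wv["new_york_city"] is now a vector learned for the pseudoword itself.
    ```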

    ADDITIONAL THOUGHTS AFTER COMMENT BELOW:

    I'm a bit unclear about what sort of text/goals might give rise to your specific needs, but vaguely, in addition to considering an average-of-multiple-words, you may also want to look into the FastText variant of word2vec.

    FastText will learn (alongside full-word vectors) additional vectors for substrings of words seen in training. For languages where word-morphology (word roots) give good hints to meaning, or situations with typos & other corruption in data, these subword vectors can later help synthesize better-than-nothing guess-vectors for new out-of-vocabulary ("OOV") words that weren't seen during training.

    It does this by combining other subword vectors learned from the training data. So, an OOV word (whether typo or truly not-in-training-data) that shares lots of substrings with seen words winds up getting a very-similar vector.
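    A small sketch of that OOV behavior, again assuming Gensim 4.x (the corpus and character n-gram settings are illustrative):

    ```python
    from gensim.models import FastText

    sentences = [
        ["the", "government", "announced", "new", "regulations"],
        ["regulators", "welcomed", "the", "governmental", "review"],
    ]
    # min_n/max_n control which character n-grams get their own subword vectors.
    model = FastText(sentences, vector_size=32, min_count=1,
                     min_n=3, max_n=5, epochs=50)

    # "regulation" (singular) never appears in training, so it is out-of-vocabulary...
    print("regulation" in model.wv.key_to_index)          # False
    # ...but FastText can still synthesize a vector for it from shared subwords.
    oov_vec = model.wv["regulation"]
    print(model.wv.similarity("regulation", "regulations"))
    ```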

    To the extent you preprocess your original fragments to combine multigrams into single "words", according to some best guesses, the way that FastText still learns fragment-vectors might ensure you're still learning something about the constituent subsegments.

    Also: the Phrases model in Gensim implements a statistical method for sometimes combining unigram tokens into pairs, based on the idea that certain pairs, if appearing together at a (configurable) statistically-notable rate, might be better modeled as a new combined bigram "word".

    The results aren't typically aesthetically-pleasing, nor do they match a human's sense of which word-groups are really logical-phrases, no matter how much the parameters are tuned. (Always, some unwanted pairs are combined, and wanted pairs are missed.)

    But, such combinations, warts and all, sometimes help the resulting text representations on objective evaluations of downstream tasks like classification or info-retrieval. (And, applying Phrases repeatedly can create de facto trigrams, quadgrams, etc.)
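    For reference, a minimal Phrases sketch, assuming Gensim 4.x; min_count and threshold are deliberately set low so the toy corpus promotes something, and would need tuning on real data as noted above:

    ```python
    from gensim.models.phrases import Phrases

    sentences = [
        ["new", "york", "is", "busy"],
        ["new", "york", "never", "sleeps"],
        ["i", "moved", "to", "new", "york"],
        ["the", "city", "is", "busy"],
    ]

    # First pass: promote statistically-notable pairs into bigram pseudowords.
    bigrams = Phrases(sentences, min_count=1, threshold=0.1)
    print(bigrams[["i", "love", "new", "york"]])   # ['i', 'love', 'new_york'] with these settings

    # Applying Phrases again over the bigrammed corpus yields de facto trigrams,
    # quadgrams, and so on.
    trigrams = Phrases(bigrams[sentences], min_count=1, threshold=0.1)
    combined_corpus = [trigrams[bigrams[s]] for s in sentences]
    ```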