Tags: machine-learning, nlp, deep-learning, embedding, word2vec

How to use word embeddings/word2vec ... differently? With an actual, physical dictionary


If my title is incorrect/could be better, please let me know.

I've been trying to find an existing paper/article describing the problem I'm having: I'm trying to create vectors for words such that each word's vector equals the sum of the vectors of its parts. For example, Cardinal (the bird) would be equal to the sum of the vectors for red, bird, and ONLY those. To train such a model, the input might be something like a dictionary, where each word is defined by its attributes. Something like:

Cardinal: bird, red, ....

Bluebird: blue, bird,....

Bird: warm-blooded, wings, beak, two eyes, claws....

Wings: Bone, feather....

So in this instance, each word-vector is equal to the sum of the word-vector of its parts, and so on.
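To make that concrete, here is a rough sketch of what I have in mind (everything below is just my own illustration; the `definitions` table and `vector` helper are hypothetical, with primitive attributes treated as one-hot axes):

    import numpy as np

    # Hypothetical sketch of the idea above: a word's vector is the sum of its parts.
    definitions = {
        "cardinal": ["bird", "red"],
        "bluebird": ["blue", "bird"],
        "bird":     ["warm-blooded", "wings", "beak", "two eyes", "claws"],
        "wings":    ["bone", "feather"],
    }

    # Primitive attributes are the parts that are never themselves defined.
    primitives = sorted({part for parts in definitions.values() for part in parts
                         if part not in definitions})
    axis = {p: i for i, p in enumerate(primitives)}

    def vector(word):
        """A word's vector is the sum of its parts' vectors, down to primitives."""
        if word not in definitions:            # primitive: one-hot on its own axis
            v = np.zeros(len(primitives))
            v[axis[word]] = 1.0
            return v
        return sum(vector(part) for part in definitions[word])

    # vector("cardinal") == vector("bird") + the one-hot vector for "red"
    print(dict(zip(primitives, vector("cardinal"))))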

I understand that in the original word2vec, semantic relationships are preserved as vector arithmetic, such that Vec(Madrid) - Vec(Spain) + Vec(France) ≈ Vec(Paris).
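(For reference, I believe this is the sort of analogy query that gensim's pretrained vectors can reproduce; the model name below is the Google News set from gensim's downloader, which is a large download:)

    import gensim.downloader as api

    # Pretrained Google News word2vec vectors via gensim-data (large download).
    kv = api.load("word2vec-google-news-300")

    # Madrid - Spain + France should rank Paris near the top.
    print(kv.most_similar(positive=["Madrid", "France"], negative=["Spain"], topn=3))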

Thanks!

PS: Also, if it's possible, new words should be able to be added later on.


Solution

  • If you're going to be building a dictionary of the components you want, you don't really need word2vec at all. You've already specified the dimensions you want: just use them, e.g. in Python:

    kb = {"wings": {"bone", "feather"}, 
          "bird":  {"wings", "warm-blooded", ...}, ...}
    

    Since the values are sets, you can do set operations, e.g. intersection to find the attributes two entries share:

    kb["cardinal"] & kb["bluebird"]   # -> {"bird"}
    

    You'll need to find some way to decompose the elements recursively for comparisons, simplifications, etc. These are decisions you'll have to make based on what you expect to happen during such operations; one possible recursive expansion is sketched below.
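    For instance, expanding an entry down to its primitive attributes might look like this (a minimal sketch; `flatten` is a hypothetical helper, not part of any library, and the `seen` set guards against circular definitions):

    def flatten(kb, word, seen=None):
        """Recursively expand a word into the primitive attributes it bottoms out in."""
        seen = set() if seen is None else seen
        if word in seen:                  # guard against circular definitions
            return set()
        seen.add(word)
        parts = kb.get(word)
        if not parts:                     # not defined further: treat as a primitive
            return {word}
        out = set()
        for part in parts:
            out |= flatten(kb, part, seen)
        return out

    # Primitive attributes shared by two entries:
    print(flatten(kb, "cardinal") & flatten(kb, "bluebird"))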

    This sort of manual dictionary development is quite an old-fashioned approach. Folks like Schank and Abelson used to do this kind of thing in the 1970s. The problem is that as these dictionaries get more complex, they become intractable to maintain and less accurate in their approximations. You're welcome to try it as an exercise (it can be kind of fun!), but keep your expectations low.

    You'll also find aspects of meaning lost in these sorts of decompositions. One of word2vec's remarkable properties is its sensitivity to the gestalt of words: a word may have meaning composed of parts, but there is something in that composition that makes the whole greater than the sum of the parts. In a decomposition, that gestalt is lost.

    Rather than trying to build a dictionary, you might be best off exploring what W2V gives you anyway, from a large corpus, and seeing how you can leverage that information to your advantage. The linguistics of what exactly W2V renders from text aren't wholly understood, but in trying to do something specific with the embeddings, you might learn something new about language.
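    As a starting point, here is a minimal sketch of training and querying a model with gensim (assuming gensim 4.x; the toy corpus and parameters are placeholders, and in practice you'd want a much larger corpus):

    from gensim.models import Word2Vec

    # Toy corpus only so the sketch runs; use a real, large corpus in practice.
    corpus = [
        ["the", "cardinal", "is", "a", "red", "bird"],
        ["the", "bluebird", "is", "a", "blue", "bird"],
        ["birds", "have", "wings", "feathers", "and", "a", "beak"],
    ]

    model = Word2Vec(sentences=corpus, vector_size=50, window=5, min_count=1, workers=1)

    # Inspect what the embeddings have picked up about a word's neighbourhood.
    print(model.wv.most_similar("bird", topn=5))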