Tags: python, nlp, nltk, gensim, wordnet

Improve gensim most_similar() return values by using WordNet hypernyms


import gensim.downloader as api
glove = api.load('glove-wiki-gigaword-200')

I first ran this code to download the pre-trained model.

glove.most_similar(positive=['sushi', 'uae'], negative=['japan'])

This would then result in:

[('nahyan', 0.5181387066841125),
 ('caviar', 0.4778318405151367),
 ('paella', 0.4497394263744354),
 ('nahayan', 0.44313961267471313),
 ('zayed', 0.4321245849132538),
 ('omani', 0.4285220503807068),
 ('seafood', 0.4279175102710724),
 ('saif', 0.426000714302063),
 ('dirham', 0.4214130640029907),
 ('sashimi', 0.4165934920310974)]

In this example, we can see that the method failed to capture the 'type' or 'category' of the query: 'zayed' and 'nahyan' are not actually of the 'food' type; rather, they are person names.

The approach suggested by my professor is to use wordnet hypernyms to find the 'type'.

After much research, the closest solution I found is to somehow incorporate lowest_common_hypernyms(), which gives the lowest common hypernym between two synsets, and use it to filter the results of most_similar().

I am not sure if my idea makes sense and would like the community's feedback on this.

My idea is to compute the hypernyms of, e.g., 'sushi' and the hypernyms of all the similar words returned by most_similar(), and to keep only the words with the 'longest' lowest common hypernym path. I expect this should return the words that best match the 'type'.

Not sure if it makes sense...
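
If it helps to make it concrete, here is roughly what I have in mind – an untested sketch, where restricting to noun synsets and ranking by the depth of the shared hypernym are just my guesses at what 'longest lowest common hypernym path' should mean:

from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def deepest_common_hypernym_depth(word_a, word_b):
    # Depth of the deepest (most specific) common hypernym over all
    # noun-synset pairs of the two words; None if either word has no synsets.
    best = None
    for syn_a in wn.synsets(word_a, pos=wn.NOUN):
        for syn_b in wn.synsets(word_b, pos=wn.NOUN):
            for common in syn_a.lowest_common_hypernyms(syn_b):
                depth = common.max_depth()
                if best is None or depth > best:
                    best = depth
    return best

candidates = glove.most_similar(positive=['sushi', 'uae'], negative=['japan'])

# Re-rank candidates by how specific a hypernym they share with 'sushi'.
# Names like 'zayed' or 'nahyan' typically have no noun synsets at all, so
# they drop out, while 'caviar' or 'sashimi' should share a fairly specific
# hypernym (something like 'food') with 'sushi'.
scored = [(word, deepest_common_hypernym_depth('sushi', word))
          for word, _ in candidates]
scored = [(w, d) for w, d in scored if d is not None]
print(sorted(scored, key=lambda pair: pair[1], reverse=True))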


Solution

  • Does your proposed approach give adequate results when you try it?

    That's the only test of whether the idea makes sense.

    Word2vec is generally oblivious to all the variations of category that a lexicon like WordNet can provide – all the words that are similar to another word, in any aspect, will be neighbors. Even words that people consider opposites – like 'hot' and 'cold' – will often be fairly close to each other, in some direction in the coordinate space, as they are similar in what they describe and what contexts they're used in. (They can be drop-in replacements for each other.)

    Word2vec is also fairly oblivious to polysemy in its standard formulation.

    Some other things worth trying might be:

    • if you only need answers of a certain type, mix in some measure that ranks candidate answers by their closeness either to a word describing that type ('food') or to a vector representing multiple examples (say, an average of the vectors for many food names you'd know to be good answers); see the first sketch after this list
    • choose another vector-set, or train your own. There's no universal "goodness" for word-vectors: their quality for certain tasks will vary based on their training data & parameters. Vectors trained on something broader than Wikipedia (your named vector file), or on a text corpus more focused on your domain-of-interest – say, food criticism – might do better on some tasks. Changing training parameters can also change which kinds of similarity are most emphasized in the resulting vectors. For example, some observers have noticed that small context windows tend to put words that are direct drop-in replacements for each other closer together, while larger context windows bring words from the same domains-of-use closer, even if they're not drop-in replacements of the same 'type'. (It sounds like your current need might be best served by a model trained with smaller windows; see the second sketch after this list.)
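
    For example, a rough sketch of the first mix-in idea above – the anchor words and the 0.5/0.5 blend weight are arbitrary placeholders, not tuned values:

    import numpy as np

    # Placeholder anchor words standing in for the desired 'food' type;
    # averaging their vectors gives a crude 'type' direction.
    food_examples = ['food', 'dish', 'cuisine', 'seafood', 'rice']
    food_vector = np.mean([glove[w] for w in food_examples], axis=0)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Pull a larger candidate pool, then blend the original analogy score
    # with each candidate's closeness to the 'food' direction.
    candidates = glove.most_similar(positive=['sushi', 'uae'],
                                    negative=['japan'], topn=50)
    reranked = sorted(
        ((word, 0.5 * score + 0.5 * cosine(glove[word], food_vector))
         for word, score in candidates),
        key=lambda pair: pair[1], reverse=True)
    print(reranked[:10])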
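
    And a minimal sketch of training your own vectors with a smaller window – the corpus file and all the parameter values are placeholders, and the parameter is vector_size in gensim 4.x (size in older releases):

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # 'my_food_corpus.txt' is a placeholder: one whitespace-tokenized sentence
    # per line, ideally from text focused on your domain (e.g. food writing).
    sentences = LineSentence('my_food_corpus.txt')

    # A small window (e.g. 2) tends to favor drop-in-replacement neighbors of
    # the same 'type'; a large window favors broader topical relatedness.
    model = Word2Vec(sentences, vector_size=200, window=2,
                     min_count=5, epochs=10, workers=4)

    print(model.wv.most_similar(positive=['sushi', 'uae'], negative=['japan']))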