Tags: python, nlp, gensim, word2vec, fasttext

Gensim most_similar: find synonyms only (not antonyms)


Is there a way to make model.wv.most_similar in gensim return positive-meaning words only (i.e., show synonyms but not antonyms)?

For example, if I do:

import fasttext.util
from gensim.models.fasttext import load_facebook_model

fasttext.util.download_model('en', if_exists='ignore')  # fetch English cc.en.300.bin if not already present
model = load_facebook_model('cc.en.300.bin')            # load Facebook's native FastText format
model.wv.most_similar(positive=['honest'], topn=2000)

Then the model is also going to return words such as "dishonest":

('dishonest', 0.5542981028556824),

However, what if I want positive-meaning words only?

I have tried the following - subtracting "not" from "honest" in the vector space:

import fasttext.util
from gensim.models.fasttext import load_facebook_model

fasttext.util.download_model('en', if_exists='ignore')  # English
model = load_facebook_model('cc.en.300.bin')
# subtract 'not' from 'honest' in the vector space
model.wv.most_similar(positive=['honest'], negative=['not'], topn=2000)

But it is somehow still returning "dishonest":

('dishonest', 0.23721608519554138)
('dishonesties', 0.16536088287830353)

Any idea how to do this in a better way?


Solution

  • Unfortunately, the vector space created by word2vec training doesn't neatly match our human, intuitive understanding of pure synonymity.

    Rather, word2vec's sense of 'similarity' is more general - and overall, antonyms tend to be quite similar to each other: they're used in similar contexts (the driving force of word2vec training), about the same topics.
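
    As a quick sanity check of that claim (a minimal sketch, reusing the `model` loaded in the question), you can ask for the cosine similarity of the antonym pair directly:

    # cosine similarity of an antonym pair - typically fairly high,
    # because both words occur in very similar contexts
    model.wv.similarity('honest', 'dishonest')

    With the cc.en.300 vectors above, this comes out around 0.55, matching the score in the question's own output.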

    And further, even though many understandable contrasts do vaguely correlate with various directions in the space, there is no universal "opposite" (or "positive") direction. So composing 'not' with a word (whether added or subtracted) doesn't neatly invert the word's dominant sense, and subtracting 'not' from 'honest' won't reliably steer the results away from 'dishonest'.
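
    If you want to see this for yourself, one rough sketch (again reusing the loaded `model`; the pair 'happy'/'unhappy' is just an arbitrary second antonym pair for illustration) is to compare the offset vectors of two antonym pairs - if a single shared "opposite" direction existed, the offsets would be nearly parallel:

    import numpy as np

    # offset vectors for two antonym pairs
    d1 = model.wv['dishonest'] - model.wv['honest']
    d2 = model.wv['unhappy'] - model.wv['happy']

    # cosine similarity of the two offsets: a value well below 1.0
    # means the pairs don't share one common "negation" direction
    print(np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2)))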

    Barring finding some extra technique for this task beyond basic word2vec (in other research literature, or via your own experimentation), the best you may be able to do is to use already-known unwanted answers to further refine the results. That is, something like the following might offer marginally improved results:

    model.wv.most_similar(positive=['honest'], negative=['dishonest'])
    

    (Further expanding the examples with more related words, of both the kind you want and the kind you don't, might also help.)
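
    For instance, something along these lines (just a sketch - the added positive and negative words are arbitrary examples, not tuned values) stacks several known-good and known-bad words to better aim the query:

    model.wv.most_similar(
        positive=['honest', 'truthful', 'sincere'],  # examples of the sense you want
        negative=['dishonest', 'deceitful'],         # already-known unwanted answers
        topn=20,
    )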

    See also some of the comments & links in a previous answer for more ideas: https://stackoverflow.com/a/44491124/130288