python, nlp, gensim, word2vec, fasttext

Gensim sort_by_descending_frequency changes most_similar results


In Gensim, calling sort_by_descending_frequency() on a trained model's word vectors appears to change the results returned by most_similar().

Before sorting:

from gensim.models import FastText
from gensim.test.utils import common_texts  # some example sentences
print(len(common_texts))
model = FastText(vector_size=4, window=3, min_count=1)  # instantiate
model.build_vocab(corpus_iterable=common_texts)
model.train(corpus_iterable=common_texts, total_examples=len(common_texts), epochs=1)  

model.wv.most_similar(positive=["human"])
[('interface', 0.7432922720909119),
 ('minors', 0.6719315052032471),
 ('time', 0.3513716757297516),
 ('computer', 0.05815044790506363),
 ('response', -0.11714297533035278),
 ('graph', -0.15643596649169922),
 ('eps', -0.2679084539413452),
 ('survey', -0.34035828709602356),
 ('trees', -0.63677978515625),
 ('user', -0.6500451564788818)]

However, if I sort the vectors by descending frequency:

model.wv.sort_by_descending_frequency()

model.wv.most_similar(positive=["human"])
[('minors', 0.9638221263885498),
 ('time', 0.6335864067077637),
 ('interface', 0.40014874935150146),
 ('computer', 0.03224882856011391),
 ('response', -0.14850640296936035),
 ('graph', -0.2249641716480255),
 ('survey', -0.26847705245018005),
 ('user', -0.45202943682670593),
 ('eps', -0.497650682926178),
 ('trees', -0.6367797255516052)]

Both the ranking of the most similar words and the similarity scores themselves change. Any idea why?

Update:

Before calling sort:

model.wv.index_to_key
['system',
 'graph',
 'trees',
 'user',
 'minors',
 'eps',
 'time',
 'response',
 'survey',
 'computer',
 'interface',
 'human']
model.wv.expandos['count']

array([4, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2])

After calling sort:

model.wv.index_to_key
['system',
 'user',
 'trees',
 'graph',
 'human',
 'interface',
 'computer',
 'survey',
 'response',
 'time',
 'eps',
 'minors']
model.wv.expandos['count']

array([4, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2])


Solution

  • The reported similarities definitely shouldn't change, so something is surely going wrong here. (Perhaps cached subword info isn't being re-sorted.)

    But also note:

    • That method wasn't really meant for use after training. Indeed, you should be seeing a warning message if you use it that way.
    • Such a sort should already happen by default in all 2Vec algorithms at the end of the vocab-discovery phase. It's the usual behavior and only rarely turned off, so requesting it again should at most be a no-op.

    To dig into what may be going wrong, can you edit your question to show the values of both…

    • model.wv.index_to_key
    • model.wv.expandos['count']

    …before and after the .sort_by_descending_frequency() call?
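
To make the suspected failure mode concrete, here is a toy sketch in plain Python of what happens when keys and vectors are re-sorted but a cached per-row array is left in its old order. It is not Gensim's code: ToyVectors, reorder, and refresh_cache are hypothetical names invented for the illustration, and the cached array here happens to hold vector norms only because that is the simplest per-row value to cache. Whether anything like this is what actually goes wrong inside Gensim is an assumption, not something confirmed above.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


class ToyVectors:
    """Hypothetical stand-in for a keyed-vector store (NOT Gensim's implementation)."""

    def __init__(self, keys, vectors):
        self.keys = list(keys)
        self.vectors = [list(v) for v in vectors]
        # Per-row cache, computed once and aligned with self.keys by position.
        self.norms = [math.sqrt(dot(v, v)) for v in self.vectors]

    def similarity(self, a, b):
        """Cosine similarity that trusts the cached norms."""
        i, j = self.keys.index(a), self.keys.index(b)
        return dot(self.vectors[i], self.vectors[j]) / (self.norms[i] * self.norms[j])

    def reorder(self, order, refresh_cache):
        """Re-sort keys and vectors; optionally re-sort the cache as well."""
        self.keys = [self.keys[i] for i in order]
        self.vectors = [self.vectors[i] for i in order]
        if refresh_cache:
            self.norms = [self.norms[i] for i in order]
        # Otherwise the cache keeps its old row order (the assumed bug).


keys = ["system", "trees", "human", "interface"]
vectors = [[2.0, 0.0], [1.0, 1.0], [3.0, 4.0], [0.0, 1.0]]
order = [0, 1, 3, 2]  # "system" and "trees" keep their index; the other two swap

stale = ToyVectors(keys, vectors)
fixed = ToyVectors(keys, vectors)

before = stale.similarity("human", "trees")
unmoved_before = stale.similarity("system", "trees")

stale.reorder(order, refresh_cache=False)
fixed.reorder(order, refresh_cache=True)

after_stale = stale.similarity("human", "trees")
after_fixed = fixed.similarity("human", "trees")
unmoved_after = stale.similarity("system", "trees")

print("human/trees before sort:          ", before)
print("human/trees after, stale cache:   ", after_stale)
print("human/trees after, refreshed:     ", after_fixed)
print("system/trees before / after stale:", unmoved_before, unmoved_after)
```

In the toy, only pairs involving a key whose index moved get a different score; a pair whose keys both keep their position is unaffected. That matches one detail in the question's output: 'trees' sits at index 2 both before and after the sort, and its similarity to 'human' is essentially unchanged, while every other score shifts. This is at least consistent with some per-index cache not following the sort, though it doesn't identify which one.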