Tags: python, machine-learning, nlp, gensim, fasttext

How to run Fasttext get_nearest_neighbors() faster?


I'm trying to extract morphs/similar words in the Sinhala language using fastText, but fastText takes about 1 second for every 2.64 words. How can I increase the speed without changing the model size?

My code looks like this:

import json
import fasttext
import fasttext.util  # must be imported explicitly for fasttext.util.download_model
from tqdm import tqdm

fasttext.util.download_model('si', if_exists='ignore')  # Sinhala
ft = fasttext.load_model('cc.si.300.bin')

with open(r'/Datasets/si_words_filtered.txt') as words_file:
    words = words_file.readlines()
words = words[0:300]

synon_dict = dict()
for line in tqdm(words):
    word = line.strip()
    # get_nearest_neighbors returns (score, word) pairs; [0][1] is the top neighbor
    synon = ft.get_nearest_neighbors(word)[0][1]  ### takes a lot of time
    if is_strictly_sinhala_word(synon):
        synon_dict[word] = synon

with open("out.json", "w", encoding='utf8') as f:
    json.dump(synon_dict, f, ensure_ascii=False)


Solution

  • A fully accurate get_nearest_neighbors()-type calculation is inherently fairly expensive: for each query word, it requires a similarity calculation against every word in the set.

    As that set of vectors looks to be near or beyond 2GB in size when just the word-vectors are loaded, each lookup may have to scan on the order of 2GB of addressable memory, and that scan may be the dominant factor in the runtime.

    Some things to try that might help:

    • Ensure that you have plenty of RAM - if there's any use of 'swap'/virtual-memory, that will make things far slower.
    • Avoid all unnecessary comparisons - for example, perform your is_strictly_sinhala_word() check before the expensive step, so you can skip the costly lookup entirely when you won't want the results. Also, consider shrinking the full set of word-vectors to eliminate words you are unlikely to want as responses - for example, words you know are not in the language of interest, or all lower-frequency words. (If you can throw out half the words as possible nearest-neighbors before even trying get_nearest_neighbors(), it will go roughly twice as fast.) More on these options below.
    • Try other word-vector libraries, to see if they offer any improvement. For example, the Python Gensim project can load either plain sets of full-word vectors (eg, the cc.si.300.vec words-only file) or FastText models (the .bin file), and offers a .most_similar() function that has some extra options & might, in some cases, offer different performance. (A sketch of this approach follows this list. The official Facebook Fasttext .get_nearest_neighbors() is probably pretty good, though.)
    • Use an "approximate nearest neighbors" library to pre-build an index of the word-vector space that can then offer extra-fast nearest-neighbor lookups - although at some risk of not finding the exact right top-N neighbors. There are many such libraries – see this benchmarking project that compares over 20 of them. But adding this step complicates things, & the tradeoff of that complexity & the imperfect results may not be worth the effort & time-savings. So just remember that it's a possibility if your need is large enough & nothing else helps. (A sketch of this route also appears after this list.)
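
    For illustration, here is a minimal sketch of the question's loop redone with Gensim's exact .most_similar(), with the cheap is_strictly_sinhala_word() check moved ahead of the expensive lookup as suggested above. It assumes Gensim 4.x, that the cc.si.300.vec words-only file has been downloaded alongside the .bin, and that words & is_strictly_sinhala_word() are the asker's own list & helper:

    from gensim.models import KeyedVectors

    # Load the words-only vectors. Unlike the .bin model, these carry no
    # subword info, so unknown query words must simply be skipped.
    kv = KeyedVectors.load_word2vec_format('cc.si.300.vec', binary=False)

    synon_dict = dict()
    for line in words:
        word = line.strip()
        # Cheap filters first: skip the costly lookup for words we'd
        # never keep, & for words not in the loaded vocabulary.
        if not is_strictly_sinhala_word(word) or word not in kv:
            continue
        # Gensim returns (word, score) pairs - the reverse of fasttext's
        # (score, word) ordering.
        synon = kv.most_similar(word, topn=1)[0][0]
        if is_strictly_sinhala_word(synon):
            synon_dict[word] = synon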
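
    And here is a sketch of the approximate-nearest-neighbors route, using Gensim's bundled Annoy integration. It assumes Gensim 4.x with the separate annoy package installed; num_trees=100 is an arbitrary accuracy/speed tradeoff, not a recommendation:

    from gensim.models import KeyedVectors
    from gensim.similarities.annoy import AnnoyIndexer  # pip install annoy

    kv = KeyedVectors.load_word2vec_format('cc.si.300.vec', binary=False)

    # One-time index build - slow up front; more trees = better accuracy.
    indexer = AnnoyIndexer(kv, num_trees=100)

    # Later queries reuse the index & run much faster than a full scan,
    # at some risk of missing the exact top-N neighbors.
    query = words[0].strip()  # any in-vocabulary word
    neighbors = kv.most_similar(query, topn=10, indexer=indexer)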

    With regard to slimming the set of vectors searched:

    • The Gensim KeyedVectors.load_word2vec_format() function, which can load the .vec words-only file, has a limit option that will read only the specified number of words from the file. It looks like the .vec file for your dataset has over 800k words - but if you chose to load only 400k, your .most_similar() calculations would go about twice as fast. (And, since such files typically put the most-common words first, the loss of the far-rarer words may not be a concern.)
    • Similarly, even if you load all the vectors, the Gensim .most_similar() function has a restrict_vocab option that can limit searches to just the first N words, which could also speed things up or helpfully drop obscure words that may be of less interest.
    • The .vec file may be easier to work with if you want to pre-filter the words to, for example, eliminate non-Sinhala words. (Note: the usual .load_word2vec_format() text format needs a 1st line that declares the count of words & the vector dimensionality. You may leave that off, then load using the no_header=True option, which instead uses 2 full passes over the file to get the count.) A sketch of these options follows.
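
    As an illustrative sketch of those three options together (Gensim 4.x; the 400k/200k cutoffs & the si_filtered.vec filename are made-up examples, not recommendations):

    from gensim.models import KeyedVectors

    # Option 1: load only the first 400k (most-frequent) of the ~800k vectors.
    kv = KeyedVectors.load_word2vec_format('cc.si.300.vec', binary=False,
                                           limit=400000)

    # Option 2: even with everything loaded, restrict each individual search
    # to the first 200k (most-frequent) words.
    query = 'word'  # substitute any in-vocabulary Sinhala word
    neighbors = kv.most_similar(query, topn=10, restrict_vocab=200000)

    # Option 3: load a hand-filtered file that was saved without the usual
    # "count dimensions" header line.
    kv = KeyedVectors.load_word2vec_format('si_filtered.vec', binary=False,
                                           no_header=True)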