I'm trying to extract morphs/similar words in the Sinhala language using FastText, but FastText only gets through about 2.64 words per second. How can I increase the speed without changing the model size?
My code looks like this:
import fasttext
import fasttext.util

fasttext.util.download_model('si', if_exists='ignore')  # Sinhala
ft = fasttext.load_model('cc.si.300.bin')

words_file = open(r'/Datasets/si_words_filtered.txt')
words = words_file.readlines()
words = words[0:300]

synon_dict = dict()

from tqdm import tqdm_notebook
for i in tqdm_notebook(range(len(words))):
    word = words[i].strip()
    synon = ft.get_nearest_neighbors(word)[0][1]  ### takes a lot of time
    if is_strictly_sinhala_word(synon):
        synon_dict[word] = synon

import json
with open("out.json", "w", encoding='utf8') as f:
    json.dump(synon_dict, f, ensure_ascii=False)
To do a fully accurate get_nearest_neighbors()-type of calculation is inherently fairly expensive, requiring a lookup & calculation against every word in the set, for each new word.

As that set of word-vectors looks to be near or beyond 2GB in size when just the word-vectors are loaded, a scan of 2GB of addressable memory may be the dominant factor in the runtime.
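Conceptually (this is only an illustration, not FastText's actual implementation), each lookup amounts to something like:

import numpy as np

def nearest_neighbor(query_vec, unit_vectors, words):
    # unit_vectors: (n_words, dims) array of length-normalized word-vectors, one row per word.
    # A single lookup must compute a similarity against *every* row; that full scan,
    # repeated for each query word, is what dominates the runtime.
    q = query_vec / np.linalg.norm(query_vec)
    sims = unit_vectors @ q           # cosine similarities to all words at once
    best = int(np.argmax(sims))
    return words[best], float(sims[best])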
Some things to try that might help:
Do the is_strictly_sinhala_word() check before the expensive step, so you can skip the costly step if you're not interested in the results. Also, you could consider shrinking the full set of word-vectors to eliminate those that you are unlikely to want as responses. This might involve throwing out words you know are not of the language-of-interest, or all lower-frequency words. (If you can throw out half the words as possible nearest-neighbors before even trying the get_nearest_neighbors(), it will go roughly twice as fast.) More on these options below.
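For example, a minimal sketch of reordering your loop, assuming you'd also only want neighbors for query words that themselves pass the is_strictly_sinhala_word() check:

for i in tqdm_notebook(range(len(words))):
    word = words[i].strip()
    # Cheap check first: skip the expensive neighbor lookup entirely for
    # query words whose results you wouldn't keep anyway.
    if not is_strictly_sinhala_word(word):
        continue
    synon = ft.get_nearest_neighbors(word)[0][1]
    if is_strictly_sinhala_word(synon):
        synon_dict[word] = synon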
You could also try Gensim's KeyedVectors support, which can load either the plain word-vectors (the cc.si.300.vec words-only file) or full FastText models (the .bin file), and offers a .most_similar() function that has some extra options & might, in some cases, offer different performance. (Though, the official Facebook FastText .get_nearest_neighbors() is probably pretty good.)
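A rough sketch of that route, assuming the cc.si.300.vec file has been downloaded alongside the .bin:

from gensim.models import KeyedVectors

# Load just the full-word vectors from the plain-text .vec file.
# (gensim.models.fasttext.load_facebook_vectors('cc.si.300.bin') would load the full model instead.)
kv = KeyedVectors.load_word2vec_format('cc.si.300.vec')

# Note Gensim returns (word, similarity) pairs, most-similar first.
synon = kv.most_similar(word, topn=1)[0][0]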
With regard to slimming the set of vectors searched:
The KeyedVectors.load_word2vec_format() function, which can load the .vec words-only file, has a limit option that will only read the specified number of words from the file. It looks like the .vec file for your dataset has over 800k words - but if you chose to load only 400k, your .most_similar() calculations would go about twice as fast. (And, since such files typically front-load the file with the most-common words, the loss of the far-rarer words may not be a concern.)
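For instance (400,000 here is just the illustrative count from above, not a tuned value):

from gensim.models import KeyedVectors

# Only the first 400k entries (roughly the most-frequent words) are read,
# so every later most_similar() scan covers half as many vectors.
kv = KeyedVectors.load_word2vec_format('cc.si.300.vec', limit=400000)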
The .most_similar() function has a restrict_vocab option that can limit searches to just the first words of that count, which could also speed things up, or helpfully drop obscure words that may be of less interest.
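For example, continuing with the kv object loaded above:

# Only the 1st 400k vectors (the most-frequent words) are considered as candidate neighbors.
synon = kv.most_similar(word, topn=1, restrict_vocab=400000)[0][0]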
The .vec file may also be easier to work with if you wanted to pre-filter the words to, for example, eliminate non-Sinhala words. (Note: the usual .load_word2vec_format() text format needs a 1st line that declares the count of words & the word-dimensionality, but you may leave that off, then load using the no_header=True option, which instead uses 2 full passes over the file to get the count.)
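A rough sketch of that pre-filtering, reusing your is_strictly_sinhala_word() helper (the cc.si.300.filtered.vec name is just a placeholder, and no_header=True requires Gensim 4.x):

from gensim.models import KeyedVectors

# Keep only lines whose word passes the Sinhala check; write them out without a header line.
with open('cc.si.300.vec', encoding='utf8') as src, \
        open('cc.si.300.filtered.vec', 'w', encoding='utf8') as dst:
    next(src)  # skip the original "word-count dimensions" header line
    for line in src:
        if is_strictly_sinhala_word(line.split(' ', 1)[0]):
            dst.write(line)

# no_header=True tells Gensim to count the words itself via 2 passes over the file.
kv = KeyedVectors.load_word2vec_format('cc.si.300.filtered.vec', no_header=True)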