Tags: python, multithreading, word2vec, gensim, doc2vec

Gensim word2vec / doc2vec multi-threading parallel queries


I would like to call model.wv.most_similar_cosmul on the same copy of the model object, using multiple cores, on batches of input pairs.

The multiprocessing module requires multiple copies of the model, which would take too much RAM because my model occupies 30+ GB.

I have tried evaluating my query pairs serially; the first round took ~12 hours, and there may be more rounds coming. That's why I am looking for a threading solution, though I understand Python's Global Interpreter Lock (GIL) limits CPU-bound threading.

Any suggestions?


Solution

  • Forking off processes with multiprocessing, after your text-vector model is in memory and unchanging, might let many processes share the same object in memory.

    In particular, you'd want to be sure that the automatic generation of unit-normed vectors (into a syn0norm or doctag_syn0norm) has already happened. It'll be automatically triggered the first time it's needed by a most_similar() call, or you can force it with the init_sims() method on the relevant object. If you'll only be doing most-similar queries between unit-normed vectors, never needing the original raw vectors, use init_sims(replace=True) to clobber the raw mixed-magnitude syn0 vectors in-place and thus save a lot of addressable memory.
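    A minimal, stdlib-only sketch of this fork-after-load pattern (a large bytes object stands in for the gensim model, which is assumed rather than shown; with a real model you would load it and call init_sims(replace=True) at the marked point, before forking):

    ```python
    import hashlib
    import multiprocessing as mp

    # Stand-in for a large read-only model already loaded in the parent.
    # With gensim, you would load the model and call init_sims(replace=True)
    # here, before forking, so children inherit the unit-normed vectors.
    BIG = bytes(range(256)) * 4096  # ~1 MB of shared, unchanging data

    def worker(batch_id, out_q):
        # The forked child reads BIG through copy-on-write pages;
        # nothing is pickled or duplicated up front.
        digest = hashlib.sha256(BIG).hexdigest()
        out_q.put((batch_id, digest))

    ctx = mp.get_context("fork")  # POSIX-only; fork preserves parent memory
    q = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(i, q)) for i in range(2)]
    for p in procs:
        p.start()
    results = [q.get() for _ in procs]
    for p in procs:
        p.join()

    # Every child saw the identical parent data without its own copy.
    assert {d for _, d in results} == {hashlib.sha256(BIG).hexdigest()}
    ```

    Note that the "fork" start method is only available on POSIX systems, and the sharing holds only while the pages stay unmodified; any write in a child triggers a private copy of the touched pages.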

    Gensim also has options to use memory-mapped files as the source of a model's giant arrays, and when multiple processes use the same read-only memory-mapped file, the OS is smart enough to map that file into physical memory only once, giving each process a pointer to the shared array.
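    A stdlib-only sketch of that read-only sharing (a temp file stands in for the model's on-disk array; the gensim-specific loading options are not shown here):

    ```python
    import mmap
    import os
    import tempfile

    # A small file standing in for the model's on-disk vector array.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"\x01\x02\x03\x04" * 1024)

    with open(path, "rb") as f:
        # Read-only mapping: the OS backs every such mapping of this file
        # with the same physical pages, however many processes open it.
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    pid = os.fork()  # POSIX-only sketch of a second reader process
    if pid == 0:
        # Child: read through the inherited mapping; no data is copied.
        os._exit(0 if m[:4] == b"\x01\x02\x03\x04" else 1)

    _, status = os.waitpid(pid, 0)
    child_ok = os.WEXITSTATUS(status) == 0
    parent_ok = m[:4] == b"\x01\x02\x03\x04"

    m.close()
    os.remove(path)
    ```

    Both readers see the same bytes while the OS keeps a single physical copy of the mapped pages, which is what makes this viable for a 30+ GB model.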

    For more discussion of the tricky parts of using this technique in a similar-but-not-identical use case, see my answer at:

    How to speed up Gensim Word2vec model load time?