Tags: python-2.7, word2vec, gensim, blas, doc2vec

Optimizing gensim (C compiler and BLAS) on Windows 7


I want to optimize gensim to run doc2vec on Windows 7.

[1] C compiler

I installed gensim by following these instructions: https://radimrehurek.com/gensim/install.html

pip install --upgrade gensim

However, this page (https://radimrehurek.com/gensim/models/doc2vec.html) says that a C compiler is needed before installing gensim:

Make sure you have a C compiler before installing gensim, to use optimized (compiled) doc2vec training (70x speedup [blog]).

  1. Should I do something before using pip?

[2] BLAS

The tutorial at https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb says:

Time to Train

If the BLAS library is being used, this should take no more than 3 seconds. If the BLAS library is not being used, this should take no more than 2 minutes, so use BLAS if you value your time.

So it seems I have to install BLAS to get the optimized code, but I have no idea what BLAS is, and the few BLAS installation guides I can find for Windows are complicated.

  1. Which BLAS library should I install to run gensim on Windows?
  2. If I install a BLAS library, will it be linked automatically when I run gensim doc2vec from Python, or do I need to do something to link it to the doc2vec code?

Solution

  • gensim's optimized code needs more than just BLAS: it also relies on natively compiled extensions built from Cython code.

    If at all possible, this kind of work should be done on UNIX-like systems (Linux/MacOS), because that's where most of the open-source libraries are most developed, tested, and used. So you'll be closer to the system configurations of the primary developers, and larger user community – meaning default installation instructions are more likely to "just work", and any problems you run into are more likely to have existing answers in findable places.

    But if you're trapped using Windows, the 'conda' distribution of Python generally does a good job of installing optimized versions of the key libraries on Windows, so it can be a good choice. I especially like to start with the 'miniconda' variant, so that only the exact packages I explicitly need are installed into an environment.

    The Miniconda installation instructions and getting-started guide are both quite good. Generally, once you are in a conda environment, you can run conda install PACKAGENAME for major foundational packages like numpy or scipy, and still choose to run pip install PACKAGENAME for anything that's not in the conda repositories, or not as up-to-date there. (Sometimes it makes sense to get gensim from pip even if you are otherwise using a conda-based environment.)
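
Once everything is installed, one way to check whether the compiled code is actually being picked up is to look at the FAST_VERSION flag that gensim exposes in gensim.models.word2vec. The sketch below is only a rough check, not an official diagnostic: it assumes that a value of -1 means the slow pure-Python fallback is in use and that any non-negative value means the compiled routines loaded. The helper names describe_fast_version and get_fast_version are made up for this example.

```python
from __future__ import print_function  # keeps the sketch usable on Python 2.7 too


def describe_fast_version(fast_version):
    """Turn gensim's FAST_VERSION flag into a readable message.

    Assumption: -1 means the pure-Python fallback, and any value >= 0
    means the compiled (Cython) training routines were loaded.
    """
    if fast_version is None:
        return "gensim could not be imported in this environment"
    if fast_version < 0:
        return "slow pure-Python training path (compiled extensions not loaded)"
    return "optimized compiled training path (FAST_VERSION=%d)" % fast_version


def get_fast_version():
    """Return gensim's FAST_VERSION, or None if the import fails."""
    try:
        from gensim.models.word2vec import FAST_VERSION
    except (ImportError, RuntimeError):
        # gensim is missing, or this gensim version refuses to load
        # without its compiled extensions
        return None
    return FAST_VERSION


if __name__ == "__main__":
    print(describe_fast_version(get_fast_version()))
```

If this reports the slow path, the compiled extensions were probably not built or not found at install time, and reinstalling gensim inside the conda environment is a reasonable next step. Separately, numpy.show_config() prints which BLAS/LAPACK libraries numpy itself was built against, which can help confirm that the environment has an optimized BLAS at all.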