I'm currently working on the Python implementation of a coupled HMM which involves the element-wise multiplication, dot product and sum of ndarrays of dimension (50,50,40,40,40) and ndarrays of dimension (40,40,40,40). That is to say, very large... The second ndarray is more or less sparse with about 70% of zeros.
I've been using numpy.einsum until now, and it has given me quite slow results (the algorithm takes about 3h to run). Now the problem is that I need to optimize some of the parameters for my HMM, which means that I will have to make at least 10000 runs, and thus I would need a speed increase of a factor at least 1000 in order to keep things reasonable.
I've been looking around to find the best way to do large arrays operations in Python, and now I'm quite lost with all that I have read. So I have several questions. Before asking them, I just want to specify that I'm using the last anaconda distribution with Python 3 on a machine with OSX, an intel processor and a nvidia GPU. Also, I believe that I can flatten my ndarrays to simplify my problem to a simple matrix-matrix problem.
1)It seems that BLAS/MKL libraries can give pretty good increases. It also seems that MKL is natively linked to Python when using Ananaconda with OSX. Therefore, have I been using MKL until now, without knowing it? np.config.show() gives me stuff like that:
openblas_lapack_info:
NOT AVAILABLE
blas_opt_info:
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['//anaconda/include']
libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
library_dirs = ['//anaconda/lib']
lapack_opt_info:
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['//anaconda/include']
libraries = ['mkl_lapack95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
library_dirs = ['//anaconda/lib']
lapack_mkl_info:
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['//anaconda/include']
libraries = ['mkl_lapack95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
library_dirs = ['//anaconda/lib']
blas_mkl_info:
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['//anaconda/include']
libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
library_dirs = ['//anaconda/lib']
mkl_info:
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['//anaconda/include']
libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
library_dirs = ['//anaconda/lib']
Does that really gives a speed increase with the einsum function? If not, how can I manage to get access to, for instance, the gemm function of MKL from Python?
2) Would any of these Python libraries be useful for my problem: Theanos, Numba, Cupy? Or simply Anaconda Accelerate? Would cuBlas, from Anaconda Accelerate, be the best option for me (namely that I'm wiling to use single or half float precision)?
3)Would that be useful recoding my algorithm in, for instance, C or C++?
4)I tried implementing my algorithm using sparse matrices (scipy.sparse.csc_matrix) but it made my program a lot slower. Was that expectable or did I make a mistake?
I know this makes many questions, but this is the first time I'm confronted to this kind of problem and the internet is not really clear on that...
Thanks a lot!
Ok so as I've been spending quite a lot of time these last days to investigate my problem, I may be able to give some answers to people as lost as me. So:
1) Yes, MKL is natively installed by default with anaconda (this hasn't always been the case though). However, np.einsum doesn't gain from it as it doesn't make use of the BLAS optimized dot numpy function. Therefore, trying to manually access the gemm function from mkl is useless (even if that's actually quite easy thanks to the library anaconda accelerate). It's way easier to rewrite the operations of np.einsum using np.tensordot, np.multiply and others.
2) I didn't have time to explore all of the libraries I asked about, but a tensordot Theanos operation is clearly not faster than a simple np.tensordot. Similarly, this wasn't the case a few years ago, because the np.tensordot operation wasn't using BLAS for multidimensionnal arrays. Concerning all the cuda libraries, it would seem that they are actually not that great comparing to MKL, except if you have a very powerful CGU (see for instance https://larsjuhljensen.wordpress.com/2011/01/28/commentary-the-gpu-computing-fallacy/ )
3) Rewriting some Python code optimized with MKL into C++ doesn't give much of an increase (see Benchmarking (python vs. c++ using BLAS) and (numpy) )
4) Still have no idea why my implementation with sparse matrices was slower than with dense matrices. It is very possible that I made something wrong.