What's the state of the art with regards to getting numpy
to use mutliple cores (on Intel hardware) for things like inner and outer vector products, vector-matrix multiplications etc?
I am happy to rebuild numpy
if necessary, but at this point I am looking at ways to speed things up without changing my code.
For reference, my show_config()
is as follows, and I've never observed numpy
to use more than one core:
atlas_threads_info:
libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
library_dirs = ['/usr/local/atlas-3.9.16/lib']
language = f77
include_dirs = ['/usr/local/atlas-3.9.16/include']
blas_opt_info:
libraries = ['ptf77blas', 'ptcblas', 'atlas']
library_dirs = ['/usr/local/atlas-3.9.16/lib']
define_macros = [('ATLAS_INFO', '"\\"3.9.16\\""')]
language = c
include_dirs = ['/usr/local/atlas-3.9.16/include']
atlas_blas_threads_info:
libraries = ['ptf77blas', 'ptcblas', 'atlas']
library_dirs = ['/usr/local/atlas-3.9.16/lib']
language = c
include_dirs = ['/usr/local/atlas-3.9.16/include']
lapack_opt_info:
libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
library_dirs = ['/usr/local/atlas-3.9.16/lib']
define_macros = [('ATLAS_INFO', '"\\"3.9.16\\""')]
language = f77
include_dirs = ['/usr/local/atlas-3.9.16/include']
lapack_mkl_info:
NOT AVAILABLE
blas_mkl_info:
NOT AVAILABLE
mkl_info:
NOT AVAILABLE
You should probably start by checking whether the Atlas build that numpy is using has been built with multi-threading. You can build and run this to inspect the Atlas configuration (straight from the Atlas FAQ):
main()
/*
* Compile, link and run with something like:
* gcc -o xprint_buildinfo -L[ATLAS lib dir] -latlas ; ./xprint_buildinfo
* if link fails, you are using ATLAS version older than 3.3.6.
*/
{
void ATL_buildinfo(void);
ATL_buildinfo();
exit(0);
}
If you have don't have a multithreaded version of Atlas: "there's your problem". If it is multithreaded, then you need to exercise one of the multithreaded BLAS3 routines (probably dgemm), with a suitably large matrix-matrix product and see whether threading is used. I think I am right in saying that neither BLAS 2 and BLAS 1 routines in Atlas support multithreading (and with good reason because there is no performance advantage except at truly enormous problem sizes).