In my Fortran code, matrix multiplication is handled by 'dgemm' from the OpenBLAS library. The matrices are quite large, 7000 x 7000, so I want to reduce the computational cost of the multiplication.
I tried to call 'dgemm' with multiple threads, but it does not seem to work (it appears to run on a single thread only). I use the 'time' command to measure the run time. Whether or not I use the -lpthread flag, the run time is the same, so it seems to me that multi-threading is not working.
Below are my test.f and the compile command. Can you recommend a way to use multiple threads for my matrix multiplication? Sorry if this duplicates other questions or is too basic, but the existing Q&As have not worked for me. Thank you for any comments!
export OPENBLAS_LIB=/mypath/lib
export OPENBLAS_INC=/mypath/include
export OMP_NUM_THREADS=4
export GOTO_NUM_THREADS=4
export OPENBLAS_NUM_THREADS=4
gfortran test.f -o test.x -lopenblas -lpthread
sample source
      program test
      implicit none
      integer :: i, j, k
      integer :: m, n, num_threads
      double precision :: alpha, s
      double precision, allocatable :: aa(:,:), bb(:,:), cc(:,:)
      call openblas_set_num_threads(4)
      m=7000
      allocate(aa(m,m))
      allocate(bb(m,m))
      allocate(cc(m,m))
      aa=1.d0
      bb=2.d0
      cc=0.d0
      write(*,*) 'initialization over'
      ! calculate matrix multiplication using library
      alpha=1.d0
      call dgemm('N', 'N', m, m, m, alpha, aa, m, bb, m, alpha, cc, m)
      write(*,*) 'matrix multiplication over', cc(1,1), cc(m,m)
      endprogram test
Whatever number of threads you are trying to set in OMP_NUM_THREADS, OPENBLAS_NUM_THREADS, MKL_NUM_THREADS or whatever other environment variable, it does not matter at all. In your code you have
call openblas_set_num_threads(4)
and that takes priority, so you will always get those 4 threads if at all possible.
The -lpthread flag is, as far as I understand it, useless. The library is normally linked automatically, and if you get no linker error without the flag, it does not actually need to be linked explicitly.
In my tests your code always takes around 17 seconds to run, because of the call openblas_set_num_threads(4). When I changed it to 1, I got 25 seconds. It is a simple laptop and other stuff is running. The important thing is that the CPU usage also changes from 385% to 99%.
I use the default binary OpenBLAS included in OpenSUSE.
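If you would rather control the thread count from the shell, here is a rough sketch of a comparison script. It assumes you have removed the call openblas_set_num_threads(4) from test.f and recompiled, so that OPENBLAS_NUM_THREADS is what decides the thread count. The function name time_with_threads is made up for illustration, and date +%s only resolves whole seconds, so treat the numbers as a coarse check.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenBLAS): run a command once per
# thread count and print the elapsed wall-clock time in whole seconds.
# Assumes the program does NOT call openblas_set_num_threads itself,
# because that call would take priority over the environment variable.
#
# Example call: time_with_threads ./test.x 1 2 4
time_with_threads() {
    cmd=$1
    shift
    for n in "$@"; do
        start=$(date +%s)
        # The assignment applies to this single invocation only.
        OPENBLAS_NUM_THREADS=$n sh -c "$cmd"
        end=$(date +%s)
        echo "threads=$n elapsed=$((end - start))s"
    done
}
```

If the elapsed time stays the same for every thread count, check the CPU percentage reported by 'time' as well. A value near 100% means the run was effectively single-threaded.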