Can OpenBLAS do multithreaded matrix multiplication in Fortran?


In my Fortran code, matrix multiplication is handled by 'dgemm' from the OpenBLAS library. The matrices are quite big, 7000 x 7000, so I want to reduce the computational cost of the matrix multiplication.

I tried to call 'dgemm' with multiple threads, but it does not seem to work (it runs single-threaded only). I use the 'time' command to measure the run time. Regardless of whether I pass the -lpthread flag, the calculation time is the same, so it seems that multithreading is not working.

Below are my test.f and compile command. Can you recommend a way to use multiple threads in my matrix multiplication? Sorry if this duplicates other questions or is too simple and fundamental, but the existing Q&As are not working for me. Thank you for any comments!

  • In .bashrc:

export OPENBLAS_LIB=/mypath/lib

export OPENBLAS_INC=/mypath/include

export OMP_NUM_THREADS=4

export GOTO_NUM_THREADS=4

export OPENBLAS_NUM_THREADS=4

  • Compile command:

gfortran test.f -o test.x -lopenblas -lpthread

  • Sample source (test.f):

      program test

      implicit none

      integer :: m
      double precision :: alpha
      double precision, allocatable :: aa(:,:), bb(:,:), cc(:,:)

      ! request 4 threads from OpenBLAS
      call openblas_set_num_threads(4)

      m=7000

      allocate(aa(m,m))
      allocate(bb(m,m))
      allocate(cc(m,m))
      aa=1.d0
      bb=2.d0
      cc=0.d0

      write(*,*) 'initialization over'

      ! calculate matrix multiplication using library
      ! (alpha is also passed as beta; cc starts at zero)
      alpha=1.d0
      call dgemm('N', 'N', m, m, m, alpha, aa, m, bb, m, alpha, cc, m)

      write(*,*) 'matrix multiplication over', cc(1,1), cc(m,m)

      endprogram test
    

Solution

  • It does not matter at all what number of threads you set in OMP_NUM_THREADS, OPENBLAS_NUM_THREADS, MKL_NUM_THREADS or any other environment variable. In your code you have

    call openblas_set_num_threads(4)
    

    and that call takes priority, so you will always get those 4 threads if at all possible.

    The -lpthread flag is, as far as I understand it, unnecessary. The library is normally linked automatically, and if you get no linker error, it is not actually required to be linked explicitly.

    In my tests, your code always takes around 17 seconds to run because of the call openblas_set_num_threads(4). When I changed the argument to 1, it took 25 seconds. It is a simple laptop and other stuff is running. The important thing is that CPU usage also changes from 385% to 99%.

    I use the default binary OpenBLAS package included in openSUSE.
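
To check the effect of the thread count from inside the program rather than with the external 'time' command, a minimal sketch along the following lines may help. It times the same dgemm call with 1 and then 4 threads using the wall clock. The program name time_dgemm, the smaller size m = 2000 and the variable names are my own choices for illustration, not part of the original test, and it assumes OpenBLAS is linked so that openblas_set_num_threads is available, as in the question.

```fortran
      program time_dgemm
      ! Sketch: wall-clock time of dgemm with 1 and 4 OpenBLAS threads.
      implicit none
      integer, parameter :: m = 2000
      integer :: nt(2), it
      integer(kind=8) :: t0, t1, rate
      double precision :: elapsed
      double precision, allocatable :: aa(:,:), bb(:,:), cc(:,:)

      ! thread counts to compare (illustrative values)
      nt(1) = 1
      nt(2) = 4

      allocate(aa(m,m), bb(m,m), cc(m,m))
      aa = 1.d0
      bb = 2.d0

      do it = 1, 2
         ! OpenBLAS-specific routine, same call as in the question
         call openblas_set_num_threads(nt(it))
         cc = 0.d0
         ! system_clock measures elapsed (wall) time, not CPU time
         call system_clock(t0, rate)
         call dgemm('N','N',m,m,m,1.d0,aa,m,bb,m,0.d0,cc,m)
         call system_clock(t1)
         elapsed = dble(t1 - t0) / dble(rate)
         write(*,*) 'threads:', nt(it), ' wall seconds:', elapsed
      end do

      end program time_dgemm
```

Saved under a hypothetical name such as time_dgemm.f, it should build the same way as the test program, for example with gfortran time_dgemm.f -o time_dgemm.x -lopenblas. If threading is active, the second wall time should be noticeably shorter than the first, though the ratio depends on the machine and on whatever else is running.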