OpenBLAS slower than intrinsic function dot_product


I need to compute a dot product in Fortran. I can use either the intrinsic function dot_product or ddot from OpenBLAS. The problem is that ddot is slower. This is my code:

With BLAS:

program VectorBLAS
! time VectorBlas.e = 0.30s
implicit none
double precision, dimension(3)  :: b
double precision                :: result
double precision, external      :: ddot
integer, parameter              :: LargeInt_K = selected_int_kind (18)
integer (kind=LargeInt_K)        :: I

DO I = 1, 10000000
   b(:) = 3
   result = ddot(3, b, 1, b, 1)
END DO
end program VectorBLAS

With dot_product:

program VectorModule
! time VectorModule.e = 0.19s
implicit none
double precision, dimension (3)  :: b
double precision                 :: result
integer, parameter              :: LargeInt_K = selected_int_kind (18)
integer (kind=LargeInt_K)        :: I

DO I = 1, 10000000
  b(:) = 3
  result = dot_product(b, b)
END DO
end program VectorModule

The two codes are compiled using:

gfortran file_name.f90 -lblas -o file_name.e

What am I doing wrong? Shouldn't BLAS be faster?


Solution

  • While BLAS routines, especially those from optimized implementations, are generally faster for large arrays, the intrinsic functions are faster for small ones.

    This is especially visible in the linked source code of ddot, where extra work goes into additional functionality (e.g., non-unit increments). For short arrays, this overhead outweighs the gain from the optimizations.
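
    As a rough illustration of that overhead, here is a simplified sketch modeled on the general structure of the reference (Netlib) ddot. The name sketch_ddot is hypothetical, and optimized libraries such as OpenBLAS use their own architecture-specific kernels, so treat this only as a picture of the bookkeeping involved. For n = 3, the call has to test n, test both increments, compute a remainder, and set up two loops before it multiplies anything.

    ```fortran
    program ddot_sketch
      use, intrinsic :: iso_fortran_env, only: real64
      implicit none
      real(real64) :: b(3)

      b = 3.0_real64
      ! Both lines compute 3*3 + 3*3 + 3*3
      print *, sketch_ddot(3, b, 1, b, 1)
      print *, dot_product(b, b)

    contains

      ! Hypothetical, simplified stand-in for a reference-style ddot
      function sketch_ddot(n, x, incx, y, incy) result(s)
        integer, intent(in)      :: n, incx, incy
        real(real64), intent(in) :: x(*), y(*)
        real(real64)             :: s
        integer                  :: i, ix, iy, m

        s = 0.0_real64
        if (n <= 0) return

        if (incx == 1 .and. incy == 1) then
          ! Unit stride: handle the remainder first, then unroll by 5
          m = mod(n, 5)
          do i = 1, m
            s = s + x(i)*y(i)
          end do
          do i = m + 1, n, 5
            s = s + x(i)*y(i) + x(i+1)*y(i+1) + x(i+2)*y(i+2) &
                  + x(i+3)*y(i+3) + x(i+4)*y(i+4)
          end do
        else
          ! General stride: negative increments start from the far end
          ix = 1
          iy = 1
          if (incx < 0) ix = (-n + 1)*incx + 1
          if (incy < 0) iy = (-n + 1)*incy + 1
          do i = 1, n
            s = s + x(ix)*y(iy)
            ix = ix + incx
            iy = iy + incy
          end do
        end if
      end function sketch_ddot

    end program ddot_sketch
    ```

    Both print statements give 27. The intrinsic, by contrast, can be inlined by the compiler for a fixed-size array of three elements, which avoids the function call and all of the branching.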

    If you make your vectors (much) larger, the optimized version should be faster.

    Here is an example to illustrate this:

    program test
      use, intrinsic :: ISO_Fortran_env, only: REAL64
      implicit none
      integer                   :: t1, t2, rate, ttot1, ttot2, i
      real(REAL64), allocatable :: a(:),b(:),c(:)
      real(REAL64), external    :: ddot

      allocate( a(100000), b(100000), c(100000) )
      ! Clock ticks per second, used to convert tick counts to seconds
      call system_clock(count_rate=rate)

      ttot1 = 0 ; ttot2 = 0
      do i=1,1000
        ! Fresh random input on every iteration
        call random_number(a)
        call random_number(b)

        ! Time the intrinsic. Note: c is an array, so the scalar result is
        ! copied into every element; both timings include this cost.
        call system_clock(t1)
        c = dot_product(a,b)
        call system_clock(t2)
        ttot1 = ttot1 + t2 - t1

        ! Time the BLAS routine on the same data
        call system_clock(t1)
        c = ddot(100000,a,1,b,1)
        call system_clock(t2)
        ttot2 = ttot2 + t2 - t1
      enddo
      ! Accumulated wall-clock time in seconds
      print *,'dot_product: ', real(ttot1)/real(rate)
      print *,'BLAS, ddot:  ', real(ttot2)/real(rate)
    end program
    

    With vectors of 100,000 elements, ddot takes roughly 30% less time than dot_product here:

    OMP_NUM_THREADS=1 ./a.out 
     dot_product:   0.145999998    
     BLAS, ddot:    0.100000001
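
    To see where the crossover lies on a particular machine, the vector length can be swept while the total amount of work per measurement is held roughly constant. The following is a minimal sketch under a few assumptions: the program name, the chosen sizes, and the work budget are arbitrary, and it has to be linked against a BLAS library the same way as the programs in the question (e.g., with -lblas, or whatever library name your OpenBLAS installation provides). Timings depend heavily on compiler flags and on the BLAS build, so none are shown.

    ```fortran
    program sweep
      use, intrinsic :: iso_fortran_env, only: real64, int64
      implicit none
      integer, parameter        :: sizes(5) = [3, 30, 300, 3000, 30000]
      integer, parameter        :: total = 30000000  ! multiplications per measurement
      real(real64), allocatable :: a(:), b(:)
      real(real64), external    :: ddot
      real(real64)              :: acc1, acc2, td, tb
      integer(int64)            :: t1, t2, rate
      integer                   :: k, n, r, reps

      call system_clock(count_rate=rate)

      do k = 1, size(sizes)
        n = sizes(k)
        reps = total / n
        allocate(a(n), b(n))
        call random_number(a)
        call random_number(b)

        ! Intrinsic: a(1) changes every pass so the call cannot be hoisted
        acc1 = 0.0_real64
        call system_clock(t1)
        do r = 1, reps
          a(1) = real(r, real64)
          acc1 = acc1 + dot_product(a, b)
        end do
        call system_clock(t2)
        td = real(t2 - t1, real64) / real(rate, real64)

        ! BLAS: identical work, same modification of a(1)
        acc2 = 0.0_real64
        call system_clock(t1)
        do r = 1, reps
          a(1) = real(r, real64)
          acc2 = acc2 + ddot(n, a, 1, b, 1)
        end do
        call system_clock(t2)
        tb = real(t2 - t1, real64) / real(rate, real64)

        print '(a,i7,a,f9.4,a,f9.4)', 'n =', n, '  dot_product [s]:', td, '  ddot [s]:', tb
        ! Printing the sums keeps the compiler from discarding the loops
        print *, '  checksums: ', acc1, acc2
        deallocate(a, b)
      end do
    end program sweep
    ```

    The two checksums on each line should agree up to rounding, since both loops sum the same products. The length at which ddot starts to win is the practical threshold for switching from the intrinsic to BLAS in your own code.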