Tags: fortran, openmp, gfortran, blas

OpenMP with BLAS


related question

I tried to extend the code in the answer of the above link to include cross-checks and OpenMP.

Program reshape_for_blas

  Use, Intrinsic :: iso_fortran_env, Only :  wp => real64, li => int64

  Implicit None

  Real( wp ), Dimension( :, :    ), Allocatable :: a
  Real( wp ), Dimension( :, :, : ), Allocatable :: b
  Real( wp ), Dimension( :, :, : ), Allocatable :: c1, c2, c3, c4, c5
  Real( wp ), Dimension( :, :    ), Allocatable :: d
  Real( wp ), Dimension( :, :    ), Allocatable :: e

  Integer :: na, nb, nc, nd, ne
  Integer :: la, lb, lc, ld  
  Integer( li ) :: start, finish, rate, numthreads

  numthreads = 2

  call omp_set_num_threads(numthreads)  
  
  
  Write( *, * ) 'na, nb, nc, nd ?'
  Read( *, * ) na, nb, nc, nd
  ne = nc * nd
  Allocate( a ( 1:na, 1:nb ) ) 
  Allocate( b ( 1:nb, 1:nc, 1:nd ) ) 
  Allocate( c1( 1:na, 1:nc, 1:nd ) ) 
  Allocate( c2( 1:na, 1:nc, 1:nd ) ) 
  Allocate( c3( 1:na, 1:nc, 1:nd ) )
  Allocate( c4( 1:na, 1:nc, 1:nd ) )
  Allocate( c5( 1:na, 1:nc, 1:nd ) )  
  Allocate( d ( 1:nb, 1:ne ) ) 
  Allocate( e ( 1:na, 1:ne ) ) 

  ! Set up some data
  Call Random_number( a )
  Call Random_number( b )

  ! With reshapes
  Call System_clock( start, rate )
  
  !write (*,*) 'clock', start, rate
  
  
  d = Reshape( b, Shape( d ) )
  Call dgemm( 'N', 'N', na, ne, nb, 1.0_wp, a, Size( a, Dim = 1 ), &
                                            d, Size( d, Dim = 1 ), &
                                    0.0_wp, e, Size( e, Dim = 1 ) )
  c1 = Reshape( e, Shape( c1 ) )
  Call System_clock( finish, rate )
  
  !write (*,*) 'clock', finish, rate
  
  
  
  Write( *, * ) 'Time for reshaping method ', Real( finish - start, wp ) / rate
  Write( *, * ) 'Difference between result matrices ', Maxval( Abs( c1 - c2 ) )
  
  
  ! Direct
  Call System_clock( start, rate )
  Call dgemm( 'N', 'N', na, ne, nb, 1.0_wp, a , Size( a , Dim = 1 ), &
                                            b , Size( b , Dim = 1 ), &
                                            0.0_wp, c2, Size( c2, Dim = 1 ) )
  Call System_clock( finish, rate )
  Write( *, * ) 'Time for straight  method ', Real( finish - start, wp ) / rate
    
    
  
  Call System_clock( start, rate )
  
  !$omp parallel
  ! Direct
  Call dgemm( 'N', 'N', na, ne, nb, 1.0_wp, a , Size( a , Dim = 1 ), &
                                            b , Size( b , Dim = 1 ), &
                                            0.0_wp, c4, Size( c4, Dim = 1 ) )
  !$omp end parallel
  Call System_clock( finish, rate )  
  Write( *, * ) 'Time for straight  method omp', Real( finish - start, wp ) / rate
    
  
 
  !naive
  Call System_clock( start, rate )

  do la = 1, na 
    do lc = 1, nc
      do ld = 1, nd
         c3(la,lc,ld) = 0.0_wp
      enddo
    enddo
  enddo
  
  do la = 1, na 
    do lb = 1, nb
      do lc = 1, nc
        do ld = 1, nd
          c3(la,lc,ld) = c3(la,lc,ld)  + a(la,lb) * b(lb, lc, ld)
        enddo  
      enddo
    enddo
  enddo  
 
  
  Call System_clock( finish, rate )
  Write( *, * ) 'Time for loop', Real( finish - start, wp ) / rate  
 
   

  !naive omp
  Call System_clock( start, rate )  
  !$omp parallel

  do la = 1, na 
    do lc = 1, nc
      do ld = 1, nd
         c5(la,lc,ld) = 0.0_wp
      enddo
    enddo
  enddo
  
  !$omp do private(la, lb, lc, ld) schedule(dynamic) reduction(+: c5)    
  do la = 1, na 
    do lb = 1, nb
      do lc = 1, nc
        do ld = 1, nd
          c5(la,lc,ld) = c5(la,lc,ld)  + a(la,lb) * b(lb, lc, ld)
        enddo  
      enddo
    enddo
  enddo  
  !$omp end do  
  !$omp end parallel
 
  
  Call System_clock( finish, rate )
  Write( *, * ) 'Time for loop omp', Real( finish - start, wp ) / rate  
 
  
  
  
  
  do la = 1, na 
    do lc = 1, nc
      do ld = 1, nd
         
         if ( dabs(c3(la,lc,ld) - c1(la,lc,ld))  > 1.e-6 ) then 
           write (*,*) '!!! c1', c3(la,lc,ld) - c1(la,lc,ld)
         endif         
         
         
         if ( dabs(c3(la,lc,ld) - c2(la,lc,ld))  > 1.e-6 ) then 
           write (*,*) '!!! c2', c3(la,lc,ld) - c2(la,lc,ld)
         endif
         
         if ( dabs(c3(la,lc,ld) - c4(la,lc,ld))  > 1.e-6 ) then 
           write (*,*) '!!! c4', la,lc,ld, c3(la,lc,ld) - c4(la,lc,ld)
         endif
         
         if ( dabs(c3(la,lc,ld) - c5(la,lc,ld))  > 1.e-6 ) then 
           write (*,*) '!!! c5', la,lc,ld, c3(la,lc,ld) - c5(la,lc,ld)
         endif         
      enddo
    enddo
  enddo  
  
End Program reshape_for_blas

I ran into two issues:

  1. There is no significant speed-up in either the BLAS or the naive loops. E.g., compiling with gfortran -std=f2008 -Wall -Wextra -fcheck=all reshape.f90 -lblas -fopenmp and using the input 30 30 30 60, I got
30 30 30 60
 Time for reshaping method    2.9443999999999998E-003
 Difference between result matrices    12.380937791257775
 Time for straight  method    1.0016000000000001E-003
 Time for straight  method omp   2.4878000000000001E-003
 Time for loop   6.6072500000000006E-002
 Time for loop omp  0.100242600000000002
  2. When the dimensions get larger, e.g., with the input 60 60 60 60, the OpenMP BLAS result can differ from the naive loop; it seems I am missing some control option.

What would be the problems with OpenMP here?

Edit: I added OpenMP lines to the initialization in the c5 section and commented out two printing lines:


Program reshape_for_blas

  Use, Intrinsic :: iso_fortran_env, Only :  wp => real64, li => int64

  Implicit None

  Real( wp ), Dimension( :, :    ), Allocatable :: a
  Real( wp ), Dimension( :, :, : ), Allocatable :: b
  Real( wp ), Dimension( :, :, : ), Allocatable :: c1, c2, c3, c4, c5
  Real( wp ), Dimension( :, :    ), Allocatable :: d
  Real( wp ), Dimension( :, :    ), Allocatable :: e

  Integer :: na, nb, nc, nd, ne
  Integer :: la, lb, lc, ld  
  Integer( li ) :: start, finish, rate, numthreads

  numthreads = 2

  call omp_set_num_threads(numthreads)  
  
  
  Write( *, * ) 'na, nb, nc, nd ?'
  Read( *, * ) na, nb, nc, nd
  ne = nc * nd
  Allocate( a ( 1:na, 1:nb ) ) 
  Allocate( b ( 1:nb, 1:nc, 1:nd ) ) 
  Allocate( c1( 1:na, 1:nc, 1:nd ) ) 
  Allocate( c2( 1:na, 1:nc, 1:nd ) ) 
  Allocate( c3( 1:na, 1:nc, 1:nd ) )
  Allocate( c4( 1:na, 1:nc, 1:nd ) )
  Allocate( c5( 1:na, 1:nc, 1:nd ) )  
  Allocate( d ( 1:nb, 1:ne ) ) 
  Allocate( e ( 1:na, 1:ne ) ) 

  ! Set up some data
  Call Random_number( a )
  Call Random_number( b )

  ! With reshapes
  Call System_clock( start, rate )
  
  !write (*,*) 'clock', start, rate
  
  
  d = Reshape( b, Shape( d ) )
  Call dgemm( 'N', 'N', na, ne, nb, 1.0_wp, a, Size( a, Dim = 1 ), &
                                            d, Size( d, Dim = 1 ), &
                                    0.0_wp, e, Size( e, Dim = 1 ) )
  c1 = Reshape( e, Shape( c1 ) )
  Call System_clock( finish, rate )
  
  !write (*,*) 'clock', finish, rate
  
  
  
  Write( *, * ) 'Time for reshaping method ', Real( finish - start, wp ) / rate
  Write( *, * ) 'Difference between result matrices ', Maxval( Abs( c1 - c2 ) )
  
  
  ! Direct
  Call System_clock( start, rate )
  Call dgemm( 'N', 'N', na, ne, nb, 1.0_wp, a , Size( a , Dim = 1 ), &
                                            b , Size( b , Dim = 1 ), &
                                            0.0_wp, c2, Size( c2, Dim = 1 ) )
  Call System_clock( finish, rate )
  Write( *, * ) 'Time for straight  method ', Real( finish - start, wp ) / rate
    
    

  !naive loop
  Call System_clock( start, rate )

  do la = 1, na 
    do lc = 1, nc
      do ld = 1, nd
         c3(la,lc,ld) = 0.0_wp
      enddo
    enddo
  enddo
  
  do la = 1, na 
    do lb = 1, nb
      do lc = 1, nc
        do ld = 1, nd
          c3(la,lc,ld) = c3(la,lc,ld)  + a(la,lb) * b(lb, lc, ld)
        enddo  
      enddo
    enddo
  enddo  
 
  
  Call System_clock( finish, rate )
  Write( *, * ) 'Time for loop', Real( finish - start, wp ) / rate  
   



  !dgemm omp 
  Call System_clock( start, rate )
  
  !$omp parallel
  ! Direct
  Call dgemm( 'N', 'N', na, ne, nb, 1.0_wp, a , Size( a , Dim = 1 ), &
                                            b , Size( b , Dim = 1 ), &
                                            0.0_wp, c4, Size( c4, Dim = 1 ) )
  !$omp end parallel
  Call System_clock( finish, rate )  
  Write( *, * ) 'Time for straight  method omp', Real( finish - start, wp ) / rate
    
  
 

  !loop omp
  Call System_clock( start, rate )  
  !$omp parallel

  do la = 1, na 
    do lc = 1, nc
      do ld = 1, nd
         c5(la,lc,ld) = 0.0_wp
      enddo
    enddo
  enddo
  
  !$omp do private(la, lb, lc, ld) schedule(dynamic) reduction(+: c5)    
  do la = 1, na 
    do lb = 1, nb
      do lc = 1, nc
        do ld = 1, nd
          c5(la,lc,ld) = c5(la,lc,ld)  + a(la,lb) * b(lb, lc, ld)
        enddo  
      enddo
    enddo
  enddo  
  !$omp end do  
  !$omp end parallel
 
  
  Call System_clock( finish, rate )
  Write( *, * ) 'Time for loop omp', Real( finish - start, wp ) / rate  
 
  


!single core: c1 c2 c3
! c1 reshape blas
! c2 blas
! c3 naive loop (reference)
! parallel: c4 c5
! c4 dgemm parallel
! c5 naive loop parallel


  do la = 1, na 
    do lc = 1, nc
      do ld = 1, nd
         
         if ( dabs(c3(la,lc,ld) - c1(la,lc,ld))  > 1.e-6 ) then 
           write (*,*) '!!! c1', c3(la,lc,ld) - c1(la,lc,ld)
         endif         
         
         
         if ( dabs(c3(la,lc,ld) - c2(la,lc,ld))  > 1.e-6 ) then 
           write (*,*) '!!! c2', c3(la,lc,ld) - c2(la,lc,ld)
         endif
         
         if ( dabs(c3(la,lc,ld) - c4(la,lc,ld))  > 1.e-6 ) then 
           write (*,*) '!!! c4', la,lc,ld, c3(la,lc,ld) - c4(la,lc,ld)
         endif
         
         if ( dabs(c3(la,lc,ld) - c5(la,lc,ld))  > 1.e-6 ) then 
           write (*,*) '!!! c5', la,lc,ld, c3(la,lc,ld) - c5(la,lc,ld)
         endif         
      enddo
    enddo
  enddo  
  
End Program reshape_for_blas

Then gfortran reshape.f90 -lblas -fopenmp with the input 30 30 30 30 leads to

 Time for reshaping method    1.3519000000000001E-003
 Difference between result matrices    12.380937791257775
 Time for straight  method    6.2549999999999997E-004
 Time for straight  method omp   1.2600000000000001E-003
 Time for naive loop   1.0008599999999999E-002
 Time for naive loop omp   1.6678999999999999E-002

Not a good speed-up, though.


Solution

  • You are calling DGEMM in parallel using the same set of variables (because variables in parallel regions are shared by default in Fortran). This doesn't work and produces weird results due to data races. You have two options:

    • Find a parallel BLAS implementation where DGEMM is already threaded. Intel MKL and OpenBLAS are prime candidates. Intel MKL uses OpenMP; more specifically, it is built with the Intel OpenMP runtime, so it may not play very nicely with OpenMP code compiled with GCC, but it works perfectly with non-threaded code.

    • Call DGEMM in parallel but not with the same set of arguments. Instead, perform block decomposition of one or both tensors and have each thread do the contraction for a separate block. Since Fortran uses column-major storage, it may be appropriate to decompose the second tensor:

      C[i,k,l=1..L] = A[i,j] * B[j,k,l=1..L]
      

      becomes with two threads:

      thread 0: C[i,k,l=1..L/2] = A[i,j] * B[j,k,l=1..L/2]
      thread 1: C[i,k,l=L/2+1..L] = A[i,j] * B[j,k,l=L/2+1..L]
      

      With an arbitrary number of threads it boils down to computing the start and end values for the l index in each thread and adjusting the arguments of DGEMM accordingly.

    Personally, I'd go with a parallel BLAS implementation. With Intel MKL, you only need to link in the parallel driver and it will automagically use all the available CPUs (a command-line sketch of both options is given at the end of this answer).

    A sample implementation of the block decomposition follows. Only additions and changes to your original code are shown:

      ! ADD: Use the OpenMP module
      Use :: omp_lib
    
      ! ADD: Variables used for the decomposition
      Integer :: ithr, istart, iend
    
      ! CHANGE: OpenMP with block decomposition
      !$omp parallel private(ithr, istart, iend)
        ithr = omp_get_thread_num()
    
        ! First index (plane) in B for the current thread
        istart = ithr * nd / omp_get_num_threads()
        ! First index (plane) in B for the next thread
        iend = (ithr + 1) * nd / omp_get_num_threads()
    
        Call dgemm('N', 'N', na, nc * (iend - istart), nb, 1.0_wp, a, Size(a, Dim = 1), &
                   b(1, 1, 1 + istart), Size(b, Dim = 1), &
                   0.0_wp, c4(1, 1, 1 + istart), Size(c4, Dim = 1))
      !$omp end parallel
    

    istart is the index of the first plane of B that each individual thread works on. iend is the first plane for the next thread, so iend - istart is the number of planes for the current thread. b(1, 1, 1 + istart) is where the block of planes in B starts. c4(1, 1, 1 + istart) is where the block in the result tensor starts.

    Make sure you do one of those but not both simultaneously. I.e., if your BLAS implementation is threaded but you decide to go with block decomposition, disable threading in the BLAS library. Conversely, if you use threading in the BLAS implementation, do not perform block decomposition in your code.
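
    As a concrete illustration of both options, here is a hedged command-line sketch rather than a definitive recipe: the reference BLAS linked with -lblas in the question is typically single-threaded, and the library name -lopenblas as well as the environment variables OPENBLAS_NUM_THREADS and OMP_NUM_THREADS are assumptions about a typical OpenBLAS setup, so adapt them to your installation (with Intel MKL the analogous variable is MKL_NUM_THREADS).

      # Option 1: let a threaded BLAS parallelize DGEMM; no !$omp around the call
      gfortran reshape.f90 -lopenblas
      OPENBLAS_NUM_THREADS=2 ./a.out

      # Option 2: your own OpenMP block decomposition; keep the BLAS serial
      gfortran reshape.f90 -fopenmp -lopenblas
      OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=2 ./a.out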