I tried to extend the code in the answer linked above to include cross checks and OpenMP.
Program reshape_for_blas
Use, Intrinsic :: iso_fortran_env, Only : wp => real64, li => int64
Implicit None
Real( wp ), Dimension( :, : ), Allocatable :: a
Real( wp ), Dimension( :, :, : ), Allocatable :: b
Real( wp ), Dimension( :, :, : ), Allocatable :: c1, c2, c3, c4, c5
Real( wp ), Dimension( :, : ), Allocatable :: d
Real( wp ), Dimension( :, : ), Allocatable :: e
Integer :: na, nb, nc, nd, ne
Integer :: la, lb, lc, ld
Integer( li ) :: start, finish, rate, numthreads
numthreads = 2
call omp_set_num_threads(numthreads)
Write( *, * ) 'na, nb, nc, nd ?'
Read( *, * ) na, nb, nc, nd
ne = nc * nd
Allocate( a ( 1:na, 1:nb ) )
Allocate( b ( 1:nb, 1:nc, 1:nd ) )
Allocate( c1( 1:na, 1:nc, 1:nd ) )
Allocate( c2( 1:na, 1:nc, 1:nd ) )
Allocate( c3( 1:na, 1:nc, 1:nd ) )
Allocate( c4( 1:na, 1:nc, 1:nd ) )
Allocate( c5( 1:na, 1:nc, 1:nd ) )
Allocate( d ( 1:nb, 1:ne ) )
Allocate( e ( 1:na, 1:ne ) )
! Set up some data
Call Random_number( a )
Call Random_number( b )
! With reshapes
Call System_clock( start, rate )
!write (*,*) 'clock', start, rate
d = Reshape( b, Shape( d ) )
Call dgemm( 'N', 'N', na, ne, nb, 1.0_wp, a, Size( a, Dim = 1 ), &
d, Size( d, Dim = 1 ), &
0.0_wp, e, Size( e, Dim = 1 ) )
c1 = Reshape( e, Shape( c1 ) )
Call System_clock( finish, rate )
!write (*,*) 'clock', finish, rate
Write( *, * ) 'Time for reshaping method ', Real( finish - start, wp ) / rate
Write( *, * ) 'Difference between result matrices ', Maxval( Abs( c1 - c2 ) )
! Direct
Call System_clock( start, rate )
Call dgemm( 'N', 'N', na, ne, nb, 1.0_wp, a , Size( a , Dim = 1 ), &
b , Size( b , Dim = 1 ), &
0.0_wp, c2, Size( c2, Dim = 1 ) )
Call System_clock( finish, rate )
Write( *, * ) 'Time for straight method ', Real( finish - start, wp ) / rate
Call System_clock( start, rate )
!$omp parallel
! Direct
Call dgemm( 'N', 'N', na, ne, nb, 1.0_wp, a , Size( a , Dim = 1 ), &
b , Size( b , Dim = 1 ), &
0.0_wp, c4, Size( c4, Dim = 1 ) )
!$omp end parallel
Call System_clock( finish, rate )
Write( *, * ) 'Time for straight method omp', Real( finish - start, wp ) / rate
!naive
Call System_clock( start, rate )
do la = 1, na
do lc = 1, nc
do ld = 1, nd
c3(la,lc,ld) = 0.0_wp
enddo
enddo
enddo
do la = 1, na
do lb = 1, nb
do lc = 1, nc
do ld = 1, nd
c3(la,lc,ld) = c3(la,lc,ld) + a(la,lb) * b(lb, lc, ld)
enddo
enddo
enddo
enddo
Call System_clock( finish, rate )
Write( *, * ) 'Time for loop', Real( finish - start, wp ) / rate
!naive omp
Call System_clock( start, rate )
!$omp parallel
do la = 1, na
do lc = 1, nc
do ld = 1, nd
c5(la,lc,ld) = 0.0_wp
enddo
enddo
enddo
!$omp do private(la, lb, lc, ld) schedule(dynamic) reduction(+: c5)
do la = 1, na
do lb = 1, nb
do lc = 1, nc
do ld = 1, nd
c5(la,lc,ld) = c5(la,lc,ld) + a(la,lb) * b(lb, lc, ld)
enddo
enddo
enddo
enddo
!$omp end do
!$omp end parallel
Call System_clock( finish, rate )
Write( *, * ) 'Time for loop omp', Real( finish - start, wp ) / rate
do la = 1, na
do lc = 1, nc
do ld = 1, nd
if ( dabs(c3(la,lc,ld) - c1(la,lc,ld)) > 1.e-6 ) then
write (*,*) '!!! c1', c3(la,lc,ld) - c1(la,lc,ld)
endif
if ( dabs(c3(la,lc,ld) - c2(la,lc,ld)) > 1.e-6 ) then
write (*,*) '!!! c2', c3(la,lc,ld) - c2(la,lc,ld)
endif
if ( dabs(c3(la,lc,ld) - c4(la,lc,ld)) > 1.e-6 ) then
write (*,*) '!!! c4', la,lc,ld, c3(la,lc,ld) - c4(la,lc,ld)
endif
if ( dabs(c3(la,lc,ld) - c5(la,lc,ld)) > 1.e-6 ) then
write (*,*) '!!! c5', la,lc,ld, c3(la,lc,ld) - c5(la,lc,ld)
endif
enddo
enddo
enddo
End Program reshape_for_blas
I got two issues. With

gfortran -std=f2008 -Wall -Wextra -fcheck=all reshape.f90 -lblas -fopenmp

and input 30 30 30 60, I got
Time for reshaping method 2.9443999999999998E-003
Difference between result matrices 12.380937791257775
Time for straight method 1.0016000000000001E-003
Time for straight method omp 2.4878000000000001E-003
Time for loop 6.6072500000000006E-002
Time for loop omp 0.100242600000000002
so the OpenMP versions are actually slower than the serial ones. And with 60 60 60 60 as input, the OpenMP BLAS result can come out different from the naive loop, so it seems I am missing some control option. What would be the problems with OpenMP here?
Edit

I added OpenMP lines to the initialization in the c5 section and commented out two printing lines:
Program reshape_for_blas
Use, Intrinsic :: iso_fortran_env, Only : wp => real64, li => int64
Implicit None
Real( wp ), Dimension( :, : ), Allocatable :: a
Real( wp ), Dimension( :, :, : ), Allocatable :: b
Real( wp ), Dimension( :, :, : ), Allocatable :: c1, c2, c3, c4, c5
Real( wp ), Dimension( :, : ), Allocatable :: d
Real( wp ), Dimension( :, : ), Allocatable :: e
Integer :: na, nb, nc, nd, ne
Integer :: la, lb, lc, ld
Integer( li ) :: start, finish, rate, numthreads
numthreads = 2
call omp_set_num_threads(numthreads)
Write( *, * ) 'na, nb, nc, nd ?'
Read( *, * ) na, nb, nc, nd
ne = nc * nd
Allocate( a ( 1:na, 1:nb ) )
Allocate( b ( 1:nb, 1:nc, 1:nd ) )
Allocate( c1( 1:na, 1:nc, 1:nd ) )
Allocate( c2( 1:na, 1:nc, 1:nd ) )
Allocate( c3( 1:na, 1:nc, 1:nd ) )
Allocate( c4( 1:na, 1:nc, 1:nd ) )
Allocate( c5( 1:na, 1:nc, 1:nd ) )
Allocate( d ( 1:nb, 1:ne ) )
Allocate( e ( 1:na, 1:ne ) )
! Set up some data
Call Random_number( a )
Call Random_number( b )
! With reshapes
Call System_clock( start, rate )
!write (*,*) 'clock', start, rate
d = Reshape( b, Shape( d ) )
Call dgemm( 'N', 'N', na, ne, nb, 1.0_wp, a, Size( a, Dim = 1 ), &
d, Size( d, Dim = 1 ), &
0.0_wp, e, Size( e, Dim = 1 ) )
c1 = Reshape( e, Shape( c1 ) )
Call System_clock( finish, rate )
!write (*,*) 'clock', finish, rate
Write( *, * ) 'Time for reshaping method ', Real( finish - start, wp ) / rate
Write( *, * ) 'Difference between result matrices ', Maxval( Abs( c1 - c2 ) )
! Direct
Call System_clock( start, rate )
Call dgemm( 'N', 'N', na, ne, nb, 1.0_wp, a , Size( a , Dim = 1 ), &
b , Size( b , Dim = 1 ), &
0.0_wp, c2, Size( c2, Dim = 1 ) )
Call System_clock( finish, rate )
Write( *, * ) 'Time for straight method ', Real( finish - start, wp ) / rate
!naive loop
Call System_clock( start, rate )
do la = 1, na
do lc = 1, nc
do ld = 1, nd
c3(la,lc,ld) = 0.0_wp
enddo
enddo
enddo
do la = 1, na
do lb = 1, nb
do lc = 1, nc
do ld = 1, nd
c3(la,lc,ld) = c3(la,lc,ld) + a(la,lb) * b(lb, lc, ld)
enddo
enddo
enddo
enddo
Call System_clock( finish, rate )
Write( *, * ) 'Time for loop', Real( finish - start, wp ) / rate
!dgemm omp
Call System_clock( start, rate )
!$omp parallel
! Direct
Call dgemm( 'N', 'N', na, ne, nb, 1.0_wp, a , Size( a , Dim = 1 ), &
b , Size( b , Dim = 1 ), &
0.0_wp, c4, Size( c4, Dim = 1 ) )
!$omp end parallel
Call System_clock( finish, rate )
Write( *, * ) 'Time for straight method omp', Real( finish - start, wp ) / rate
!loop omp
Call System_clock( start, rate )
!$omp parallel
do la = 1, na
do lc = 1, nc
do ld = 1, nd
c5(la,lc,ld) = 0.0_wp
enddo
enddo
enddo
!$omp do private(la, lb, lc, ld) schedule(dynamic) reduction(+: c5)
do la = 1, na
do lb = 1, nb
do lc = 1, nc
do ld = 1, nd
c5(la,lc,ld) = c5(la,lc,ld) + a(la,lb) * b(lb, lc, ld)
enddo
enddo
enddo
enddo
!$omp end do
!$omp end parallel
Call System_clock( finish, rate )
Write( *, * ) 'Time for loop omp', Real( finish - start, wp ) / rate
!single core: c1 c2 c3
! c1 reshape blas
! c2 blas
! c3 naive loop (reference)
! parallel: c4 c5
! c4 dgemm parallel
! c5 naive loop parallel
do la = 1, na
do lc = 1, nc
do ld = 1, nd
if ( dabs(c3(la,lc,ld) - c1(la,lc,ld)) > 1.e-6 ) then
write (*,*) '!!! c1', c3(la,lc,ld) - c1(la,lc,ld)
endif
if ( dabs(c3(la,lc,ld) - c2(la,lc,ld)) > 1.e-6 ) then
write (*,*) '!!! c2', c3(la,lc,ld) - c2(la,lc,ld)
endif
if ( dabs(c3(la,lc,ld) - c4(la,lc,ld)) > 1.e-6 ) then
write (*,*) '!!! c4', la,lc,ld, c3(la,lc,ld) - c4(la,lc,ld)
endif
if ( dabs(c3(la,lc,ld) - c5(la,lc,ld)) > 1.e-6 ) then
write (*,*) '!!! c5', la,lc,ld, c3(la,lc,ld) - c5(la,lc,ld)
endif
enddo
enddo
enddo
End Program reshape_for_blas
Then compiling with

gfortran reshape.f90 -lblas -fopenmp

and input 30 30 30 30 leads to
Time for reshaping method 1.3519000000000001E-003
Difference between result matrices 12.380937791257775
Time for straight method 6.2549999999999997E-004
Time for straight method omp 1.2600000000000001E-003
Time for naive loop 1.0008599999999999E-002
Time for naive loop omp 1.6678999999999999E-002
Not a good speed-up, though.
You are calling DGEMM in parallel using the same set of variables (because variables in parallel regions are shared by default in Fortran). This doesn't work and produces weird results due to data races. You have two options:
Find a parallel BLAS implementation where DGEMM is already threaded. Intel MKL and OpenBLAS are prime candidates. Intel MKL uses OpenMP; more specifically, it is built with the Intel OpenMP runtime, so it may not play very nicely with OpenMP code compiled with GCC, but it works perfectly with non-threaded code.
Call DGEMM in parallel, but not with the same set of arguments. Instead, perform a block decomposition of one or both tensors and have each thread do the contraction for a separate block. Since Fortran uses column-major storage, it may be appropriate to decompose the second tensor:
C[i,k,l=1..L] = A[i,j] * B[j,k,l=1..L]
becomes with two threads:
thread 0: C[i,k,l=1..L/2] = A[i,j] * B[j,k,l=1..L/2]
thread 1: C[i,k,l=L/2+1..L] = A[i,j] * B[j,k,l=L/2+1..L]
With an arbitrary number of threads, it boils down to computing the start and end values for the l index in each thread and adjusting the arguments of DGEMM accordingly.
Personally, I'd go with a parallel BLAS implementation. With Intel MKL, you only need to link in the parallel driver and it will automagically use all the available CPUs.
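For illustration only (the exact library names and environment variables depend on how your BLAS was installed): with a threaded OpenBLAS build you would link the program above roughly as

gfortran reshape.f90 -lopenblas -fopenmp

and control the number of BLAS threads with the OPENBLAS_NUM_THREADS environment variable (or OMP_NUM_THREADS if OpenBLAS was built against OpenMP). With MKL's single dynamic library, the link line uses -lmkl_rt and the thread count is set with MKL_NUM_THREADS.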
A sample implementation of the block decomposition follows. Only additions and changes to your original code are shown:
! ADD: Use the OpenMP module
Use :: omp_lib
! ADD: Variables used for the decomposition
Integer :: ithr, istart, iend
! CHANGE: OpenMP with block decomposition
!$omp parallel private(ithr, istart, iend)
ithr = omp_get_thread_num()
! Zero-based index of the first plane in B for the current thread
istart = ithr * nd / omp_get_num_threads()
! Zero-based index of the first plane for the next thread
iend = (ithr + 1) * nd / omp_get_num_threads()
Call dgemm('N', 'N', na, nc * (iend - istart), nb, 1.0_wp, a, Size(a, Dim = 1), &
           b(1, 1, 1 + istart), Size(b, Dim = 1), &
           0.0_wp, c4(1, 1, 1 + istart), Size(c4, Dim = 1))
!$omp end parallel
istart is the zero-based index of the first plane of B that each individual thread works on, and iend is the first plane for the next thread, so iend - istart is the number of planes for the current thread. b(1, 1, 1 + istart) is where that thread's block of planes in B starts, and c4(1, 1, 1 + istart) is where the corresponding block in the result tensor starts.
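For example, with nd = 60 and two threads, thread 0 gets istart = 0 and iend = 30 (planes 1 to 30 of B) while thread 1 gets istart = 30 and iend = 60 (planes 31 to 60). The integer arithmetic also partitions the planes correctly when nd is not divisible by the number of threads, and a thread that ends up with iend == istart simply issues a DGEMM call with zero columns, which is a legal no-op.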
Make sure you do one of those but not both simultaneously. I.e., if your BLAS implementation is threaded but you decide to go with block decomposition, disable threading in the BLAS library. Conversely, if you use threading in the BLAS implementation, do not perform block decomposition in your code.
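If you do go the block-decomposition route with a BLAS that happens to be threaded, the easiest way to disable its internal threading is usually an environment variable (OPENBLAS_NUM_THREADS=1, MKL_NUM_THREADS=1, and so on). It can also be done from the code; the sketch below assumes OpenBLAS is the library being linked, since the routine name is specific to it (Intel MKL provides mkl_set_num_threads instead):

! Sketch only, assuming OpenBLAS: declare an interface to its C routine
! openblas_set_num_threads and force DGEMM to run single-threaded, so that
! the only parallelism comes from the OpenMP block decomposition.
Use, Intrinsic :: iso_c_binding, Only : c_int
Interface
   Subroutine openblas_set_num_threads(nt) Bind(C, Name='openblas_set_num_threads')
      Import :: c_int
      Integer(c_int), Value :: nt
   End Subroutine openblas_set_num_threads
End Interface

! In the executable part, before entering the !$omp parallel region:
Call openblas_set_num_threads(1_c_int)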