I am trying to write a Fortran application using OpenMP. I have extensive experience with MPI but am still struggling with OpenMP synchronization. Please consider the following example:
module ps_mod
contains

  subroutine cg(A, b, N)
    use iso_fortran_env
    implicit none
    real(real64)   :: A(:,:), b(:)
    integer(int32) :: N
    integer        :: i
    real(real64)   :: alpha
    real(real64), allocatable :: v(:), p0(:)

    allocate(v(N))
    allocate(p0(N))

    !$omp parallel shared(v,p0,A,b,i)
    v(:) = 10000.
    alpha = 1.0_real64
    p0(:) = b(:)
    !$omp barrier
    call dotp(p0, v, alpha, N)
    !$omp barrier
    alpha = 10.0/alpha
    !$omp barrier
    print *, "alpha", alpha
    !$omp barrier
    !$omp end parallel
  end subroutine cg

  subroutine dotp(x, y, res, n)
    ! res = sum(x*y)
    use iso_fortran_env
    implicit none
    integer(int32) :: n, j
    real(real64)   :: x(n), y(n), res

    res = 0.0_real64
    !$omp barrier
    !$omp do private(j) reduction(+:res)
    do j = 1, n
      res = res + x(j)*y(j)
    enddo
    !$omp barrier
  end subroutine dotp

end module ps_mod
!------------------------ main function
program main_omp
  use iso_fortran_env
  use ps_mod
  implicit none
  real(real64), allocatable :: mat(:,:), y(:)
  integer(int32) :: n, i

  n = 8000
  allocate(mat(n, n))
  allocate(y(n))
  mat = 0.0_real64
  do i = 1, n
    mat(i,i) = 1.0_real64
  enddo
  y = 0.2_real64
  call cg(mat, y, n)
end program main_omp
It takes a simple matrix and vector and performs some calculations on them, reducing the output into one variable, alpha.
I compiled it with gfortran 7.3.1 using
gfortran -O3 -fopenmp main_omp.f90
and ran it with 5 threads (export OMP_NUM_THREADS=5). The output changes between runs of
for i in $(seq 1 20); do ./a.out; done
...
alpha 6.2500000000000005E-007
alpha 15999999.999999998
alpha 15999999.999999998
alpha 15999999.999999998
alpha 15999999.999999998
alpha 15999999.999999998
alpha 6.2500000000000005E-007
alpha 6.2500000000000005E-007
...
With fewer threads the same thing happens, just less frequently; with a single thread I always get the same result. I therefore do not think it is a misallocated-memory issue (in my main program I also tested with libasan and it reports no leaks). I believe it is an OpenMP race condition, but I cannot diagnose where. As you can see, I have explicitly put barriers everywhere.
Update:
After a conversation with @Giles I figured out that the following construct does work if I use a dummy variable and explicitly make alpha private, but I do not know why.
real(real64)::s
!$omp parallel shared(v,p0,A,b,i) private(alpha)
v(:) = 10000.
alpha = 1.0_real64
p0(:) = b(:)
!$omp barrier
call dotp(p0, v, s, N) !<----- dummy s
!$omp barrier
alpha = s !<------ assign dummy s value to private alpha
!$omp barrier
alpha = 10.0/alpha
You might have a synchronisation at every line, but you don't have a synchronisation between every memory access, and it is the ordering of the memory accesses that you need to be absolutely sure of. In particular the line
alpha = 10.0/alpha
has multiple threads reading from and writing to a shared variable, possibly at the same time; there is no ordering of the memory accesses just because they appear on the same line. A simple example:
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ cat race.f90
Program in_line_racing
Implicit None
Real :: alpha, s
alpha = 3.0
!$omp parallel default( none ) shared( alpha ) private( s )
alpha = 1.0 / alpha
s = alpha
Write( *, * ) s
!$omp end parallel
End Program in_line_racing
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ gfortran --version
GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ gfortran -fopenmp race.f90
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ export OMP_NUM_THREADS=4
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out
0.333333343
0.333333343
0.333333343
3.00000000
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out
3.00000000
3.00000000
0.333333343
0.333333343
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out
3.00000000
3.00000000
0.333333343
0.333333343
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out
3.00000000
0.333333343
3.00000000
0.333333343
ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out
3.00000000
0.333333343
0.333333343
0.333333343
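For contrast, here is a race-free variant of the same toy program. This is my own sketch, not part of the original code: making alpha firstprivate gives every thread its own copy (initialised to 3.0), so the read-modify-write never touches memory another thread uses, and every thread prints 0.333333343.

Program in_line_no_racing
  Implicit None
  Real :: alpha, s
  alpha = 3.0
  ! firstprivate: each thread gets its own copy of alpha, initialised
  ! to the value it had just before the parallel region
  !$omp parallel default( none ) firstprivate( alpha ) private( s )
  alpha = 1.0 / alpha   ! read and write hit this thread's private copy only
  s = alpha
  Write( *, * ) s
  !$omp end parallel
End Program in_line_no_racing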
In your real code, though, you need the result of the reduction to be a shared variable, so your updated version works because you first make sure the reduced variable (the shared dummy s) is completely up to date on all threads, and only then perform the inversion on a private alpha, which cannot suffer a race condition.
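An alternative to the dummy variable, sketched here by me rather than taken from either post, is to keep alpha shared, let the reduction in dotp fill it in, and then have exactly one thread perform the inversion inside an OMP SINGLE construct; the implicit barrier at END SINGLE means every thread sees the finished value before it prints. Something along these lines (dotp unchanged, variable names as in the question):

!$omp parallel shared(v, p0, A, b, alpha)
v(:) = 10000.
p0(:) = b(:)
!$omp barrier
call dotp(p0, v, alpha, N)    ! the reduction leaves the full dot product in the shared alpha
!$omp single
alpha = 10.0_real64 / alpha   ! exactly one thread reads and writes alpha here
!$omp end single              ! implicit barrier: everyone now sees the inverted value
print *, "alpha", alpha
!$omp end parallel

This costs one serialised scalar operation instead of an extra shared dummy, which is negligible; the key point is the same either way: the reduction target must be shared, and the later read-modify-write of it must not be done concurrently by several threads.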
Welcome to threaded programming. Bugs are very easy!