fortran, openmp, race-condition

Possible race condition despite barrier at every step in OpenMP


I am trying to write a Fortran application using OpenMP. I have extensive experience with MPI but am still struggling with OpenMP synchronization. Please consider the following example:

module ps_mod

    contains
    subroutine cg(A,b,N)
        use iso_fortran_env
        implicit none
        real(real64)        :: A(:,:), b(:)
        integer(int32)      :: N

        integer                   :: i
        real(real64)              :: alpha
        real(real64), allocatable :: v(:), p0(:)


        allocate(v(N))
        allocate(p0(N))

        !$omp parallel shared(v,p0,A,b,i)
        v(:)   = 10000.
        alpha = 1.0_real64
        p0(:) = b(:)

        !$omp barrier
        call dotp(p0, v, alpha, N)
        !$omp barrier

        alpha = 10.0/alpha
        !$omp barrier

        print *, "alpha", alpha

        !$omp barrier
        !$omp end parallel

    end subroutine cg

    subroutine dotp(x,y,res,n)
        !res = sum(x*y)
        use iso_fortran_env
        implicit none
        integer(int32)  :: n, j
        real(real64)    :: x(n), y(n), res

        res=0.0_real64
        !$omp barrier
        !$omp do private(j) reduction(+:res)
        do j = 1, n
            res = res + x(j)*y(j)
        enddo
        !$omp barrier
    end subroutine dotp

end module ps_mod

!------------------------ main function
program main_omp
    use iso_fortran_env
    use ps_mod
    implicit none
    real(real64), allocatable           :: mat(:,:), y(:)
    integer(int32) :: n, i

    n = 8000
    allocate(mat(n, n))
    allocate(y(n))

    mat = 0.0_real64
    do i = 1,n
        mat(i,i) = 1.0_real64
    enddo
    y = 0.2_real64

    call cg(mat, y, n)
end program main_omp

It takes a simple matrix and vector, performs some calculations on them, and reduces the output into a single variable, alpha.

I compiled it with gfortran 7.3.1 (gfortran -O3 -fopenmp main_omp.f90) and ran it with 5 threads (export OMP_NUM_THREADS=5). The output changes between runs of for i in $(seq 1 20); do ./a.out; done:

...
 alpha   6.2500000000000005E-007
 alpha   15999999.999999998     
 alpha   15999999.999999998     
 alpha   15999999.999999998     
 alpha   15999999.999999998     
 alpha   15999999.999999998     
 alpha   6.2500000000000005E-007
 alpha   6.2500000000000005E-007
...

With fewer threads it is the same, just a bit less frequent; with a single thread, however, I always get the same result, hence I do not think it is a misallocated-memory issue (in my main program I also tested with libasan and it reports no leaks). I therefore believe it is an OpenMP race condition, but I cannot diagnose where. As you can see, I have explicitly put barriers everywhere.


Update:

After conversations with @Giles I figured out that the following construct does work if I use a dummy variable and explicitly make alpha private, but I do not know why.

        real(real64) :: s
        !$omp parallel shared(v,p0,A,b,i) private(alpha)
        v(:)   = 10000.
        alpha = 1.0_real64
        p0(:) = b(:)

        !$omp barrier
        call dotp(p0, v, s, N) !<----- dummy s
        !$omp barrier
        alpha = s  !<------ assign dummy s value to private alpha
        !$omp barrier
        alpha = 10.0/alpha
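
For reference, here is a minimal sketch (not from the original post) of how the complete fixed parallel region can then read, folding the alpha = s step into the division; everything else in cg and dotp stays unchanged:

        real(real64) :: s
        !$omp parallel shared(v,p0,A,b,i) private(alpha)
        v(:)   = 10000.
        p0(:)  = b(:)

        !$omp barrier
        call dotp(p0, v, s, N) ! the reduction result lands in the shared dummy s
        !$omp barrier

        alpha = 10.0/s         ! each thread divides into its own private alpha
        print *, "alpha", alpha
        !$omp end parallel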

Solution

  • You might have a synchronisation at every line, but you don't have a synchronisation between every memory access, and it is the memory accesses whose ordering you need to be absolutely sure of. In particular, the line

    alpha = 10.0/alpha
    

    has multiple threads reading from and writing to a shared variable, possibly at the same time; there is no ordering of the memory accesses just because they are on the same line. A simple example:

    ijb@LAPTOP-GUG8KQ9I:~/work/stack$ cat race.f90 
    Program in_line_racing
    
      Implicit None
    
      Real :: alpha, s
    
      alpha = 3.0
    
      !$omp parallel default( none ) shared( alpha ) private( s )
      alpha = 1.0 / alpha
      s = alpha
      Write( *, * ) s
      !$omp end parallel
    
    End Program in_line_racing
    ijb@LAPTOP-GUG8KQ9I:~/work/stack$ gfortran --version
    GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
    Copyright (C) 2019 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    
    ijb@LAPTOP-GUG8KQ9I:~/work/stack$ gfortran -fopenmp race.f90 
    ijb@LAPTOP-GUG8KQ9I:~/work/stack$ export OMP_NUM_THREADS=4
    ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out 
      0.333333343    
      0.333333343    
      0.333333343    
       3.00000000    
    ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out 
       3.00000000    
       3.00000000    
      0.333333343    
      0.333333343    
    ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out 
       3.00000000    
       3.00000000    
      0.333333343    
      0.333333343    
    ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out 
       3.00000000    
      0.333333343    
       3.00000000    
      0.333333343    
    ijb@LAPTOP-GUG8KQ9I:~/work/stack$ ./a.out 
       3.00000000    
      0.333333343    
      0.333333343    
      0.333333343 
    

    You need the result of the reduction to be a shared variable. Your updated version works because you first make sure the reduced value is completely up to date on all threads, and then perform the division on a private variable, which cannot suffer a race condition.
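
    If you would rather keep alpha shared, an alternative sketch (not from the original post) is to let exactly one thread perform the division inside an OpenMP single construct, relying on the implicit barrier at its end so every thread sees the updated value before reading it:

    !$omp barrier
    call dotp(p0, v, alpha, N)
    !$omp barrier
    !$omp single
    alpha = 10.0/alpha   ! only one thread reads and writes the shared alpha
    !$omp end single     ! implicit barrier: all threads now see the new value

    print *, "alpha", alpha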

    Welcome to threaded programming. Bugs are very easy!