multithreading fortran openmp race-condition

OpenMP: apparent race condition when increasing number of threads

I have this piece of code (removed just the initialisation of variable that's rather long and outside the parallel region). I am testing it on a local machine (4 physical cores, 8 threads) and comparing speed and result with its serial version. When I run the code with more than 4 threads it seems to incur sometimes in some race condition and the final output (variable T written to disk after the parallel region) is different in the two cases. When I run with 4 or less threads everything is fine, both code run with the same number of iterations and give the final result. From the documentation, there is an implicit synchronisation at the end of each OMP DO block (unless you specify nowait).

program test

integer :: nx=500,ny=500
integer :: i,j,iteration

double precision, allocatable, dimension(:,:) :: T, T_old
double precision :: dx,dy,dt
double precision :: error,change,delta,errtol

allocate(T(0:nx+1,0:ny+1))
allocate(T_old(0:nx+1,0:ny+1))

! initialisation of T, T_old, dt, dx, dy and errtol

error=1.0d0
iteration=0

!$OMP PARALLEL SHARED(error,iteration,change) private(i,j,delta)

do while (error.gt.errtol.and.error.lt.10.0d0)
  change=0.0d0
!$OMP DO schedule(static) reduction(max:change)
  do j=1,ny
    do i=1,nx
      delta=dt*( (T_old(i+1,j)-2.0d0*T_old(i,j)+T_old(i-1,j))/dx**2 + &
                  (T_old(i,j+1)-2.0d0*T_old(i,j)+T_old(i,j-1))/dy**2  )
      T(i,j)=T_old(i,j)+delta
      change=max(delta,change)
    enddo
  enddo
!$OMP END DO
! implicit barrier (implies FLUSH) at end of parallel do region (unless you specify nowait clause)


!$OMP SINGLE
  error=change
! just one thread updates iteration
  iteration=iteration+1
  ! write(*,*) iteration, error
!$OMP END SINGLE

!$OMP DO schedule(static)
  ! update T_old
  do j=1,ny
    do i=1,nx
      T_old(i,j)=T(i,j)
    enddo
  enddo
!$OMP END DO
enddo
!$OMP END PARALLEL

! write T to disk

deallocate(T,T_old)

end program test

EDIT: correct code, see @Gilles comment:

program test

integer :: nx=500,ny=500
integer :: i,j,iteration

double precision, allocatable, dimension(:,:) :: T, T_old
double precision :: dx,dy,dt
double precision :: error,change,delta,errtol

allocate(T(0:nx+1,0:ny+1))
allocate(T_old(0:nx+1,0:ny+1))

! initialisation of T, T_old, dt, dx, dy and errtol

error=1.0d0
iteration=0
change=0.0d0

!$OMP PARALLEL SHARED(error,iteration,change) private(i,j,delta)

do while (error.gt.errtol.and.error.lt.10.0d0)
!$OMP DO schedule(static) reduction(max:change)
  do j=1,ny
    do i=1,nx
      delta=dt*( (T_old(i+1,j)-2.0d0*T_old(i,j)+T_old(i-1,j))/dx**2 + &
                  (T_old(i,j+1)-2.0d0*T_old(i,j)+T_old(i,j-1))/dy**2  )
      T(i,j)=T_old(i,j)+delta
      change=max(delta,change)
    enddo
  enddo
!$OMP END DO
! implicit barrier (implies FLUSH) at end of parallel do region (unless you specify nowait clause)


!$OMP SINGLE
  error=change
  change=0.0d0
! just one thread updates iteration
  iteration=iteration+1
  ! write(*,*) iteration, error
!$OMP END SINGLE

!$OMP DO schedule(static)
  ! update T_old
  do j=1,ny
    do i=1,nx
      T_old(i,j)=T(i,j)
    enddo
  enddo
!$OMP END DO
enddo
!$OMP END PARALLEL

! write T to disk

deallocate(T,T_old)

end program test

Solution

Race condition on the reinitialization of the variable change in the DO WHILE loop has been removed. Solved by initializing change outside the parallel region and protecting its update in the parallel region with a !$OMP SINGLE directive.

program test

integer :: nx=500,ny=500
integer :: i,j,iteration

double precision, allocatable, dimension(:,:) :: T, T_old
double precision :: dx,dy,dt
double precision :: error,change,delta,errtol

allocate(T(0:nx+1,0:ny+1))
allocate(T_old(0:nx+1,0:ny+1))

! initialisation of T, T_old, dt, dx, dy and errtol

error=1.0d0
iteration=0
change=0.0d0

!$OMP PARALLEL SHARED(error,iteration,change) private(i,j,delta)

do while (error.gt.errtol.and.error.lt.10.0d0)
!$OMP DO schedule(static) reduction(max:change)
  do j=1,ny
    do i=1,nx
      delta=dt*( (T_old(i+1,j)-2.0d0*T_old(i,j)+T_old(i-1,j))/dx**2 + &
                  (T_old(i,j+1)-2.0d0*T_old(i,j)+T_old(i,j-1))/dy**2  )
      T(i,j)=T_old(i,j)+delta
      change=max(delta,change)
    enddo
  enddo
!$OMP END DO
! implicit barrier (implies FLUSH) at end of parallel do region (unless you specify nowait clause)


!$OMP SINGLE
  error=change
  change=0.0d0
! just one thread updates iteration
  iteration=iteration+1
  ! write(*,*) iteration, error
!$OMP END SINGLE

!$OMP DO schedule(static)
  ! update T_old
  do j=1,ny
    do i=1,nx
      T_old(i,j)=T(i,j)
    enddo
  enddo
!$OMP END DO
enddo
!$OMP END PARALLEL

! write T to disk

deallocate(T,T_old)

end program test