I have this piece of code (removed just the initialisation of variable that's rather long and outside the parallel region). I am testing it on a local machine (4 physical cores, 8 threads) and comparing speed and result with its serial version. When I run the code with more than 4 threads it seems to incur sometimes in some race condition and the final output (variable T written to disk after the parallel region) is different in the two cases. When I run with 4 or less threads everything is fine, both code run with the same number of iterations and give the final result. From the documentation, there is an implicit synchronisation at the end of each OMP DO block (unless you specify nowait).
program test
integer :: nx=500,ny=500
integer :: i,j,iteration
double precision, allocatable, dimension(:,:) :: T, T_old
double precision :: dx,dy,dt
double precision :: error,change,delta,errtol
allocate(T(0:nx+1,0:ny+1))
allocate(T_old(0:nx+1,0:ny+1))
! initialisation of T, T_old, dt, dx, dy and errtol
error=1.0d0
iteration=0
!$OMP PARALLEL SHARED(error,iteration,change) private(i,j,delta)
do while (error.gt.errtol.and.error.lt.10.0d0)
change=0.0d0
!$OMP DO schedule(static) reduction(max:change)
do j=1,ny
do i=1,nx
delta=dt*( (T_old(i+1,j)-2.0d0*T_old(i,j)+T_old(i-1,j))/dx**2 + &
(T_old(i,j+1)-2.0d0*T_old(i,j)+T_old(i,j-1))/dy**2 )
T(i,j)=T_old(i,j)+delta
change=max(delta,change)
enddo
enddo
!$OMP END DO
! implicit barrier (implies FLUSH) at end of parallel do region (unless you specify nowait clause)
!$OMP SINGLE
error=change
! just one thread updates iteration
iteration=iteration+1
! write(*,*) iteration, error
!$OMP END SINGLE
!$OMP DO schedule(static)
! update T_old
do j=1,ny
do i=1,nx
T_old(i,j)=T(i,j)
enddo
enddo
!$OMP END DO
enddo
!$OMP END PARALLEL
! write T to disk
deallocate(T,T_old)
end program test
EDIT: correct code, see @Gilles comment:
program test
integer :: nx=500,ny=500
integer :: i,j,iteration
double precision, allocatable, dimension(:,:) :: T, T_old
double precision :: dx,dy,dt
double precision :: error,change,delta,errtol
allocate(T(0:nx+1,0:ny+1))
allocate(T_old(0:nx+1,0:ny+1))
! initialisation of T, T_old, dt, dx, dy and errtol
error=1.0d0
iteration=0
change=0.0d0
!$OMP PARALLEL SHARED(error,iteration,change) private(i,j,delta)
do while (error.gt.errtol.and.error.lt.10.0d0)
!$OMP DO schedule(static) reduction(max:change)
do j=1,ny
do i=1,nx
delta=dt*( (T_old(i+1,j)-2.0d0*T_old(i,j)+T_old(i-1,j))/dx**2 + &
(T_old(i,j+1)-2.0d0*T_old(i,j)+T_old(i,j-1))/dy**2 )
T(i,j)=T_old(i,j)+delta
change=max(delta,change)
enddo
enddo
!$OMP END DO
! implicit barrier (implies FLUSH) at end of parallel do region (unless you specify nowait clause)
!$OMP SINGLE
error=change
change=0.0d0
! just one thread updates iteration
iteration=iteration+1
! write(*,*) iteration, error
!$OMP END SINGLE
!$OMP DO schedule(static)
! update T_old
do j=1,ny
do i=1,nx
T_old(i,j)=T(i,j)
enddo
enddo
!$OMP END DO
enddo
!$OMP END PARALLEL
! write T to disk
deallocate(T,T_old)
end program test
Race condition on the reinitialization of the variable change
in the DO WHILE loop has been removed. Solved by initializing change
outside the parallel region and protecting its update in the parallel region with a !$OMP SINGLE
directive.
program test
integer :: nx=500,ny=500
integer :: i,j,iteration
double precision, allocatable, dimension(:,:) :: T, T_old
double precision :: dx,dy,dt
double precision :: error,change,delta,errtol
allocate(T(0:nx+1,0:ny+1))
allocate(T_old(0:nx+1,0:ny+1))
! initialisation of T, T_old, dt, dx, dy and errtol
error=1.0d0
iteration=0
change=0.0d0
!$OMP PARALLEL SHARED(error,iteration,change) private(i,j,delta)
do while (error.gt.errtol.and.error.lt.10.0d0)
!$OMP DO schedule(static) reduction(max:change)
do j=1,ny
do i=1,nx
delta=dt*( (T_old(i+1,j)-2.0d0*T_old(i,j)+T_old(i-1,j))/dx**2 + &
(T_old(i,j+1)-2.0d0*T_old(i,j)+T_old(i,j-1))/dy**2 )
T(i,j)=T_old(i,j)+delta
change=max(delta,change)
enddo
enddo
!$OMP END DO
! implicit barrier (implies FLUSH) at end of parallel do region (unless you specify nowait clause)
!$OMP SINGLE
error=change
change=0.0d0
! just one thread updates iteration
iteration=iteration+1
! write(*,*) iteration, error
!$OMP END SINGLE
!$OMP DO schedule(static)
! update T_old
do j=1,ny
do i=1,nx
T_old(i,j)=T(i,j)
enddo
enddo
!$OMP END DO
enddo
!$OMP END PARALLEL
! write T to disk
deallocate(T,T_old)
end program test