Coarray deadlock in do-cycle-exit


A strange phenomenon occurs in the following coarray code:

program strange

implicit none
integer :: counter = 0
logical :: co_missionAccomplished[*]

co_missionAccomplished = .false.

sync all

do
  if (this_image()==1) then
    counter = counter+1
    if (counter==2) co_missionAccomplished = .true.
    sync images(*)
  else
    sync images(1)
  end if
  if (co_missionAccomplished[1]) exit
  cycle
end do

write(*,*) "missionAccomplished on image ", this_image()

end program strange

This program never terminates: it appears to deadlock for any counter threshold greater than 1 inside the loop. The code is compiled with Intel Fortran 2018 on Windows, with the following flags:

ifort /debug /Qcoarray=shared /standard-semantics /traceback /gen-interfaces /check /fpe:0 normal.f90 -o run.exe

The same code, written with a DO WHILE construct, appears to suffer from the same problem:

program strange

implicit none
integer :: counter = 0
logical :: co_missionAccomplished[*]

co_missionAccomplished = .true.

sync all

do while(co_missionAccomplished[1])
  if (this_image()==1) then
    counter = counter+1
    if (counter==2) co_missionAccomplished = .false.
    sync images(*)
  else
    sync images(1)
  end if
end do

write(*,*) "missionAccomplished on image ", this_image()

end program strange

This seems now too trivial to be a compiler bug, so I am probably missing something important about do-loops in parallel. Any help is appreciated.

UPDATE:

Adding a SYNC ALL statement before the CYCLE statement in the DO-CYCLE-EXIT example above resolves the deadlock. Likewise, a SYNC ALL statement placed right after the DO WHILE statement, as the first line of the block, resolves the deadlock there. So apparently all the images must be synchronized before each cycle of the loop to avoid a deadlock in either case.


Solution

  • Regarding "This seems now too trivial to be a compiler bug": you may be surprised at how often seemingly trivial things are treated incorrectly by a compiler. Few things relating to coarrays are trivial.

    Consider the following program which is related:

      implicit none
      integer i[*]
    
      do i=1,1
         sync all
         print '(I1)', i[1]
      end do
    
    end
    

    I get the initially surprising output

    1
    2
    

    when run with two images under ifort 2018.1.

    Let's look at what's going on.

    In my loop, i[1] has the value 1 at the moment the images are synchronized. However, by the time the second image reads i[1], the first image may already have finished its iteration and incremented i to 2.

    We solve that little problem by putting an extra synchronization statement before the end do.
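
    As a minimal sketch of that fix, here is the same program with a second sync all added before the end do. The program name loop_index_sync is only a label chosen for this sketch. With the extra barrier, image 1 cannot end its iteration (and so change i) until every image has read i[1], so each image should print 1.

    ```fortran
    program loop_index_sync
      implicit none
      integer :: i[*]

      do i = 1, 1
         sync all              ! every image has started the iteration; i[1] is 1
         print '(I1)', i[1]    ! read the loop index held on image 1
         sync all              ! added: image 1 waits here until all images have read i[1]
      end do
    end program loop_index_sync
    ```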

    How is this program related to the one in the question? It has the same lack of synchronization between testing a value on a remote image and that image updating it.

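    Between the synchronization and the other images testing the value of co_missionAccomplished[1], the first image may race ahead into its next iteration, update counter, and then set co_missionAccomplished. Some images may therefore see the exit state in their first iteration and leave the loop, while image 1 goes on to wait in sync images(*) for images that will never execute the matching sync images(1).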
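
    One way to close that window, sketched here under a few assumptions, is to separate "image 1 updates the flag" from "every image reads the flag" with a barrier on each side. This sketch swaps the question's sync images pair for sync all to keep it short; the local variable done and the program name mission_sync are names introduced for this sketch only. It is in the same spirit as the SYNC ALL before CYCLE described in the question's update.

    ```fortran
    program mission_sync
      implicit none
      integer :: counter = 0
      logical :: co_missionAccomplished[*]
      logical :: done   ! local copy of the flag (hypothetical name for this sketch)

      co_missionAccomplished = .false.
      sync all

      do
         if (this_image() == 1) then
            counter = counter + 1
            if (counter == 2) co_missionAccomplished = .true.
         end if
         sync all                          ! image 1's update is complete before anyone reads
         done = co_missionAccomplished[1]  ! all images read the flag in the same segment
         sync all                          ! image 1 cannot update again until all have read
         if (done) exit                    ! every image exits in the same iteration
      end do

      write(*,*) "missionAccomplished on image ", this_image()
    end program mission_sync
    ```

    Because each image takes its exit decision from a value read between the same two barriers, all images execute the same number of sync all statements and none is left waiting for an image that has already left the loop.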