TL;DR: OpenMP task-parallel code fails to recognize a task dependency between partially overlapping subarrays.
I'm trying to parallelize some Fortran code using task parallelism, but I'm struggling to get the task dependencies to work.
The following minimal example illustrates my problem:
program test
   implicit none
   integer, parameter :: n = 100
   double precision :: A(n,n), B(n,n)
   integer :: istart, istop

   !$omp parallel
   !$omp single
   istart = 20
   istop = 50
   !$omp task shared(A,B) firstprivate(istart,istop)&
   !$omp& depend( in: B(istart:istop,istart:istop) )&
   !$omp& depend( inout: A(istart:istop,istart:istop) )
   write(*,*) "starting task 1"
   call sleep(5)   ! GNU extension: stand-in for real work
   write(*,*) "stopping task 1"
   !$omp end task

   istart = 30
   istop = 60
   !$omp task shared(A,B) firstprivate(istart,istop)&
   !$omp& depend( in: B(istart:istop,istart:istop) )&
   !$omp& depend( inout: A(istart:istop,istart:istop) )
   write(*,*) "starting task 2"
   call sleep(5)   ! GNU extension: stand-in for real work
   write(*,*) "stopping task 2"
   !$omp end task
   !$omp end single nowait
   !$omp end parallel
end program
I compile this with GCC 9.4.0 and the -fopenmp flag.
The code has two tasks. Task 1 depends on A(20:50,20:50) and task 2 depends on A(30:60,30:60). These subarrays overlap, so I would expect task 2 to wait until task 1 has completed, but the output I get is:
starting task 1
starting task 2
stopping task 1
stopping task 2
If I comment out the lines istart = 30 and istop = 60, so that the two subarrays are exactly the same instead of just overlapping, task 2 does wait for task 1.
Are task dependencies on overlapping subarrays not supported in OpenMP? Or am I defining the dependencies wrong somehow?
"These subarrays overlap"

This is forbidden by the OpenMP standard, as pointed out by @ThijsSteel: list items in the depend clauses of sibling tasks must indicate identical or disjoint storage locations, so partially overlapping array sections are not supported.
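Since the standard only defines the behavior for identical or disjoint list items, the simplest conforming fix for the example in the question is to make both tasks depend on identical storage. A minimal sketch, assuming it is acceptable to depend on the whole arrays (this is coarser than the original intent and serializes more than strictly necessary):

!$omp task shared(A,B) firstprivate(istart,istop)&
!$omp& depend( in: B )&
!$omp& depend( inout: A )
! ... work on A(istart:istop,istart:istop) ...
!$omp end task

With both tasks written this way, the two inout dependences name identical storage, so task 2 waits for task 1, exactly as in the identical-subarray variant of the question.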
The standard made this choice because of the resulting runtime overhead. Checking for overlapping array regions in dependencies would be very expensive in pathological cases (especially for arrays with many dimensions). A typical pathological case is when many tasks write to the same array (on disjoint parts) and one task then operates on the whole array: this creates a lot of checks and dependencies, although it could be optimized with a specific dependency pattern. An even more pathological case is a transposition of the sub-arrays (from n horizontal bands to n vertical bands): this quickly results in n² dependencies, which is insanely inefficient when n is large. An optimization consists in aggregating the dependencies to get 2n of them, but this is expensive to do at runtime, and a better approach is to not use dependencies at all and rely on task-group synchronization instead (see the sketch below).
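To make that last point concrete, here is a minimal sketch of the band-transposition pattern using task-group synchronization instead of dependencies (the band updates are made up for illustration): a single taskwait between the horizontal-band wave and the vertical-band wave replaces the n² pairwise dependencies:

program waves
   implicit none
   integer, parameter :: n = 100, nbands = 4
   double precision :: A(n,n)
   integer :: ib, lo, hi
   A = 0d0
   !$omp parallel
   !$omp single
   do ib = 1, nbands                 ! wave 1: one task per horizontal band
      lo = (ib-1)*n/nbands + 1
      hi = ib*n/nbands
      !$omp task shared(A) firstprivate(lo,hi)
      A(lo:hi,:) = A(lo:hi,:) + 1d0
      !$omp end task
   end do
   !$omp taskwait                    ! one synchronization for the whole wave
   do ib = 1, nbands                 ! wave 2: one task per vertical band
      lo = (ib-1)*n/nbands + 1
      hi = ib*n/nbands
      !$omp task shared(A) firstprivate(lo,hi)
      A(:,lo:hi) = 2d0*A(:,lo:hi)
      !$omp end task
   end do
   !$omp end single
   !$omp end parallel
end program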
From the user's point of view, there are only a few options:

- Define the dependencies on identical or fully disjoint storage locations (as the identical-subarray variant in the question already does).
- Use manual synchronization (e.g. taskwait). You can also make use of advanced task-dependence-type modifiers like depobj, inoutset or mutexinoutset, now available in recent versions of the OpenMP standard (see the sketch below). Most of the time, an algorithm is composed of several waves of tasks, and each wave operates on non-overlapping sub-arrays. In this case, you can just add a task synchronization at the end of the computation (or use recursive tasks).
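For instance, with mutexinoutset both tasks can name the identical list item A; the runtime then runs them in either order but never simultaneously, which fits updates that are commutative but must not overlap in time. A minimal sketch, assuming a compiler that implements this OpenMP 5.0 dependence type (GCC 9.4 may not):

!$omp task shared(A) depend( mutexinoutset: A )
! first update of A (mutually exclusive with the task below)
!$omp end task
!$omp task shared(A) depend( mutexinoutset: A )
! second update of A (order relative to the task above is unspecified)
!$omp end task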
This is a known limitation of OpenMP, and some researchers have worked on it (including me). I expect new specific dependency modifiers to be added in the future, and possibly user-defined partitioning-related features in the long run (some research runtimes, like StarPU, already do that, but not everyone agrees on adding it to OpenMP, as it is rather high-level and not simple to include).