TL;DR: OpenMP task-parallel code fails to recognize a task dependency between partially overlapping subarrays.
I'm trying to parallelize some Fortran code using task parallelism, but I'm struggling to get the task dependencies to work.
The following minimal example illustrates my problem:
program test
   implicit none
   integer, parameter :: n = 100
   double precision :: A(n,n), B(n,n)
   integer :: istart, istop

   !$omp parallel
   !$omp single
   istart = 20
   istop = 50
   !$omp task shared(A,B) firstprivate(istart,istop)&
   !$omp& depend( in: B(istart:istop,istart:istop) )&
   !$omp& depend( inout: A(istart:istop,istart:istop) )
   write(*,*) "starting task 1"
   call sleep(5)   ! GNU extension: stand-in for real work
   write(*,*) "stopping task 1"
   !$omp end task

   istart = 30
   istop = 60
   !$omp task shared(A,B) firstprivate(istart,istop)&
   !$omp& depend( in: B(istart:istop,istart:istop) )&
   !$omp& depend( inout: A(istart:istop,istart:istop) )
   write(*,*) "starting task 2"
   call sleep(5)   ! GNU extension: stand-in for real work
   write(*,*) "stopping task 2"
   !$omp end task
   !$omp end single nowait
   !$omp end parallel
end program
I compile this with GCC 9.4.0 and the -fopenmp flag.
The code has two tasks. Task 1 depends on A(20:50,20:50) and task 2 depends on A(30:60,30:60). These subarrays overlap, so I would expect task 2 to wait until task 1 has completed, but the output I get is:
starting task 1
starting task 2
stopping task 1
stopping task 2
If I comment out the lines istart = 30 and istop = 60, so that the two subarrays are exactly the same instead of just overlapping, task 2 does wait for task 1.
Are task dependencies on overlapping subarrays not supported in OpenMP? Or am I defining the dependencies wrong somehow?
"These subarrays overlap"

This is forbidden by the OpenMP standard, as pointed out by @ThijsSteel: list items in the depend clauses of sibling tasks must indicate identical or disjoint storage locations, so partially overlapping array sections are not supported.
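Since the standard only defines the behavior for identical or disjoint list items, the simplest conforming fix for the example in the question is to make both tasks depend on identical storage. A minimal sketch, assuming it is acceptable to depend on the whole arrays (this is coarser than the original intent and serializes more than strictly necessary):

!$omp task shared(A,B) firstprivate(istart,istop)&
!$omp& depend( in: B )&
!$omp& depend( inout: A )
! ... work on A(istart:istop,istart:istop) ...
!$omp end task

With both tasks written this way, the two inout dependences name identical storage, so task 2 waits for task 1, exactly as in the identical-subarray variant of the question.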
The standard made this choice because of the resulting runtime overhead. Checking for overlapping array regions in dependencies would be very expensive in pathological cases (especially for arrays with many dimensions). A typical pathological case is when many tasks write to the same array (on disjoint parts) and one task then operates on the whole array: this creates a lot of checks and dependencies, although it could be optimized with a specific dependency pattern. An even more pathological case is a transposition of the sub-arrays (from n horizontal bands to n vertical bands): this quickly results in n² dependencies, which is insanely inefficient when n is large. An optimization consists in aggregating the dependencies to get 2n of them, but this is expensive to do at runtime, and a better approach is to not use dependencies at all and rely on task-group synchronization instead (see the sketch below).
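To make that last point concrete, here is a minimal sketch of the band-transposition pattern using task-group synchronization instead of dependencies (the band updates are made up for illustration): a single taskwait between the horizontal-band wave and the vertical-band wave replaces the n² pairwise dependencies:

program waves
   implicit none
   integer, parameter :: n = 100, nbands = 4
   double precision :: A(n,n)
   integer :: ib, lo, hi
   A = 0d0
   !$omp parallel
   !$omp single
   do ib = 1, nbands                 ! wave 1: one task per horizontal band
      lo = (ib-1)*n/nbands + 1
      hi = ib*n/nbands
      !$omp task shared(A) firstprivate(lo,hi)
      A(lo:hi,:) = A(lo:hi,:) + 1d0
      !$omp end task
   end do
   !$omp taskwait                    ! one synchronization for the whole wave
   do ib = 1, nbands                 ! wave 2: one task per vertical band
      lo = (ib-1)*n/nbands + 1
      hi = ib*n/nbands
      !$omp task shared(A) firstprivate(lo,hi)
      A(:,lo:hi) = 2d0*A(:,lo:hi)
      !$omp end task
   end do
   !$omp end single
   !$omp end parallel
end program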
From the user's point of view, there are only a few options:

- Define the dependencies on identical or fully disjoint storage locations (as the identical-subarray variant in the question already does).
- Use manual synchronization (e.g. taskwait). You can also make use of advanced task-dependence-type modifiers like depobj, inoutset or mutexinoutset, now available in recent versions of the OpenMP standard (see the sketch below). Most of the time, an algorithm is composed of several waves of tasks, and each wave operates on non-overlapping sub-arrays. In this case, you can just add a task synchronization at the end of the computation (or use recursive tasks).
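For instance, with mutexinoutset both tasks can name the identical list item A; the runtime then runs them in either order but never simultaneously, which fits updates that are commutative but must not overlap in time. A minimal sketch, assuming a compiler that implements this OpenMP 5.0 dependence type (GCC 9.4 may not):

!$omp task shared(A) depend( mutexinoutset: A )
! first update of A (mutually exclusive with the task below)
!$omp end task
!$omp task shared(A) depend( mutexinoutset: A )
! second update of A (order relative to the task above is unspecified)
!$omp end task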
This is a known limitation of OpenMP, and some researchers have worked on it (including me). I expect new specific dependency modifiers to be added in the future, and possibly user-defined partitioning-related features in the long run (some research runtimes, like StarPU, already do that, but not everyone agrees on adding it to OpenMP, as it is rather high-level and not simple to include).