
OpenMP taskloop: synchronization between two consecutive taskloop constructs

How are two consecutive taskloop constructs synchronized? Specifically, in the following pseudocode, if more threads are available than the first loop has tasks, I believe the free threads spin at the implicit barrier at the end of the single construct. Are these free threads allowed to start executing the second loop concurrently, which would make it unsafe to parallelize things this way (because of the dependency on the array A)?

!$omp parallel
!$omp single

!$omp taskloop num_tasks(10)
DO i=1, 10
    A(i) = foo()
END DO
!$omp end taskloop

!do other stuff

!$omp taskloop
DO j=1, 10
    B(j) = A(j)
END DO
!$omp end taskloop

!$omp end single
!$omp end parallel

I haven't been able to find a clear answer in the API specification: https://www.openmp.org/spec-html/5.0/openmpsu47.html#x71-2080002.10.2

Solution

The taskloop construct by default has an implicit taskgroup around it. With that in mind, what happens in your code is that the single construct picks one thread out of the available threads of the parallel team (I'll call that the producer thread). The n-1 other threads are then sent straight to the barrier of the single construct, where they wait for work (the tasks) to arrive.

With the taskgroup, the producer thread kicks off the creation of the loop tasks, but then waits at the end of the taskloop construct for all the created tasks to finish:

!$omp parallel
!$omp single

!$omp taskloop num_tasks(10)
DO i=1, 10
    A(i) = foo()
END DO
!$omp end taskloop  ! producer waits here for all loop tasks to finish

!do other stuff

!$omp taskloop
DO j=1, 10
    B(j) = A(j)
END DO
!$omp end taskloop ! producer waits here for all loop tasks to finish

!$omp end single
!$omp end parallel

So the second taskloop cannot start before every task of the first one has finished, and the dependency on A is safe. If the first taskloop creates fewer tasks than there are worker threads waiting in the barrier (n-1), then some of these threads will idle.
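The behavior described above can be checked with a small self-contained program. This is only a sketch: the expression assigned to A(i) is a hypothetical stand-in for foo(), and the "other stuff" is left out. It assumes a compiler with OpenMP 4.5 or later (for example gfortran with -fopenmp).

```fortran
program taskloop_sync
  implicit none
  integer, parameter :: n = 10
  integer :: i, j
  real :: A(n), B(n)

  A = 0.0
  B = -1.0

  !$omp parallel
  !$omp single

  !$omp taskloop num_tasks(10)
  do i = 1, n
     A(i) = real(i) * 2.0   ! hypothetical stand-in for foo()
  end do
  !$omp end taskloop        ! implicit taskgroup: all loop tasks are done here

  !$omp taskloop
  do j = 1, n
     B(j) = A(j)            ! safe: A was fully written before this loop starts
  end do
  !$omp end taskloop

  !$omp end single
  !$omp end parallel

  ! Verify after the parallel region that every element was copied
  if (all(B == A)) then
     print '(a)', 'B matches A'
  else
     print '(a)', 'B differs from A'
  end if
end program taskloop_sync
```

Running it with different values of OMP_NUM_THREADS should always report that B matches A, because the producer thread does not leave the first taskloop until its tasks have completed.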

If you want more overlap and if the "other stuff" is independent of the first taskloop, then you can do this:

!$omp parallel
!$omp single


!$omp taskgroup
!$omp taskloop num_tasks(10) nogroup
DO i=1, 10
    A(i) = foo()
END DO
!$omp end taskloop  ! producer will not wait for the loop tasks to complete

!do other stuff

!$omp end taskgroup ! wait for the loop tasks (and their descendant tasks)

!$omp taskloop
DO j=1, 10
    B(j) = A(j)
END DO
!$omp end taskloop

!$omp end single
!$omp end parallel
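A runnable sketch of this overlap pattern, under the same assumptions as before: the value assigned to A(i) is a hypothetical stand-in for foo(), and the "other stuff" is represented by a small serial sum (the variable named other is made up for illustration) that does not touch A.

```fortran
program taskloop_overlap
  implicit none
  integer, parameter :: n = 10
  integer :: i, j, k
  real :: A(n), B(n), other

  A = 0.0
  B = -1.0
  other = 0.0

  !$omp parallel
  !$omp single

  !$omp taskgroup
  !$omp taskloop num_tasks(10) nogroup
  do i = 1, n
     A(i) = real(i) * 2.0   ! hypothetical stand-in for foo()
  end do
  !$omp end taskloop        ! nogroup: producer continues immediately

  ! "Other stuff": done by the producer while the loop tasks run.
  ! It must not read or write A.
  do k = 1, n
     other = other + real(k)
  end do

  !$omp end taskgroup       ! wait here for the tasks of the first taskloop

  !$omp taskloop
  do j = 1, n
     B(j) = A(j)
  end do
  !$omp end taskloop

  !$omp end single
  !$omp end parallel

  if (all(B == A) .and. other == 55.0) then
     print '(a)', 'overlap ok'
  else
     print '(a)', 'overlap failed'
  end if
end program taskloop_overlap
```

The explicit taskgroup takes over the role of the implicit one, so the second taskloop still starts only after A is complete.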

Alas, the OpenMP API as of version 5.1 does not support task dependences for the taskloop construct, so you cannot easily describe the dependency between the loop iterations of the first taskloop and the second taskloop. The OpenMP language committee is working on this right now, but I do not see this being implemented for the OpenMP API version 5.2, but rather for version 6.0.
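If element-wise dependences are really needed, one possible workaround is to drop taskloop and create explicit tasks with depend clauses. Note that this is a different technique from the taskloop-based code above: it creates one task per iteration by hand, which adds overhead for cheap loop bodies. As before, the value assigned to A(i) is a hypothetical stand-in for foo().

```fortran
program task_depend
  implicit none
  integer, parameter :: n = 10
  integer :: i, j
  real :: A(n), B(n)

  A = 0.0
  B = -1.0

  !$omp parallel
  !$omp single

  ! One explicit task per element; each announces that it writes A(i)
  do i = 1, n
     !$omp task firstprivate(i) depend(out: A(i))
     A(i) = real(i) * 2.0   ! hypothetical stand-in for foo()
     !$omp end task
  end do

  ! Each consumer task waits only for the producer of its own A(j)
  do j = 1, n
     !$omp task firstprivate(j) depend(in: A(j))
     B(j) = A(j)
     !$omp end task
  end do

  !$omp end single          ! implicit barrier: all tasks complete here
  !$omp end parallel

  if (all(B == A)) then
     print '(a)', 'B matches A'
  else
     print '(a)', 'B differs from A'
  end if
end program task_depend
```

With this formulation, the task computing B(j) may start as soon as A(j) is ready, without waiting for the rest of the first loop.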

PS (EDIT): Because the second taskloop is right before the end of the single construct, and thus right before a barrier, you can add nogroup there as well to avoid that extra bit of waiting for the producer thread.