The two main OpenMP loop scheduling options are static
and dynamic
. The latter can be more efficient if the workloads are not balanced between the iterations, but less efficient in the case of balanced workloads (because the scheduling itself is more expensive).
The nonmonotonic:dynamic
option that appeared in recent specifications is supposed to address this problem, by getting more or less the best of the two worlds: it starts like a static scheduling, but with a chunk size of 1 (by default, like the default chuck size of the dynamic
option), and a thread that completes early all of its pre-attributed threads can "steal" the non-processed chunks from the other threads (at least this is my understanding).
However, when testing it, it's not really what happens. I have tested the following code (in Fortran, but it doesn'tt really matter) with gfortran 13, ifort 18, 19, and 21 (unfortunately I don't have access to more recent versions).
In short, it used 2 threads to execute 4 iterations. The first iteration is artificially delayed to see how the next iterations are attributed to the threads:
program foo
implicit none
integer, parameter :: N=4
integer :: i, t(N), o, order(N)
real :: x(N)
integer, external :: omp_get_thread_num
call omp_set_num_threads(2)
o = 0
!$OMP PARALLEL DO SCHEDULE(static)
do i = 1, N
call loopcontent(i,o,order,x,t)
end do
!$OMP END PARALLEL DO
call print_results("STATIC")
o = 0
!$OMP PARALLEL DO SCHEDULE(dynamic)
do i = 1, N
call loopcontent(i,o,order,x,t)
end do
!$OMP END PARALLEL DO
call print_results("DYNAMIC")
o = 0
!$OMP PARALLEL DO SCHEDULE(nonmonotonic:dynamic)
do i = 1, N
call loopcontent(i,o,order,x,t)
end do
!$OMP END PARALLEL DO
call print_results("NONMONOTONIC:DYNAMIC")
contains
subroutine loopcontent(i,o,order,x,t)
integer, intent(in) :: i
integer, intent(inout) :: o, order(:), t(:)
real, intent(inout) :: x(:)
integer :: j
!$OMP CRITICAL
o = o+1
order(i) = o
!$OMP END CRITICAL
x(i) = 0.0
if (i == 1) then
do j = 1, 10000000
x(i) = cos(x(i))
end do
end if
t(i) = omp_get_thread_num()
end subroutine
subroutine print_results(what)
character(*), intent(in) :: what
print*
print "(A)", "==============================="
print "(A)", what
print "(A)", " loop thread order"
do i = 1, N
print "(3I7,F10.3)", i, t(i), order(i), x(i)
end do
end subroutine
end
The 4 compilers behave the same (and as expected) or the static
and dynamic
cases, so I'm just pasting one output:
===============================
STATIC
loop thread order
1 0 1 0.739
2 0 4 0.000
3 1 2 0.000
4 1 3 0.000
===============================
DYNAMIC
loop thread order
1 0 1 0.739
2 1 2 0.000
3 1 3 0.000
4 1 4 0.000
Regarding the nonmonotonic:dynamic
option, gfortran 13 and ifort 18/21 behave the same, and the execution order is not different from the dynamic
option. Instead, I would expect the thread 1 to execute the iterations 2, 4, 3, in this order (the iteration 3 being "stolen" to the thread 0 once the iteration 2, 4 are completed):
===============================
NONMONOTONIC:DYNAMIC
loop thread order
1 0 1 0.739
2 1 2 0.000
3 1 3 0.000
4 1 4 0.000
And ifort 19 behaves like with the static
option, so still not as expected:
===============================
NONMONOTONIC:DYNAMIC
loop thread order
1 0 1 0.739
2 0 4 0.000
3 1 2 0.000
4 1 3 0.000
Is my understanding of nonmonotonic:dynamic
wrong, or is it an (non-)implementation issue (the compilers recognize the option, but just map it to static
or dynamic
)? Could someone test with a more recent ifx version?
Results obtained by Victor Eijkhout with the Intel ifx 23 compiler (after some modification to the code, likely as a workaround to a compiler bug; see answer below):
===============================
STATIC
loop thread order
1 0 1 0.739
2 0 4 0.000
3 1 2 0.000
4 1 3 0.000
===============================
DYNAMIC
loop thread order
1 0 1 0.739
2 1 4 0.000
3 1 2 0.000
4 1 3 0.000
===============================
NONMONOTONIC:DYNAMIC
loop thread order
1 0 1 0.739
2 1 4 0.000
3 1 2 0.000
4 1 3 0.000
dynamic
and nonmonotonic:dynamic
are identical, but they differ from what is obtained with the other tested compilers. The 5.2 specification of OpenMP indeed says:
If an ordering-modifier is not31 specified, the effect is as if the monotonic modifier is specified if the kind argument is static or an ordered clause is specified on the construct; otherwise, the effect is as if the nonmonotonic modifier is specified
I though that monotonic
was always the default, but it's not and my test should have tested monotonic:dynamic
instead of simply dynamic
.
Anyway, ifx 23 is the only one that shows a behavior that corresponds to a nonmonotonic
intent, that is, a given thread can possibly process iterations in any order (instead of in increasing order). Still, the implementation is unclear:
static case
), I would expect thread 1 to execute the iterations 3/4 (which belong to a single chunk)... and that's it. Instead it executes also the iteration 2 at the end, somehow stealing it from the thread 0, which started handling the first chunk (iterations 1/2).The specification says very little about nonmonotonic
, just:
When the nonmonotonic ordering-modifier is specified then chunks may be assigned to threads in any order
My understanding is that whole chunks are attributed, and once a thread has been assigned a chunk, it always executes all the iterations of the chunk (i.e. iterations cannot be stolen by another thread). So, what ifx 23 does is still unclear.
I have slightly modified the test code to better see the behaviors:
monotonic:dynamic
case, as nonmonotonic
is actually now the default for the dynamic
schedulingdynamic
case), but also chunks of sizes 2 and 4The tests are performed with gfortran 13 and Intel ifx 2024. For each test 4 columns are printed:
The static
and monotonic:dynamic
schedule work as expected for both compilers:
static
schedule, the chunk size is n/nt=4 iterations, the thread 0 executes the iterations 1-4 and the thread 1 executes the iterations 5-8. Because the iteration 1 is much longer than the other ones, the iterations 2-4 are executed after all the other iterationsmonotonic:dynamic
schedule, the chunk size is 1 iteration; the thread 0 takes the iteration 1 and spends the rest of the program execution on it, while the thread 1 takes the iterations 2-8 in that order.For the nonmonotonic:dynamic
schedule, the two compilers behave differently, though. gfortran 13 gives the same result than the monotonic:dynamic
case: probably they didn't have time to develop this new schedule and have just mapped nonmonotonic
to monotonic
; I believe this is authorized by the OpenMP specification for nonmonotonic
, which just opens up the possibility for a thread to iterate backwards, without mandating it.
So, the interesting case is the Intel compiler, which apparently implements the new schedule (and did it at least from ifort 2021, as I got the same results with it than with ifx 2024):
static
schedule with the default chunk size, regardless the chunk size 1, 2, or 4 (note that the initial ditribution would be different with a chunk size that would not be a divider of the number of iterations, though).monotonic
case where there is no pre-attributation (each thread simply dynamically takes the next unprocessed chunk).gfortran 13:
===============================
STATIC
loop thread order
1 0 1 0.739
2 0 6 0.000
3 0 7 0.000
4 0 8 0.000
5 1 2 0.000
6 1 3 0.000
7 1 4 0.000
8 1 5 0.000
===============================
MONOTONIC:DYNAMIC
loop thread order
1 1 1 0.739
2 0 2 0.000
3 0 3 0.000
4 0 4 0.000
5 0 5 0.000
6 0 6 0.000
7 0 7 0.000
8 0 8 0.000
===============================
NONMONOTONIC:DYNAMIC
loop thread order
1 1 1 0.739
2 0 2 0.000
3 0 3 0.000
4 0 4 0.000
5 0 5 0.000
6 0 6 0.000
7 0 7 0.000
8 0 8 0.000
===============================
NONMONOTONIC:DYNAMIC,2
loop thread order
1 1 1 0.739
2 1 8 0.000
3 0 2 0.000
4 0 3 0.000
5 0 4 0.000
6 0 5 0.000
7 0 6 0.000
8 0 7 0.000
===============================
NONMONOTONIC:DYNAMIC,4
loop thread order
1 1 1 0.739
2 1 6 0.000
3 1 7 0.000
4 1 8 0.000
5 0 2 0.000
6 0 3 0.000
7 0 4 0.000
8 0 5 0.000
Intel ifx 2024
===============================
STATIC
loop thread order
1 0 1 0.739
2 0 6 0.000
3 0 7 0.000
4 0 8 0.000
5 1 2 0.000
6 1 3 0.000
7 1 4 0.000
8 1 5 0.000
===============================
MONOTONIC:DYNAMIC
loop thread order
1 1 1 0.739
2 0 2 0.000
3 0 3 0.000
4 0 4 0.000
5 0 5 0.000
6 0 6 0.000
7 0 7 0.000
8 0 8 0.000
===============================
NONMONOTONIC:DYNAMIC
loop thread order
1 0 1 0.739
2 1 8 0.000
3 1 7 0.000
4 1 6 0.000
5 1 2 0.000
6 1 3 0.000
7 1 4 0.000
8 1 5 0.000
===============================
NONMONOTONIC:DYNAMIC,2
loop thread order
1 0 1 0.739
2 0 8 0.000
3 1 6 0.000
4 1 7 0.000
5 1 2 0.000
6 1 3 0.000
7 1 4 0.000
8 1 5 0.000
===============================
NONMONOTONIC:DYNAMIC,4
loop thread order
1 0 1 0.739
2 0 6 0.000
3 0 7 0.000
4 0 8 0.000
5 1 2 0.000
6 1 3 0.000
7 1 4 0.000
8 1 5 0.000
Code:
program foo
implicit none
integer, parameter :: N=8
integer :: i, t(N), o, order(N)
real :: x(N)
integer, external :: omp_get_thread_num
call omp_set_num_threads(2)
o = 0
!$OMP PARALLEL DO SCHEDULE(static)
do i = 1, N
call loopcontent(i,o,order,x,t)
end do
!$OMP END PARALLEL DO
call print_results("STATIC",order,x,t)
o = 0
!$OMP PARALLEL DO SCHEDULE(monotonic:dynamic)
do i = 1, N
call loopcontent(i,o,order,x,t)
end do
!$OMP END PARALLEL DO
call print_results("MONOTONIC:DYNAMIC",order,x,t)
o = 0
!$OMP PARALLEL DO SCHEDULE(nonmonotonic:dynamic)
do i = 1, N
call loopcontent(i,o,order,x,t)
end do
!$OMP END PARALLEL DO
call print_results("NONMONOTONIC:DYNAMIC",order,x,t)
o = 0
!$OMP PARALLEL DO SCHEDULE(nonmonotonic:dynamic,2)
do i = 1, N
call loopcontent(i,o,order,x,t)
end do
!$OMP END PARALLEL DO
call print_results("NONMONOTONIC:DYNAMIC,2",order,x,t)
o = 0
!$OMP PARALLEL DO SCHEDULE(nonmonotonic:dynamic,4)
do i = 1, N
call loopcontent(i,o,order,x,t)
end do
!$OMP END PARALLEL DO
call print_results("NONMONOTONIC:DYNAMIC,4",order,x,t)
contains
subroutine loopcontent(i,o,order,x,t)
integer, intent(in) :: i
integer, intent(inout) :: o, order(:), t(:)
real, intent(inout) :: x(:)
integer :: j
!$OMP CRITICAL
o = o+1
order(i) = o
!$OMP END CRITICAL
x(i) = 0.0
if (i == 1) then
do j = 1, 10000000
x(i) = cos(x(i))
end do
end if
t(i) = omp_get_thread_num()
end subroutine
subroutine print_results(what,order,x,t)
character(*), intent(in) :: what
integer, intent(in) :: t(:), order(:)
real, intent(in) :: x(:)
print*
print "(A)", "==============================="
print "(A)", what
print "(A)", " loop thread order"
do i = 1, N
print "(3I7,F10.3)", i, t(i), order(i), x(i)
end do
end subroutine
end