Search code examples
fortranopenmp

How does work the OpenMP "nonmonotonic:dynamic" schedule?


The two main OpenMP loop scheduling options are static and dynamic. The latter can be more efficient if the workloads are not balanced between the iterations, but less efficient in the case of balanced workloads (because the scheduling itself is more expensive).

The nonmonotonic:dynamic option that appeared in recent specifications is supposed to address this problem, by getting more or less the best of the two worlds: it starts like a static scheduling, but with a chunk size of 1 (by default, like the default chuck size of the dynamic option), and a thread that completes early all of its pre-attributed threads can "steal" the non-processed chunks from the other threads (at least this is my understanding).

However, when testing it, it's not really what happens. I have tested the following code (in Fortran, but it doesn'tt really matter) with gfortran 13, ifort 18, 19, and 21 (unfortunately I don't have access to more recent versions).

In short, it used 2 threads to execute 4 iterations. The first iteration is artificially delayed to see how the next iterations are attributed to the threads:

program foo
implicit none

    integer, parameter :: N=4
    integer :: i, t(N), o, order(N)
    real :: x(N)
    integer, external :: omp_get_thread_num

    call omp_set_num_threads(2)

    o = 0
    !$OMP PARALLEL DO SCHEDULE(static)
    do i = 1, N
        call loopcontent(i,o,order,x,t)
    end do
    !$OMP END PARALLEL DO
    call print_results("STATIC")

    o = 0
    !$OMP PARALLEL DO SCHEDULE(dynamic)
    do i = 1, N
        call loopcontent(i,o,order,x,t)
    end do
    !$OMP END PARALLEL DO
    call print_results("DYNAMIC")

    o = 0
    !$OMP PARALLEL DO SCHEDULE(nonmonotonic:dynamic)
    do i = 1, N
        call loopcontent(i,o,order,x,t)
    end do
    !$OMP END PARALLEL DO
    call print_results("NONMONOTONIC:DYNAMIC") 

contains 

    subroutine loopcontent(i,o,order,x,t)
        integer, intent(in) :: i
        integer, intent(inout) :: o, order(:), t(:)
        real, intent(inout) :: x(:)

        integer :: j

        !$OMP CRITICAL
        o = o+1
        order(i) = o
        !$OMP END CRITICAL
        x(i) = 0.0
        if (i == 1) then
            do j = 1, 10000000
                x(i) = cos(x(i))
            end do
        end if
        t(i) = omp_get_thread_num()
    end subroutine

    subroutine print_results(what)
        character(*), intent(in) :: what

        print*
        print "(A)", "==============================="
        print "(A)", what
        print "(A)", "    loop  thread  order"
        do i = 1, N
            print "(3I7,F10.3)", i, t(i), order(i), x(i)
        end do
    end subroutine

end

The 4 compilers behave the same (and as expected) or the static and dynamic cases, so I'm just pasting one output:

===============================
STATIC
    loop  thread  order
      1      0      1     0.739
      2      0      4     0.000
      3      1      2     0.000
      4      1      3     0.000

===============================
DYNAMIC
    loop  thread  order
      1      0      1     0.739
      2      1      2     0.000
      3      1      3     0.000
      4      1      4     0.000

Regarding the nonmonotonic:dynamic option, gfortran 13 and ifort 18/21 behave the same, and the execution order is not different from the dynamic option. Instead, I would expect the thread 1 to execute the iterations 2, 4, 3, in this order (the iteration 3 being "stolen" to the thread 0 once the iteration 2, 4 are completed):

===============================
NONMONOTONIC:DYNAMIC
    loop  thread  order
      1      0      1     0.739
      2      1      2     0.000
      3      1      3     0.000
      4      1      4     0.000

And ifort 19 behaves like with the static option, so still not as expected:

===============================
NONMONOTONIC:DYNAMIC
    loop  thread  order
      1      0      1     0.739
      2      0      4     0.000
      3      1      2     0.000
      4      1      3     0.000

Is my understanding of nonmonotonic:dynamic wrong, or is it an (non-)implementation issue (the compilers recognize the option, but just map it to static or dynamic)? Could someone test with a more recent ifx version?


EDIT

Results obtained by Victor Eijkhout with the Intel ifx 23 compiler (after some modification to the code, likely as a workaround to a compiler bug; see answer below):

===============================
STATIC
    loop  thread  order
      1      0      1     0.739
      2      0      4     0.000
      3      1      2     0.000
      4      1      3     0.000

===============================
DYNAMIC
    loop  thread  order
      1      0      1     0.739
      2      1      4     0.000
      3      1      2     0.000
      4      1      3     0.000

===============================
NONMONOTONIC:DYNAMIC
    loop  thread  order
      1      0      1     0.739
      2      1      4     0.000
      3      1      2     0.000
      4      1      3     0.000

dynamic and nonmonotonic:dynamic are identical, but they differ from what is obtained with the other tested compilers. The 5.2 specification of OpenMP indeed says:

If an ordering-modifier is not31 specified, the effect is as if the monotonic modifier is specified if the kind argument is static or an ordered clause is specified on the construct; otherwise, the effect is as if the nonmonotonic modifier is specified

I though that monotonic was always the default, but it's not and my test should have tested monotonic:dynamic instead of simply dynamic.

Anyway, ifx 23 is the only one that shows a behavior that corresponds to a nonmonotonic intent, that is, a given thread can possibly process iterations in any order (instead of in increasing order). Still, the implementation is unclear:

  • if the chunk size is 1, I would expect thread 1 to execute the iterations 2/4/3, in the that order. Instead, it executes iterations 3/4/2 (note that this is compatible with the specification, but why starting with the 3rd iteration, which is also the 3rd chunk?)
  • if the chunk size is 2 (as in the static case), I would expect thread 1 to execute the iterations 3/4 (which belong to a single chunk)... and that's it. Instead it executes also the iteration 2 at the end, somehow stealing it from the thread 0, which started handling the first chunk (iterations 1/2).

The specification says very little about nonmonotonic, just:

When the nonmonotonic ordering-modifier is specified then chunks may be assigned to threads in any order

My understanding is that whole chunks are attributed, and once a thread has been assigned a chunk, it always executes all the iterations of the chunk (i.e. iterations cannot be stolen by another thread). So, what ifx 23 does is still unclear.


Solution

  • I have slightly modified the test code to better see the behaviors:

    • n=8 iterations instead of 4 (still with nt=2 threads)
    • as before, only the first has a significant (artificial) workload to ensure it's delayed and takes much more time than all other iterations, which basically do nothing
    • testing explicitly the monotonic:dynamic case, as nonmonotonic is actually now the default for the dynamic scheduling
    • testing the default chunk (which is 1 for the dynamic case), but also chunks of sizes 2 and 4

    The tests are performed with gfortran 13 and Intel ifx 2024. For each test 4 columns are printed:

    • The iteration number
    • The thread number that executes the given iteration
    • The "order" of the given iteration (e.g. "3" means that this iterations starts executing after 2 other iterations)
    • a real number that is supposed to be 0.739 for the 1st iteration after some fake computations, and 0.000 for the other iterations.

    The static and monotonic:dynamic schedule work as expected for both compilers:

    • In the static schedule, the chunk size is n/nt=4 iterations, the thread 0 executes the iterations 1-4 and the thread 1 executes the iterations 5-8. Because the iteration 1 is much longer than the other ones, the iterations 2-4 are executed after all the other iterations
    • In the monotonic:dynamic schedule, the chunk size is 1 iteration; the thread 0 takes the iteration 1 and spends the rest of the program execution on it, while the thread 1 takes the iterations 2-8 in that order.

    For the nonmonotonic:dynamic schedule, the two compilers behave differently, though. gfortran 13 gives the same result than the monotonic:dynamic case: probably they didn't have time to develop this new schedule and have just mapped nonmonotonic to monotonic; I believe this is authorized by the OpenMP specification for nonmonotonic, which just opens up the possibility for a thread to iterate backwards, without mandating it.

    So, the interesting case is the Intel compiler, which apparently implements the new schedule (and did it at least from ifort 2021, as I got the same results with it than with ifx 2024):

    • The default chunk size is still 1, BUT in contrast to the monotonic case the thread 1 starts with the iteration 5 instead of 2. Then it executes the iterations 6, 7, 8. At this point the thread 0 is still occupied with iteration 1, and the thread 0 starts going backward by executing the iterations 4, 3, 2 in that order.
    • What happens:
      • the initial distribution of the iterations is the same as the static schedule with the default chunk size, regardless the chunk size 1, 2, or 4 (note that the initial ditribution would be different with a chunk size that would not be a divider of the number of iterations, though).
      • all the iterations are pre-attributed from the beginning, in constrast to the usual monotonic case where there is no pre-attributation (each thread simply dynamically takes the next unprocessed chunk).
      • once a thread has finished with the iterations that were pre-distributed to it, it starts "stealing" the unprocessed iterations that were pre-attributed to the other threads
      • as shown by the cases with chunk sizes 2 and 4, a thread won't steal iterations from a chunk that another thread has started to process: the distribution remains chunk based (all the iterations of a sigle thread are processed by a single thread.

    gfortran 13:

    ===============================
    STATIC
        loop  thread  order
          1      0      1     0.739
          2      0      6     0.000
          3      0      7     0.000
          4      0      8     0.000
          5      1      2     0.000
          6      1      3     0.000
          7      1      4     0.000
          8      1      5     0.000
    
    ===============================
    MONOTONIC:DYNAMIC
        loop  thread  order
          1      1      1     0.739
          2      0      2     0.000
          3      0      3     0.000
          4      0      4     0.000
          5      0      5     0.000
          6      0      6     0.000
          7      0      7     0.000
          8      0      8     0.000
    
    ===============================
    NONMONOTONIC:DYNAMIC
        loop  thread  order
          1      1      1     0.739
          2      0      2     0.000
          3      0      3     0.000
          4      0      4     0.000
          5      0      5     0.000
          6      0      6     0.000
          7      0      7     0.000
          8      0      8     0.000
    
    ===============================
    NONMONOTONIC:DYNAMIC,2
        loop  thread  order
          1      1      1     0.739
          2      1      8     0.000
          3      0      2     0.000
          4      0      3     0.000
          5      0      4     0.000
          6      0      5     0.000
          7      0      6     0.000
          8      0      7     0.000
    
    ===============================
    NONMONOTONIC:DYNAMIC,4
        loop  thread  order
          1      1      1     0.739
          2      1      6     0.000
          3      1      7     0.000
          4      1      8     0.000
          5      0      2     0.000
          6      0      3     0.000
          7      0      4     0.000
          8      0      5     0.000
    

    Intel ifx 2024

    ===============================
    STATIC
        loop  thread  order
          1      0      1     0.739
          2      0      6     0.000
          3      0      7     0.000
          4      0      8     0.000
          5      1      2     0.000
          6      1      3     0.000
          7      1      4     0.000
          8      1      5     0.000
     
    ===============================
    MONOTONIC:DYNAMIC
        loop  thread  order
          1      1      1     0.739
          2      0      2     0.000
          3      0      3     0.000
          4      0      4     0.000
          5      0      5     0.000
          6      0      6     0.000
          7      0      7     0.000
          8      0      8     0.000
     
    ===============================
    NONMONOTONIC:DYNAMIC
        loop  thread  order
          1      0      1     0.739
          2      1      8     0.000
          3      1      7     0.000
          4      1      6     0.000
          5      1      2     0.000
          6      1      3     0.000
          7      1      4     0.000
          8      1      5     0.000
     
    ===============================
    NONMONOTONIC:DYNAMIC,2
        loop  thread  order
          1      0      1     0.739
          2      0      8     0.000
          3      1      6     0.000
          4      1      7     0.000
          5      1      2     0.000
          6      1      3     0.000
          7      1      4     0.000
          8      1      5     0.000
     
    ===============================
    NONMONOTONIC:DYNAMIC,4
        loop  thread  order
          1      0      1     0.739
          2      0      6     0.000
          3      0      7     0.000
          4      0      8     0.000
          5      1      2     0.000
          6      1      3     0.000
          7      1      4     0.000
          8      1      5     0.000
    

    Code:

    program foo
    implicit none
    
        integer, parameter :: N=8
        integer :: i, t(N), o, order(N)
        real :: x(N)
        integer, external :: omp_get_thread_num
    
        call omp_set_num_threads(2)
    
        o = 0
        !$OMP PARALLEL DO SCHEDULE(static)
        do i = 1, N
            call loopcontent(i,o,order,x,t)
        end do
        !$OMP END PARALLEL DO
        call print_results("STATIC",order,x,t)
    
        o = 0
        !$OMP PARALLEL DO SCHEDULE(monotonic:dynamic)
        do i = 1, N
            call loopcontent(i,o,order,x,t)
        end do
        !$OMP END PARALLEL DO
        call print_results("MONOTONIC:DYNAMIC",order,x,t)
    
        o = 0
        !$OMP PARALLEL DO SCHEDULE(nonmonotonic:dynamic)
        do i = 1, N
            call loopcontent(i,o,order,x,t)
        end do
        !$OMP END PARALLEL DO
        call print_results("NONMONOTONIC:DYNAMIC",order,x,t) 
    
        o = 0
        !$OMP PARALLEL DO SCHEDULE(nonmonotonic:dynamic,2)
        do i = 1, N
            call loopcontent(i,o,order,x,t)
        end do
        !$OMP END PARALLEL DO
        call print_results("NONMONOTONIC:DYNAMIC,2",order,x,t) 
    
        o = 0
        !$OMP PARALLEL DO SCHEDULE(nonmonotonic:dynamic,4)
        do i = 1, N
            call loopcontent(i,o,order,x,t)
        end do
        !$OMP END PARALLEL DO
        call print_results("NONMONOTONIC:DYNAMIC,4",order,x,t) 
    
    contains 
    
        subroutine loopcontent(i,o,order,x,t)
            integer, intent(in) :: i
            integer, intent(inout) :: o, order(:), t(:)
            real, intent(inout) :: x(:)
    
            integer :: j
    
            !$OMP CRITICAL
            o = o+1
            order(i) = o
            !$OMP END CRITICAL
            x(i) = 0.0
            if (i == 1) then
                do j = 1, 10000000
                    x(i) = cos(x(i))
                end do
            end if
            t(i) = omp_get_thread_num()
        end subroutine
    
        subroutine print_results(what,order,x,t)
            character(*), intent(in) :: what
             integer, intent(in) :: t(:), order(:)
             real, intent(in) :: x(:)
    
            print*
            print "(A)", "==============================="
            print "(A)", what
            print "(A)", "    loop  thread  order"
            do i = 1, N
                print "(3I7,F10.3)", i, t(i), order(i), x(i)
            end do
        end subroutine
    
    end