fortran, openmp

OMP_set_dynamic() less helpful than I expected?


I hope everyone is doing well!

I am new to OpenMP, and I guess this question is very basic, but I couldn't find a good answer and am looking forward to any advice.

I am running my code with OpenMP on a cluster. I have CALL OMP_set_dynamic(.TRUE.) in my code, and I use OpenMP to parallelize the following loop:

!$OMP PARALLEL DO COLLAPSE(2) DEFAULT(PRIVATE) SHARED(var1, var2) SCHEDULE(DYNAMIC)
DO i = 1,NI
  DO j = 1,NJ
    ......
  END DO
END DO
!$OMP END PARALLEL DO

I set NI = 123 and NJ = 121.

I think I have access to at least 20 CPUs on the cluster; htop shows that I actually have 24. But when I run the code, it seems that not all CPUs are used: htop displays "Tasks: 43, 115 thr; 13 running". (Sorry, I haven't been allowed to insert a picture.) I also tried another version without COLLAPSE(2), but there was no big difference.

My guess is that I don't have enough tasks to keep all the CPUs busy, but I am not sure. Any advice is appreciated!

I am also looking forward to any advice on taking full advantage of the cluster beyond my use of CALL OMP_set_dynamic(.TRUE.) and SCHEDULE(DYNAMIC). Comments on whether I should use COLLAPSE(2) or not are also welcome. Thanks!


Solution

  • OMP_set_dynamic() and SCHEDULE(DYNAMIC) are two unrelated things.

    The former controls dynamic parallelism in OpenMP, which is the ability of the runtime to decide how many threads are needed for a particular parallel region and start fewer than the specified maximum number of threads. For example, if you have a combined PARALLEL SECTIONS construct with two sections inside and you have OMP_NUM_THREADS set to 24, it makes no sense to spawn 24 threads for that region when there are only two tasks (sections) to execute.
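
    A minimal, self-contained sketch of that situation (the PRINT statements merely stand in for real work):

    PROGRAM sections_demo
      USE omp_lib
      CALL omp_set_dynamic(.TRUE.)
      ! With dynamic adjustment enabled, the runtime is free to start
      ! fewer threads than OMP_NUM_THREADS for this region, since only
      ! two sections can execute concurrently anyway.
      !$OMP PARALLEL SECTIONS
      !$OMP SECTION
      PRINT *, 'section 1 on thread', omp_get_thread_num()
      !$OMP SECTION
      PRINT *, 'section 2 on thread', omp_get_thread_num()
      !$OMP END PARALLEL SECTIONS
    END PROGRAM sections_demo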

    The latter tells OpenMP how to schedule the loop iterations between the threads. The DYNAMIC scheduling policy distributes the iterations on a first-come, first-served basis. Without a chunk size specification, e.g., SCHEDULE(DYNAMIC,100), the dynamic schedule defaults to a chunk size of 1, which means each iteration becomes an OpenMP task of its own. Given the COLLAPSE(2) clause, you have an iteration space of NI * NJ, or 14883 iterations, and the same number of OpenMP tasks. If you don't do enough work in the inner loop, the OpenMP overhead will completely swamp any performance benefit you get from running in parallel.
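
    For example, specifying a chunk size of 100 hands out the 14883 collapsed iterations in batches of 100, cutting the number of scheduling events roughly a hundredfold (a sketch reusing the loop and variables from the question):

    !$OMP PARALLEL DO COLLAPSE(2) DEFAULT(PRIVATE) SHARED(var1, var2) SCHEDULE(DYNAMIC,100)
    DO i = 1,NI
      DO j = 1,NJ
        ! loop body as in the question
      END DO
    END DO
    !$OMP END PARALLEL DO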

    Dynamic scheduling is only useful in situations where the work done in the loop body differs wildly from iteration to iteration. A typical example is rendering the Mandelbrot fractal: it takes far fewer iterations for points away from the Mandelbrot set to escape to infinity than for points near it. If the work per iteration is constant, e.g., you multiply each element of a matrix by a constant factor, then you shouldn't use dynamic but rather static scheduling.
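
    A sketch of the constant-work case, assuming a is a REAL array of shape (NI, NJ):

    ! Every iteration does the same amount of work, so STATIC scheduling,
    ! which assigns contiguous blocks of iterations once, up front, has
    ! the least overhead.
    !$OMP PARALLEL DO COLLAPSE(2) SCHEDULE(STATIC) SHARED(a) PRIVATE(i, j)
    DO j = 1, NJ          ! first index innermost: unit stride in Fortran
      DO i = 1, NI
        a(i, j) = 2.0 * a(i, j)
      END DO
    END DO
    !$OMP END PARALLEL DO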

    Whether you should use COLLAPSE(2) or not depends squarely on what is going on in the inner loop. Collapsing the iteration space means the runtime has to use integer division and modulo operations to reconstruct the original loop variables, which may slow down tight loops. It may also prevent vectorisation of the inner loop, which may or may not have a negative effect.
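
    If the inner loop vectorises well and the outer loop alone provides enough parallelism (123 iterations for 24 threads), a sketch of the non-collapsed alternative, again reusing the variables from the question:

    ! Parallelise only the outer loop and leave the inner loop intact
    ! as a candidate for compiler vectorisation.
    !$OMP PARALLEL DO DEFAULT(PRIVATE) SHARED(var1, var2) SCHEDULE(STATIC)
    DO i = 1,NI
      DO j = 1,NJ
        ! loop body as in the question
      END DO
    END DO
    !$OMP END PARALLEL DO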

    As pointed out by Vladimir F, it is hard to optimise even when you have your hands on the hardware, and practically impossible when you don't have a clue about the code. Although there are general things that you may be doing wrong, I'd suggest that you learn how to use tools such as Intel® VTune™ Profiler to profile your code and look for inefficiencies and performance bottlenecks on your own. If you don't have access to such tools on the cluster, install them on your own computer. OpenMP is well supported by all modern compilers except MS Visual C++ (which supports only a very old version), and it runs everywhere.