I am using dynamic scheduling for the loop iterations. But when the work in each iteration is too small, or when there is a huge number of threads, some threads do no work at all. E.g., there are 100 iterations and 90 threads: I want every thread to do at least one iteration, and the remaining 10 iterations to be distributed to the threads that have already finished theirs. How can I do that?
You cannot force the OpenMP runtime to do this. However, you can give hints to the OpenMP runtime so that it will likely do it when (it decides that) it is possible, at the cost of higher overhead. One way is to specify the granularity (chunk size) of the dynamically scheduled loop. Here is an example:
#pragma omp parallel for schedule(dynamic, 1)
for (int i = 0; i < 100; ++i)
    compute(i);
With such a code, the runtime is free to share the work evenly between threads (using a work-sharing scheduler) or to let threads steal work from a master thread that drives the parallel computation (using a work-stealing scheduler). In the second approach, although the granularity is one loop iteration, some threads can still steal more work than they actually need (e.g., to improve performance in general). If the loop iterations are fast enough, the work will probably not be balanced between threads.
Creating 90 threads is costly, and distributing work to 90 threads is also far from free, as it is mostly bounded by the relatively high latency of atomic operations, their scalability, and the latency of waking threads up. Moreover, while such operations appear synchronous from the user's point of view, this is not the case in practice (especially with 90 threads and on multi-socket NUMA-based architectures). As a result, some threads may already have finished computing one iteration of the loop while others are not yet aware of the parallel computation, or have not even been created yet. The overhead of making threads aware of the work to be done generally grows as the number of threads increases. In some cases, this overhead can be higher than the actual computation, and it can be more efficient to use fewer threads.
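If you want to see where this overhead starts to dominate on your machine, a simple micro-benchmark like the following sketch can help (compute(i) here is only a placeholder for your real per-iteration work):

#include <omp.h>
#include <stdio.h>

/* Placeholder for the real per-iteration work. */
void compute(int i) { (void)i; }

/* Time the same dynamically scheduled loop with a given thread count.
   Note: the first parallel region also pays the thread creation cost. */
void benchmark(int num_threads)
{
    double start = omp_get_wtime();
    #pragma omp parallel for schedule(dynamic, 1) num_threads(num_threads)
    for (int i = 0; i < 100; ++i)
        compute(i);
    printf("%3d threads: %g s\n", num_threads, omp_get_wtime() - start);
}

int main(void)
{
    /* Doubling thread counts, then the 90 threads from the question. */
    for (int t = 1; t <= 64; t *= 2)
        benchmark(t);
    benchmark(90);
    return 0;
}

Compile with, e.g., gcc -fopenmp bench.c and compare the timings: past some thread count the time should stop decreasing, or even start increasing.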
OpenMP runtime developers sometimes have to trade work balancing for smaller communication overheads. Those decisions can perform badly in your case but improve the scalability of other kinds of applications. This is especially true for work-stealing schedulers (e.g., the Clang/ICC OpenMP runtime). Note that improving the scalability of OpenMP runtimes is an ongoing research field.
I advise you to try multiple OpenMP runtimes (including research ones, which may or may not be suitable for production code). You can also play with the OMP_WAIT_POLICY environment variable to reduce the overhead of waking threads up. Another option is to use OpenMP tasks to more strongly push the runtime not to merge iterations. I also advise you to profile your code to see what is going on and to find potential software/hardware bottlenecks.
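Here is a minimal sketch of the task-based approach, using taskloop with grainsize(1) (OpenMP 4.5+); compute(i) stands for your per-iteration work:

#include <omp.h>

void compute(int i);  /* your per-iteration work (assumed to exist) */

void run(void)
{
    #pragma omp parallel
    #pragma omp single
    /* taskloop with grainsize(1) creates one task per iteration, so
       idle threads can pick up individual iterations instead of the
       runtime merging them into larger chunks. */
    #pragma omp taskloop grainsize(1)
    for (int i = 0; i < 100; ++i)
        compute(i);
}

Regarding OMP_WAIT_POLICY, setting it to ACTIVE (e.g., OMP_WAIT_POLICY=ACTIVE ./your_program) keeps waiting threads spinning, which can reduce the wake-up latency at the cost of burning CPU cycles.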
If you use more OpenMP threads than there are hardware threads on your machine, the processor cannot execute them simultaneously (it can only run one OpenMP thread on each hardware thread). Consequently, the operating system schedules the OpenMP threads onto the hardware threads so that they appear to run simultaneously from the user's point of view. In reality, they are not running simultaneously but are executed in an interleaved way, each for a very small quantum of time (e.g., 100 ms).
For example, if you have a processor with 8 hardware threads and you use 8 OpenMP threads, you can roughly assume that they will run simultaneously. But if you use 16 OpenMP threads, your operating system can choose to interleave them across scheduling quanta.
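For instance (the quantum length and the exact pattern are up to the OS; this is only one plausible schedule):

0-100 ms   : OMP threads 0-7  run on hardware threads 0-7
100-200 ms : OMP threads 8-15 run on hardware threads 0-7
200-300 ms : OMP threads 0-7  run again, and so on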
If your computation lasts less than 100 ms, the OpenMP dynamic/guided schedulers will move the work of the 8 last threads to the 8 first threads so that the overall execution time is shorter. Consequently, the 8 first threads can execute all the work, and the 8 last threads will have nothing left to do once they are finally executed. This is the cause of the work imbalance between threads.
Thus, if you want to measure the performance of an OpenMP program, you shall NOT use more OpenMP threads than there are hardware threads (unless you know exactly what you are doing and are fully aware of such effects).
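For instance, a minimal sketch that caps the thread count to what the hardware can actually run simultaneously (compute(i) is again assumed to be your per-iteration work):

#include <omp.h>

void compute(int i);  /* your per-iteration work (assumed to exist) */

int main(void)
{
    /* omp_get_num_procs() reports the number of processors available
       to the program; using more OpenMP threads than this leads to
       the time-slicing effects described above. */
    omp_set_num_threads(omp_get_num_procs());

    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < 100; ++i)
        compute(i);
    return 0;
}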