
htop and OpenMP threads


In my main function I set:

omp_set_num_threads(20);

which tells OpenMP to use 20 threads (the machine has 40 hardware threads available).

I then execute my code which contains the directive:

#pragma omp parallel for shared(x,y,z)

for the main for loop, and monitor CPU usage through htop (maybe not the best way, but still). There are 50 "tasks" the for loop must execute, and each takes quite a while. What I observe through htop is that the thread count drops as tasks finish. Specifically, using 20 threads, I expected to see 2000% CPU usage until there were fewer than 20 tasks remaining, after which the threads should "free" themselves. However, what I am seeing is 2000% at first, and after n tasks have completed, 2000% - (n*100%) CPU usage. Thus it seems that as the tasks complete, the threads shut down rather than picking up new tasks.

Is this to be expected or does that sound odd?


Solution

  • The default parallel loop scheduling for virtually all existing OpenMP compilers is static, which means that the OpenMP runtime will try to split the iteration space evenly between the threads and do a static work assignment. Since you have 50 iterations and 20 threads, the work cannot be split equally, as 20 does not divide 50. Therefore, half of the threads (10) will do three iterations each while the other half will do two iterations each.

    There is an implicit barrier at the end of the (combined parallel) for construct, where the threads that finish earlier wait for the rest of the threads to complete. Depending on the OpenMP implementation, the barrier might be implemented as a busy-waiting loop, as a wait operation on some OS synchronisation object, or as a combination of both. In the latter two cases, the CPU usage of the threads that hit the barrier will either immediately drop to zero as they go into interruptible sleep, or will initially remain at 100% for a short time (the busy loop) and then drop to zero (the wait).

    If the loop iterations take exactly the same amount of time, the CPU usage will be 2000% initially. Then, after the time needed for two iterations (and a bit more if the barrier implementation uses a short busy loop), it will drop to 1000%, because the ten threads with only two iterations have reached the barrier. If the iterations take different amounts of time, the threads will arrive at the barrier at different moments and the CPU usage will decrease gradually.

    In any case, use schedule(dynamic) to have each iteration given to the first thread that becomes available. This will improve the CPU utilisation when the iterations take varying amounts of time. It will not help when the iterations all take the same amount of time. The solution in that latter case is to make the number of iterations an integer multiple of the number of threads.
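
To make the two points above concrete, here is a minimal sketch. The names static_split, do_task and run_tasks are hypothetical and not taken from the question. static_split reproduces the arithmetic of an even block split, assuming the remainder goes to the lowest-numbered threads (the OpenMP specification only asks for approximately equal chunks, so the exact assignment may differ between runtimes). run_tasks shows the loop with schedule(dynamic), using a made-up workload whose cost grows with the iteration index. The _OPENMP guards only keep the file compilable when OpenMP is switched off.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Iterations per thread under an even block split.
// Assumption: the first (n_iter % n_threads) threads get one extra iteration.
std::vector<int> static_split(int n_iter, int n_threads) {
    std::vector<int> counts(n_threads, n_iter / n_threads);
    for (int t = 0; t < n_iter % n_threads; ++t) counts[t] += 1;
    return counts;
}

// Hypothetical stand-in for one long-running task; later tasks cost more,
// so the iterations take different amounts of time.
static double do_task(int i) {
    double acc = 0.0;
    for (long k = 0; k < 20000L * (i + 1); ++k) acc += 1.0 / (k + 1.0);
    return acc;
}

// Runs all tasks with dynamic scheduling and returns how many iterations
// each thread ended up executing.
std::vector<int> run_tasks(int n_tasks, int n_threads,
                           std::vector<double>& results) {
    std::vector<int> per_thread(n_threads, 0);
    results.assign(n_tasks, 0.0);
#ifdef _OPENMP
    omp_set_num_threads(n_threads);
#endif
    // schedule(dynamic): a thread fetches the next iteration as soon as it
    // finishes its current one, instead of owning a fixed block up front.
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n_tasks; ++i) {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        results[i] = do_task(i);
        per_thread[tid] += 1;  // each thread writes only to its own slot
    }
    return per_thread;
}
```

For 50 iterations and 20 threads, static_split gives three iterations to ten threads and two to the other ten, which matches the drop to 1000% described above. With dynamic scheduling the counts returned by run_tasks vary from run to run, but every thread stays busy until no iterations are left to hand out.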