Tags: c, multithreading, performance, parallel-processing, openmp

OpenMP not speeding up parallel loop


I have the following embarrassingly parallel loop:

//#pragma omp parallel for
for (i = 0; i < tot; i++)
    pointer[i] = val;

Why does uncommenting the #pragma line cause performance to drop? I'm getting a slight increase in program run time when I use OpenMP to parallelize this for loop. Since each access is independent, shouldn't it greatly increase the speed of the program?

Is it possible that, if this for loop isn't run for large values of tot, the threading overhead is slowing things down?


Solution

  • Achieving performance with multiple threads in a shared-memory environment usually depends on:

    1. The task granularity;
    2. Load balance between parallel tasks;
    3. The number of parallel tasks relative to the number of cores used;
    4. The amount of synchronization among parallel tasks;
    5. Whether the algorithm is memory-, CPU-, or I/O-bound;
    6. The machine architecture.

    I will give a brief overview of each of the aforementioned points.

    1. You need to check whether the granularity of the parallel tasks is enough to overcome the overhead of the parallelization (e.g., thread creation and synchronization). The number of iterations of your loop, combined with a computation as cheap as pointer[i] = val;, may not be enough to justify the overhead of thread creation. It is worth noting, however, that too coarse a task granularity can also lead to problems, for instance load imbalance (see the timing sketch after this list);

    2. You have to test the load balance (the amount of work per thread). Ideally, each thread should compute the same amount of work. This is not problematic in your code example, but a scheduling sketch for the uneven case follows the list;

    3. Are you using hyper-threading, or running more threads than cores? If so, the threads will start "competing" for resources, and this can lead to a drop in performance (see the thread-count sketch after this list);

    4. Usually, one wants to reduce the amount of synchronization among threads. Consequently, one sometimes uses finer-grained synchronization mechanisms, or even data redundancy (among other approaches), to achieve that. Your code does not have this issue, but the reduction sketch after this list illustrates the idea;

    5. Before attempting to parallelize your code, you should analyze whether it is memory-bound, CPU-bound, and so on. If it is memory-bound, you may start by improving the cache usage before tackling the parallelization. A profiler is highly recommended for this task (a bandwidth sketch follows the list);

    6. To extract the most out of the underlying architecture, the multi-threaded approach needs to tackle the constraints of that architecture. For example, implementing an efficient multi-threaded approach for an SMP architecture is different from implementing one for a NUMA architecture, since in the latter one has to take memory affinity into account (see the first-touch sketch after this list).
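
    As a rough illustration of point 1, here is a minimal timing sketch; the array size and the 100000-iteration threshold in the if() clause are illustrative assumptions, not tuned values:

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        long tot = 100000000;              /* illustrative size */
        int val = 42;
        int *pointer = malloc(tot * sizeof *pointer);

        double t0 = omp_get_wtime();
        /* the if() clause keeps the loop serial when tot is small,
           so the fork/join overhead is only paid when it can pay off */
        #pragma omp parallel for if(tot > 100000)
        for (long i = 0; i < tot; i++)
            pointer[i] = val;
        printf("elapsed: %f s\n", omp_get_wtime() - t0);

        free(pointer);
        return 0;
    }
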
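    For point 2, a scheduling sketch for the uneven case; work() is a hypothetical function whose cost varies per iteration, so this does not apply to your constant-cost loop:

    #include <omp.h>

    double work(int i);   /* hypothetical, uneven per-iteration cost */

    void fill_uneven(double *result, int n)
    {
        /* schedule(dynamic, 64) hands out chunks of 64 iterations on
           demand, so threads that finish early pick up more work and
           the load evens out */
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < n; i++)
            result[i] = work(i);
    }
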
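    For point 3, a thread-count sketch; it assumes a 2-way SMT machine, where halving omp_get_num_procs() is a rough stand-in for "one thread per physical core":

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        /* omp_get_num_procs() reports logical processors; with 2-way
           hyper-threading that is typically twice the physical cores */
        int logical = omp_get_num_procs();

        /* assumption: one thread per physical core on a 2-way SMT box */
        omp_set_num_threads(logical > 1 ? logical / 2 : 1);

        #pragma omp parallel
        {
            #pragma omp single
            printf("running %d threads on %d logical processors\n",
                   omp_get_num_threads(), logical);
        }
        return 0;
    }
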
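    For point 4, a reduction sketch showing synchronization being replaced by data redundancy, here via OpenMP's built-in reduction clause:

    #include <omp.h>

    /* a reduction gives each thread a private copy of sum (data
       redundancy) and combines the copies once at the end, instead of
       serializing every update behind a critical section */
    double sum_array(const double *a, long n)
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }
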
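    For point 5, a bandwidth sketch that estimates the write throughput of your fill loop; the array size is illustrative:

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        long tot = 1L << 28;               /* ~1 GiB of ints, illustrative */
        int *pointer = malloc(tot * sizeof *pointer);

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < tot; i++)
            pointer[i] = 42;
        double t = omp_get_wtime() - t0;

        /* if this figure is already close to the machine's DRAM bandwidth
           with one thread, adding threads cannot make the loop faster */
        printf("%.2f GB/s\n", tot * sizeof(int) / t / 1e9);

        free(pointer);
        return 0;
    }
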
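    For point 6, a first-touch sketch for NUMA machines; it assumes the common Linux policy of placing each page on the node of the thread that first writes it:

    #include <omp.h>
    #include <stdlib.h>

    int main(void)
    {
        long tot = 100000000;
        int *pointer = malloc(tot * sizeof *pointer);

        /* first touch: initialize in parallel with the same static
           schedule as the compute loop, so each thread's pages are
           placed on its local NUMA node */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < tot; i++)
            pointer[i] = 0;

        /* the compute loop then accesses node-local memory */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < tot; i++)
            pointer[i] = 42;

        free(pointer);
        return 0;
    }
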

    EDIT: Suggestion from @Hristo Iliev

    1. Thread affinity: "Binding threads to cores improves performance in general and even more on NUMA systems since it improves data locality." A small inspection sketch follows.
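
    A sketch to check where threads actually run; sched_getcpu() is Linux-specific, and the binding itself can be requested externally, e.g. with OMP_PROC_BIND=close OMP_PLACES=cores (OpenMP 4.0):

    #define _GNU_SOURCE
    #include <omp.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        /* each thread reports the CPU it is currently running on;
           repeated runs show whether threads stay pinned or migrate */
        #pragma omp parallel
        printf("thread %d on cpu %d\n", omp_get_thread_num(), sched_getcpu());
        return 0;
    }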

    By the way, I recommend reading this Intel Guide for Developing Multithreaded Applications.