c++ multithreading parallel-processing mpi openmp

Why would omp_set_num_threads( omp_get_num_threads() ) change anything?


I've run across something odd. I am testing an MPI + OMP parallel code on a small local machine with only a single, humble 4-core i3. One of my loops, it turns out, is very slow with more than 1 OMP thread per process in this environment (more threads than cores).

#pragma omp parallel for
for ( int i = 0; i < HEIGHT; ++i ) 
{
    for ( int j = 0; j < WIDTH; ++j ) 
    {
        double a = 
           ( data[ sIdx * S_SZ + j + i * WIDTH ] - dMin ) / ( dMax - dMin );

        buff[ i ][ j ] = ( unsigned char ) ( 255.0 * a );
    }
}

If I run this code with the defaults (without setting OMP_NUM_THREADS or calling omp_set_num_threads), then it takes about 1 s. However, if I explicitly set the number of threads with either method (export OMP_NUM_THREADS=1 or omp_set_num_threads(1)), then it takes about 0.005 s (200X faster).

But it seems that omp_get_num_threads() returns 1 regardless. And in fact, if I just call omp_set_num_threads( omp_get_num_threads() ); then it takes about 0.005 s, whereas with that line commented out it takes about 1 s.

Any idea what is going on here? Why should calling omp_set_num_threads( omp_get_num_threads() ) once at the beginning of a program ever result in a 200X difference in performance?

Some context,

cpu:             Intel(R) Core(TM) i3-9100F CPU @ 3.60GHz
g++ --version:   g++ (GCC) 10.2.0
compiler flags:  mpic++ -std=c++11 -O3 -fpic -fopenmp ...
running program: mpirun -np 4 ./a.out

Solution

  • I've run across something odd. I am testing an MPI + OMP parallel code on a small local machine with only a single, humble 4-core i3. One of my loops, it turns out, is very slow with more than 1 OMP thread per process in this environment (more threads than cores).

    First, without any explicit binding of the OpenMP threads (within the MPI processes) to the cores, one cannot be sure on which cores those threads will end up. Naturally, more often than not, having multiple threads running on the same logical core will increase the overall execution time of the application being parallelized. You can solve this issue by either 1) disabling the binding with the MPI flag --bind-to none, so that threads can be assigned to different cores; or 2) binding the threads to the cores explicitly (see the example commands below). Check this SO thread on how to map the threads to cores in hybrid parallelizations such as MPI + OpenMP.
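    For instance, assuming Open MPI's mpirun (the flags below are Open MPI specific; other launchers use different options), the two alternatives could look like this:

    # 1) remove the default binding so threads may be scheduled on any core
    mpirun --bind-to none -np 4 ./a.out

    # 2) bind each process (and hence its threads) to its own core
    mpirun --bind-to core --map-by core -np 4 ./a.out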

    Notwithstanding, even if one had (let us say) each process mapped to a core, and 4 threads per core, assuming that each of those cores has two logical cores (i.e., hyper-threading), the overall execution time of the application would most likely be longer than when running it with 4 processes x 1 thread. In the current context, one could hope for a performance improvement maybe (at most) with 4 processes x 2 threads.

    But it seems that omp_get_num_threads() returns 1 regardless. And in fact, if I just do this omp_set_num_threads( omp_get_num_threads() );

    From the GCC libgomp documentation one can read:

    2.15 omp_get_num_threads – Size of the active team

    Description: Returns the number of threads in the current team. In a sequential section of the program omp_get_num_threads returns 1.

    Informally, if one calls omp_get_num_threads() outside a parallel region, one will get 1 as the number of threads, i.e., the initial thread.
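    A small, self-contained illustration of this behavior (nothing beyond standard OpenMP is assumed):

    #include <cstdio>
    #include <omp.h>

    int main()
    {
        // Sequential part: only the initial thread exists, so this prints 1.
        printf( "outside a parallel region: %d\n", omp_get_num_threads() );

        #pragma omp parallel
        {
            // Inside the parallel region this prints the size of the team,
            // e.g., 4 on a 4-core machine with default settings.
            #pragma omp single
            printf( "inside a parallel region: %d\n", omp_get_num_threads() );
        }
        return 0;
    }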

    Why should calling omp_set_num_threads( omp_get_num_threads() ) once at the beginning of a program ever result in a 200X difference in performance?

    The root cause of the problem is not the call omp_set_num_threads( omp_get_num_threads() ) per se, but rather the fact that threads are fighting for resources. Since that call is made in a sequential region, omp_get_num_threads() returns 1, so the statement is effectively omp_set_num_threads( 1 ); without it, each of the 4 MPI processes spawns the default number of OpenMP threads (typically one per core, i.e., 4), giving 16 threads competing for 4 cores. By explicitly setting the number of threads per process to 1, you ensured that the application ran with 1 thread per core, which consequently led to not having multiple threads within the same core competing for resources.
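    As a minimal sketch of that fix (only the MPI/OpenMP setup is shown; the parallel loop from the question is elided):

    #include <mpi.h>
    #include <omp.h>

    int main( int argc, char** argv )
    {
        MPI_Init( &argc, &argv );

        // With `mpirun -np 4` on a 4-core machine, one OpenMP thread per MPI
        // process keeps the total thread count equal to the core count.
        omp_set_num_threads( 1 );

        // ... the #pragma omp parallel for loop from the question goes here ...

        MPI_Finalize();
        return 0;
    }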