While trying to run a hybrid MPI/OpenMP application I realized that the number of OpenMP threads was always 1, even though I exported OMP_NUM_THREADS=36. I built a small C++ example showing the issue:
#include <vector>
#include <math.h>

int main()
{
    int n = 4000000, m = 1000;
    double x = 0, y = 0;
    double s = 0;
    std::vector<double> shifts(n, 0);

    #pragma omp parallel for reduction(+:x,y)
    for (int j = 0; j < n; j++) {
        double r = 0.0;
        for (int i = 0; i < m; i++) {
            double rand_g1 = cos(i / double(m));
            double rand_g2 = sin(i / double(m));
            x += rand_g1;
            y += rand_g2;
            r += sqrt(rand_g1 * rand_g1 + rand_g2 * rand_g2);
        }
        shifts[j] = r / m;
    }
}
I compile the code using g++:

g++ -fopenmp main.cpp
OMP_NUM_THREADS is still set to 36. When I run the code with just:
time ./a.out
I get a run-time of about 6 seconds, and htop shows the command using all 36 cores of my local node, as expected. When I run it with mpirun:
time mpirun -np 1 ./a.out
I get a run-time of 3m20s, and htop shows the command using only one core. I've also tried mpirun -np 1 -x OMP_NUM_THREADS=36 ./a.out, but the results were the same.
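For what it's worth, a quick way to compare the two cases is to print the CPU affinity mask and the OpenMP thread count the process actually sees. Here is a minimal, Linux-specific sketch (the file name check.cpp is just for illustration):

#include <sched.h>
#include <cstdio>
#include <omp.h>

int main()
{
    // CPUs the process is allowed to run on (its affinity mask).
    cpu_set_t mask;
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);
    std::printf("CPUs in affinity mask: %d\n", CPU_COUNT(&mask));

    // Number of threads an upcoming parallel region would use.
    std::printf("omp_get_max_threads(): %d\n", omp_get_max_threads());
}

Compiling it with g++ -fopenmp check.cpp and running it once directly and once under mpirun -np 1 shows whether the launcher is restricting the process to fewer cores.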
I am using GCC 9.2.0 and OpenMPI 4.1.0a1. Since this is a developer version, I've also tried with OpenMPI 4.0.3 with the same result.
Any idea what I am missing?
The default behavior of Open MPI is to bind an MPI task to a core (when there are two tasks or fewer) or to a socket (otherwise), so with -np 1 your task, and every OpenMP thread it spawns, is pinned to a single core.

So you really should run:
mpirun --bind-to none -np 1 ./a.out
so your MPI task can access all the cores of your host.
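If you want to double-check what mpirun did, its --report-bindings option prints the binding of each rank at startup, so you can compare the default behavior with the unbound run:

mpirun --report-bindings -np 1 ./a.out
mpirun --report-bindings --bind-to none -np 1 ./a.out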