While trying to run a hybrid MPI/OpenMP application I realized that the number of OpenMP threads was always 1, even though I exported OMP_NUM_THREADS=36. I built a small C++ example showing the issue:
#include <vector>
#include <math.h>

int main()
{
    int n = 4000000, m = 1000;
    double x = 0, y = 0;
    double s = 0;
    std::vector<double> shifts(n, 0);

    #pragma omp parallel for reduction(+:x,y)
    for (int j = 0; j < n; j++) {
        double r = 0.0;
        for (int i = 0; i < m; i++) {
            double rand_g1 = cos(i / double(m));
            double rand_g2 = sin(i / double(m));
            x += rand_g1;
            y += rand_g2;
            r += sqrt(rand_g1 * rand_g1 + rand_g2 * rand_g2);
        }
        shifts[j] = r / m;
    }
}
I compile the code using g++:

g++ -fopenmp main.cpp
OMP_NUM_THREADS is still set to 36. When I run the code with just:
time ./a.out
I get a run-time of about 6 seconds, and htop shows the command using all 36 cores of my local node, as expected. When I run it with mpirun:
time mpirun -np 1 ./a.out
I get a run-time of 3m20s, and htop shows the command using only one core. I've also tried mpirun -np 1 -x OMP_NUM_THREADS=36 ./a.out, but the results were the same.
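For what it's worth, a quick way to compare the two cases is to print the CPU affinity mask and the OpenMP thread count the process actually sees. Here is a minimal, Linux-specific sketch (the file name check.cpp is just for illustration):

#include <sched.h>
#include <cstdio>
#include <omp.h>

int main()
{
    // CPUs the process is allowed to run on (its affinity mask).
    cpu_set_t mask;
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);
    std::printf("CPUs in affinity mask: %d\n", CPU_COUNT(&mask));

    // Number of threads an upcoming parallel region would use.
    std::printf("omp_get_max_threads(): %d\n", omp_get_max_threads());
}

Compiling it with g++ -fopenmp check.cpp and running it once directly and once under mpirun -np 1 shows whether the launcher is restricting the process to fewer cores.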
I am using GCC 9.2.0 and OpenMPI 4.1.0a1. Since this is a developer version, I've also tried with OpenMPI 4.0.3 with the same result.
Any idea what I am missing?
The default behavior of Open MPI is to bind an MPI task to a core (when there are two tasks or fewer) or to a socket (otherwise), so with -np 1 your task, and every OpenMP thread it spawns, is pinned to a single core.

So you really should run:
mpirun --bind-to none -np 1 ./a.out
so your MPI task can access all the cores of your host.
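If you want to double-check what mpirun did, its --report-bindings option prints the binding of each rank at startup, so you can compare the default behavior with the unbound run:

mpirun --report-bindings -np 1 ./a.out
mpirun --report-bindings --bind-to none -np 1 ./a.out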