I am new to Eigen and am writing some simple code to test its performance. I am using a MacBook Pro with an M1 Pro chip (I do not know whether the ARM architecture causes the problem). The code is a simple Laplace equation solver:
#include <iostream>
#include "mpi.h"
#include "Eigen/Dense"
#include <chrono>

using namespace Eigen;
using namespace std;

const size_t num = 1000UL;

MatrixXd initialize() {
    MatrixXd u = MatrixXd::Zero(num, num);
    // Interior points start at 10; the boundary stays at 0.
    u(seq(1, fix<num-2>), seq(1, fix<num-2>)).setConstant(10);
    return u;
}

void laplace(MatrixXd &u) {
    setNbThreads(8);
    MatrixXd u_old = u;
    // Weighted 9-point average over each interior point's neighbours.
    u(seq(1, last-1), seq(1, last-1)) =
        (( u_old(seq(0,last-2,fix<1>),seq(1,last-1,fix<1>)) + u_old(seq(2,last,fix<1>),seq(1,last-1,fix<1>)) +
           u_old(seq(1,last-1,fix<1>),seq(0,last-2,fix<1>)) + u_old(seq(1,last-1,fix<1>),seq(2,last,fix<1>)) ) * 4.0 +
           u_old(seq(0,last-2,fix<1>),seq(0,last-2,fix<1>)) + u_old(seq(0,last-2,fix<1>),seq(2,last,fix<1>)) +
           u_old(seq(2,last,fix<1>),seq(0,last-2,fix<1>)) + u_old(seq(2,last,fix<1>),seq(2,last,fix<1>)) ) / 20.0;
}

int main(int argc, const char * argv[]) {
    initParallel();
    setNbThreads(0);
    cout << nbThreads() << endl;
    MatrixXd u = initialize();
    auto start = std::chrono::high_resolution_clock::now();
    for (auto i = 0UL; i < 100; i++) {
        laplace(u);
    }
    auto stop = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    // cout << u(seq(0, fix<10>), seq(0, fix<10>)) << endl;
    cout << "Execution time (ms): " << duration.count() << endl;
    return 0;
}
Compile with gcc and enable OpenMPI:
james@MBP14 tests % g++-11 -fopenmp -O3 -I/usr/local/include -I/opt/homebrew/Cellar/open-mpi/4.1.3/include -o test4 test.cpp
Running the binary directly:
james@MBP14 tests % ./test4
8
Execution time (ms): 273
Running with mpirun and specifying 8 threads:
james@MBP14 tests % mpirun -np 8 test4
8
8
8
8
8
8
8
8
Execution time (ms): 348
Execution time (ms): 347
Execution time (ms): 353
Execution time (ms): 356
Execution time (ms): 350
Execution time (ms): 353
Execution time (ms): 357
Execution time (ms): 355
So obviously the matrix operation is not running in parallel; instead, every thread is running the same copy of the code. What should be done to solve this problem? Do I have some misunderstanding about how to use OpenMPI?
You are confusing OpenMPI with OpenMP.
-fopenmp enables OpenMP. It is one way to parallelize an application, by using special #pragma omp statements in the code. The parallelization happens on a single CPU (or, to be precise, a single compute node, in case the compute node has multiple CPUs). This allows you to employ all cores of that CPU. OpenMP cannot be used to parallelize an application over multiple compute nodes.

To use MPI, you need to call "special" functions and do the hard work of distributing data yourself. If you do not, calling an application with mpirun simply creates several identical processes (not threads!) that perform exactly the same computation. You have not parallelized your application; you have just executed it 8 times.
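To make the distinction concrete, here is a minimal OpenMP sketch (a toy vector addition with made-up names, not your Laplace solver). Compiled with g++ -fopenmp, the loop iterations are spread across the cores of one machine:

#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 1000000;
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n);

    // OpenMP splits these iterations across the threads of a
    // single process; no explicit data distribution is needed
    // because all threads share the same memory.
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];

    std::printf("threads available: %d, c[0] = %f\n",
                omp_get_max_threads(), c[0]);
    return 0;
}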
There are no compiler flags that enable MPI. MPI is not built into any compiler. Rather, MPI is a standard, and OpenMPI is one specific library that implements that standard. You should read a tutorial or book about MPI and OpenMPI (Google turned up this one, for example).
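And here is a minimal MPI sketch (a toy reduction, not the solver) showing what "distributing the work yourself" means: each process explicitly computes only its own slice of the problem, and the partial results are combined with an MPI call:

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each process works on its own slice of the index range.
    // Without this rank-based split, every process would simply
    // redo the whole computation.
    const long n = 1000000;
    const long begin = rank * n / size;
    const long end   = (rank + 1) * n / size;

    double local = 0.0;
    for (long i = begin; i < end; i++)
        local += 1.0 / (i + 1);

    // Combine the partial sums on rank 0.
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("sum = %f (computed by %d processes)\n", total, size);

    MPI_Finalize();
    return 0;
}

Launched with mpirun -np 8, the 8 processes each handle one eighth of the range. Remove the rank-based slicing, and all 8 would perform the same full computation, which is exactly what happened with your binary.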
Note: Usually, MPI libraries such as OpenMPI ship with executables/scripts (e.g. mpicc) that behave like compilers. But they are just thin wrappers around compilers such as gcc. These wrappers automatically pass the actual compiler the include directories and the libraries to link against. But again, the compilers themselves do not know anything about MPI.
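For example, with the Open MPI installation from the question, the compile step would look something like this (exact wrapper names and paths depend on the installation; --showme is Open MPI's option for printing the underlying compiler command, which makes the "thin wrapper" point visible):

mpic++ -O3 -o test4 test.cpp
mpic++ --showme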