Tags: c++, parallel-processing, eigen, apple-m1, openmpi

Run Eigen Parallel with OpenMPI


I am new to Eigen and am writing some simple code to test its performance. I am using a MacBook Pro with an M1 Pro chip (I do not know whether the ARM architecture causes the problem). The code is a simple Laplace equation solver:

#include <iostream>
#include "mpi.h"
#include "Eigen/Dense"
#include <chrono>
 
using namespace Eigen;
using namespace std;

const size_t num = 1000UL;

MatrixXd initialize(){
    MatrixXd u = MatrixXd::Zero(num, num);
    u(seq(1, fix<num-2>), seq(1, fix<num-2>)).setConstant(10);
    return u;
}

void laplace(MatrixXd &u){
    setNbThreads(8);
    MatrixXd u_old = u;

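    // Jacobi-style update: each interior cell becomes a weighted average of its
    // 8 neighbours (edge neighbours weighted 4, diagonal neighbours weighted 1, weights sum to 20).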
    u(seq(1,last-1),seq(1,last-1)) =
    ((  u_old(seq(0,last-2,fix<1>),seq(1,last-1,fix<1>)) + u_old(seq(2,last,fix<1>),seq(1,last-1,fix<1>)) +
        u_old(seq(1,last-1,fix<1>),seq(0,last-2,fix<1>)) + u_old(seq(1,last-1,fix<1>),seq(2,last,fix<1>)) )*4.0 +
        u_old(seq(0,last-2,fix<1>),seq(0,last-2,fix<1>)) + u_old(seq(0,last-2,fix<1>),seq(2,last,fix<1>)) +
        u_old(seq(2,last,fix<1>),seq(0,last-2,fix<1>))   + u_old(seq(2,last,fix<1>),seq(2,last,fix<1>)) ) /20.0;
}


int main(int argc, const char * argv[]) {
    initParallel();
    setNbThreads(0);
    cout << nbThreads() << endl;
    MatrixXd u = initialize();
    
    auto start  = std::chrono::high_resolution_clock::now();
    
    for (auto i=0UL; i<100; i++) {
        laplace(u);
    }
    
    auto stop  = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    
    // cout << u(seq(0, fix<10>), seq(0, fix<10>)) << endl;
    cout << "Execution time (ms): " << duration.count() << endl;
    return 0;
}

Compile with gcc and enable OpenMPI

james@MBP14 tests % g++-11 -fopenmp  -O3 -I/usr/local/include -I/opt/homebrew/Cellar/open-mpi/4.1.3/include -o test4 test.cpp

Run the binary directly

james@MBP14 tests % ./test4
8
Execution time (ms): 273

Run with mpirun and specify 8 threads

james@MBP14 tests % mpirun -np 8 test4
8
8
8
8
8
8
8
8
Execution time (ms): 348
Execution time (ms): 347
Execution time (ms): 353
Execution time (ms): 356
Execution time (ms): 350
Execution time (ms): 353
Execution time (ms): 357
Execution time (ms): 355

So obviously the matrix operation is not running in parallel; instead, every thread is running the same copy of the code.

What should be done to solve this problem? Do I have some misunderstanding about using OpenMPI?


Solution

  • You are confusing OpenMPI with OpenMP.

    • The gcc flag -fopenmp enables OpenMP. It is one way to parallelize an application, by using special #pragma omp statements in the code (see the sketch after this list). The parallelization happens on a single CPU (or, to be precise, on a single compute node, in case that node has multiple CPUs). This makes it possible to employ all cores of that CPU, but OpenMP cannot be used to parallelize an application over multiple compute nodes.
    • On the other hand, MPI (of which OpenMPI is one particular implementation) can be used to parallelize code over multiple compute nodes (i.e., roughly speaking, over multiple computers that are connected by a network). It can also be used to parallelize code over multiple cores on a single computer. So MPI is more general, but also much more difficult to use.
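
    Here is a minimal OpenMP sketch (deliberately not your Laplace solver, just a plain loop, and the file name is only illustrative) to show the programming model: a single pragma is enough, and all threads live inside one process on one machine. It compiles with the same -fopenmp flag you already use (e.g. g++-11 -fopenmp -O3 omp_example.cpp):

    // omp_example.cpp -- minimal OpenMP sketch
    #include <cstdio>
    #include <vector>
    #include <omp.h>

    int main() {
        const int n = 1000000;
        std::vector<double> a(n, 1.0), b(n, 2.0), c(n);

        // The pragma splits the loop iterations among the threads of this one process.
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            c[i] = a[i] + b[i];
        }

        std::printf("max threads: %d, c[0] = %f\n", omp_get_max_threads(), c[0]);
        return 0;
    }

    Run directly (no mpirun), this prints a single line, and the loop is spread over all available cores.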

    To use MPI, you need to call "special" functions and do the hard work of distributing the data yourself. If you do not do this, calling an application with mpirun simply creates several identical processes (not threads!) that perform exactly the same computation. You have not parallelized your application; you have just executed it 8 times.
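
    To make the contrast concrete, here is a minimal sketch of what MPI code looks like (again not a parallel version of your Laplace solver; the file name mpi_example.cpp is just for illustration). Every process started by mpirun asks for its rank, works only on its own slice of the problem, and the partial results are combined explicitly:

    // mpi_example.cpp -- minimal MPI sketch: each mpirun process computes part of a sum.
    #include <cstdio>
    #include "mpi.h"

    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // which process am I?
        MPI_Comm_size(MPI_COMM_WORLD, &size);   // how many processes did mpirun start?

        const long n = 100000000L;
        const long chunk = n / size;
        const long begin = rank * chunk;
        const long end   = (rank == size - 1) ? n : begin + chunk;

        double local = 0.0;
        for (long i = begin; i < end; ++i)      // each rank sums only its own slice
            local += 1.0 / (double)(i + 1);

        double total = 0.0;                     // combine the partial sums on rank 0
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            std::printf("processes: %d, sum = %f\n", size, total);

        MPI_Finalize();
        return 0;
    }

    Launched with mpirun -np 8, this prints one line because only rank 0 reports. Without the rank logic, all 8 processes would redo the same full computation, which is exactly what happened with your solver.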

    There are no compiler flags that enable MPI. MPI is not built into any compiler. Rather, MPI is a standard, and OpenMPI is one specific library that implements that standard. You should read a tutorial or book about MPI and OpenMPI (Google turned up this one, for example).

    Note: Usually, MPI libraries such as OpenMPI ship with executables/scripts (e.g. mpicc) that behave like compilers. But they are just thin wrappers around compilers such as gcc. These wrappers are used to automatically tell the actual compiler the include directories and libraries to link with. But again, the compilers themselves do not know anything about MPI.
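
    For example, with an Open MPI installation like the Homebrew one from your question, compiling and running the MPI sketch above would look roughly like this (wrapper names, paths, and versions can differ between installations); Open MPI's wrappers also accept a --showme option that prints the underlying compiler command they would run:

    mpic++ -O3 -o mpi_example mpi_example.cpp
    mpirun -np 8 mpi_example
    mpic++ --showme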