OpenMP with OpenMPI

I have an MPI application that currently has one process (call it A) which is causing serious problems for scalability. Currently, all the other processes are sitting in an MPI_Recv waiting for that one process to send them information.

Since I want to speed this up now with as little effort as possible, I was thinking about using OpenMP to parallelize process A. Is this practical?

Since the other processes sharing a node with A are in an MPI_Recv, can I utilize all the resources from that node to work on process A, or will the MPI_Recv prevent that?

The other benefit of using OpenMP is that memory can be shared, since process A uses a lot of it.

By the way, does it change anything if my processes are waiting in an MPI_Send instead of an MPI_Recv?

Solution

Yes, it is possible to combine OpenMP, which parallelizes a process locally, with MPI (OpenMPI in your case), which distributes the work between processes (i.e. OpenMPI across nodes and OpenMP within a node). This approach is known as hybrid programming with OpenMP and MPI; if you google for that term you will find several useful links.
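As a rough sketch of the OpenMP side, the hot loop of process A could be shared among threads like this. The names process_a_work and expensive_step are hypothetical placeholders, not taken from your code, and the sketch assumes the loop iterations are independent of each other.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the expensive per-element work in process A. */
static double expensive_step(double x)
{
    return x * x;
}

/* Sum expensive_step() over data[0..n-1].
   The loop iterations are split among the OpenMP threads of this process,
   and each thread's partial sum is combined by the reduction clause. */
double process_a_work(const double *data, size_t n)
{
    assert(data != NULL || n == 0);

    double total = 0.0;
    long count = (long)n; /* signed index keeps older OpenMP compilers happy */

    #pragma omp parallel for reduction(+:total) schedule(static)
    for (long i = 0; i < count; ++i) {
        total += expensive_step(data[i]);
    }
    return total;
}
```

With GCC this is built with -fopenmp; without that flag the pragma is ignored and the loop simply runs serially. In a hybrid program where only the main thread makes MPI calls, MPI would typically be initialized with MPI_Init_thread requesting MPI_THREAD_FUNNELED instead of plain MPI_Init.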

MPI_Send and MPI_Recv are blocking calls (for detailed information you can check this post In message passing (MPI) mpi_send and recv “what waits”). MPI_Recv returns only once the message has arrived, so processes blocked in MPI_Recv stay there waiting for data; MPI_Send returns once the send buffer can be reused, which may be before the receiver actually has the data. However, you can use the nonblocking counterparts MPI_Isend and MPI_Irecv for performance, at the cost of careful buffer handling and possible race conditions, since a buffer must not be touched until its request has completed. An example and further information can be found here.

In my opinion you have two choices:

Distribute your workload evenly using OpenMPI and then use OpenMP to parallelize it locally: OpenMPI hands a part of the work to each node, and within each node OpenMP assigns tasks to the individual cores, so every node can take advantage of its local architecture;
Rewrite your program to use the asynchronous methods so that the other processes can help process A with its computations when necessary.

I hope this helps.