Tags: c++, mpi, workload

MPI blocks execution after send when different workloads are associated with the processors


I am having problems with an MPI program (written by me as a test for another program in which different workloads are assigned to different processors). The problem is that when I use a number of processors other than 1 or arraySize (4 in this case), the program blocks inside MPI_Send; in particular, when I run mpirun -np 2 MPItest the program blocks during that call. I am not using a debugger for now; I just want to understand why it works with 1 and 4 processors but not with 2 processors (2 spots of the array per processor). The code is below:

#include <mpi.h>
#include <iostream>

int main(int argc, char** argv) {
    int rank, size;
    const int arraySize = 4;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // every processor has a different workload (1 or more spots of the array to send to the other processors)
    // every processor sends to every other processor its designated spots


    int* sendbuf = new int[arraySize];
    int* recvbuf = new int[arraySize];

    int istart = arraySize/size * rank;
    int istop = (rank == size) ? arraySize : istart + arraySize/size;

    for (int i = istart; i < istop; i++) {
        sendbuf[i] = i;
    }

    std::cout << "Rank " << rank << " sendbuf :" << std::endl;
    //print the sendbuf before receiving its other values
    for (int i = 0; i < arraySize; i++) {
        std::cout << sendbuf[i] << ", ";
    }
    std::cout << std::endl;

    // sending designated spots of sendbuf to other processors
    for(int i = istart; i < istop; i++){
        for(int j = 0; j < size; j++){
            MPI_Send(&sendbuf[i], 1, MPI_INT, j, i, MPI_COMM_WORLD);
        }
    }

    // receiving the full array
    for(int i = 0; i < arraySize ; i++){
        int recvRank = i/(arraySize/size);
        MPI_Recv(&recvbuf[i], 1, MPI_INT, recvRank, i, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }


    // print the recvbuf after receiving its other values
    std::cout << "Rank " << rank << " recvbuf :" << std::endl;
    for (int i = 0; i < arraySize; i++) {
        std::cout << recvbuf[i] << ", ";
    }
    std::cout << std::endl;

    delete[] sendbuf;
    delete[] recvbuf;

    MPI_Finalize();
    return 0;
}

I am using the tags to differentiate between different spots in the array (maybe that is the problem?)

I tried using different numbers of processors: with 1 processor the program works, and with 4 processors it also works; with 3 processors it crashes, and with 2 processors it blocks. I also tried using MPI_Isend, but that does not work either (the flag returned by MPI_Test is 0). The modified code with MPI_Isend is below:

#include <mpi.h>
#include <iostream>

int main(int argc, char** argv) {
    int rank, size;
    const int arraySize = 4;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // every processor has a different workload (1 or more spots of the array to send to the other processors)
    // every processor sends to every other processor its designated spots


    int* sendbuf = new int[arraySize];
    int* recvbuf = new int[arraySize];

    int istart = arraySize/size * rank;
    int istop = (rank == size) ? arraySize : istart + arraySize/size;

    for (int i = istart; i < istop; i++) {
        sendbuf[i] = i;
    }

    std::cout << "Rank " << rank << " sendbuf :" << std::endl;
    //print the sendbuf before receiving its other values
    for (int i = 0; i < arraySize; i++) {
        std::cout << sendbuf[i] << ", ";
    }
    std::cout << std::endl;

    // sending designated spots of sendbuf to other processors
    for(int i = istart; i < istop; i++){
        for(int j = 0; j < size; j++){
            MPI_Request request;
            //MPI_Send(&sendbuf[i], 1, MPI_INT, j, i, MPI_COMM_WORLD);
            MPI_Isend(&sendbuf[i], 1, MPI_INT, j, i, MPI_COMM_WORLD, &request);
            // check whether the send has completed
            int flag = 0;
            MPI_Test(&request, &flag, MPI_STATUS_IGNORE);
            const int numberOfRetries = 10;
            if(flag == 0){ // operation not completed
                std::cerr << "Error in sending, waiting" << std::endl;
                for(int k = 0; k < numberOfRetries; k++){
                    MPI_Test(&request, &flag, MPI_STATUS_IGNORE);
                    if(flag == 1){
                        break;
                    }
                }
                if(flag == 0){
                    std::cerr << "Error in sending, aborting" << std::endl;
                    MPI_Abort(MPI_COMM_WORLD, 1);
                }
                
            }
        }
    }

    // receiving the full array
    for(int i = 0; i < arraySize ; i++){
        int recvRank = i/(arraySize/size);
        MPI_Recv(&recvbuf[i], 1, MPI_INT, recvRank, i, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }


    // print the recvbuf after receiving its other values
    std::cout << "Rank " << rank << " recvbuf :" << std::endl;
    for (int i = 0; i < arraySize; i++) {
        std::cout << recvbuf[i] << ", ";
    }
    std::cout << std::endl;

  
    //MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    delete[] sendbuf;
    delete[] recvbuf;

    MPI_Finalize();
    return 0;
}

With this code, even -np 4 does not work anymore.
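
For completeness, the result I am trying to obtain (every rank ending up with the full array) can also be expressed with a single collective call, which avoids the tags and the manual send/receive matching entirely. The following is only a sketch of that idea, assuming arraySize is divisible by the number of processes (it is not the code I am debugging):

#include <mpi.h>
#include <iostream>

int main(int argc, char** argv) {
    int rank, size;
    const int arraySize = 4;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // number of spots owned by each rank (assumes arraySize % size == 0)
    const int block = arraySize / size;
    int* sendbuf = new int[block];
    int* recvbuf = new int[arraySize];

    // fill only this rank's designated spots
    for (int i = 0; i < block; i++) {
        sendbuf[i] = rank * block + i;
    }

    // every rank contributes its block and receives all blocks, ordered by rank
    MPI_Allgather(sendbuf, block, MPI_INT, recvbuf, block, MPI_INT, MPI_COMM_WORLD);

    std::cout << "Rank " << rank << " recvbuf :" << std::endl;
    for (int i = 0; i < arraySize; i++) {
        std::cout << recvbuf[i] << ", ";
    }
    std::cout << std::endl;

    delete[] sendbuf;
    delete[] recvbuf;
    MPI_Finalize();
    return 0;
}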


Solution

  • Since I have not received any answer to the problem yet, I want to add some insight into it, in case it helps people who find themselves in the same situation.

    I tested another program to check whether the Open MPI installation on my laptop worked correctly, since I was running into too many problems that were not violations of the standard, and even example code from the internet would not run on my laptop. I tested the following code, a very simple program that sends part of an array between two processes:

    #include <mpi.h>
    #include <iostream>
    
    int main(int argc, char** argv) {
        int rank, size;
        const int arraySize = 5;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
    
        // initialize sendbuf
        int* sendbuf = new int[arraySize];
        for(int iteration = 0; iteration < 3; iteration++){
    
            if(rank){
                std::cout << "Rank " << rank << " sendbuf :" << std::endl;
                for (int i = 0; i < arraySize; i++) {
                    std::cout << sendbuf[i] << ", ";
                }
                std::cout << std::endl;
            }
    
            // first process send first three elements to second process
            if(rank == 0){
                for(int i = 0; i < 3; i++){
                    sendbuf[i] = i;
                }
                MPI_Send(&sendbuf[0], 3, MPI_INT, 1, 0, MPI_COMM_WORLD);
            } else {
                for(int i = 3; i < 5; i++){
                    sendbuf[i] = i;
                }
            }
    
            // receive the missing part of the array
            if(rank){
                // second process receive the first three elements from first process
                MPI_Recv(&sendbuf[0], 3, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
    
            // print the full array
            if(rank){
                std::cout << "Rank " << rank << " sendbuf after:" << std::endl;
                for (int i = 0; i < arraySize; i++) {
                    std::cout << sendbuf[i] << ", ";
                }
                std::cout << std::endl;
            }
    
            // reset the buffer for the next iteration
            for(int i = 0; i < arraySize; i++){
                sendbuf[i] = -1;
            }
            
        }
    
        delete[] sendbuf;

        MPI_Finalize();
        return 0;
    }
    
    

    I wanted to see whether a single send and a single receive inside a loop would work on my laptop, and to my surprise (after two days of trying everything), it turned out to be a problem with my laptop and its Open MPI installation. I tested this code on a cluster I have access to, where the MPI implementation is known to work, to see whether the problem was in my hardware or not. The code works on the cluster, but not on my laptop.

    To conclude, this is the hardware and software I have:

    • Kernel: 6.6.1-arch1-1
    • arch: x86_64
    • bits: 64
    • compiler: gcc
    • model: Lenovo Legion 7 16IAX7
    • processor: 12th Gen Intel(R) Core(TM) i7-12800HX
    • OpenMPI version: 4.1.5-5

    This is not a real solution, but it answers my question of why the code was not working.

    As pointed out by @GillesGouaillardet, it seems it was a problem with the default network interface used by mpirun; specifying a network interface that has no firewall rules on it seems to be the solution.
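
    For reference, with Open MPI the interface used by mpirun can be selected through an MCA parameter. As an illustration only (lo is just an example here, and btl_tcp_if_include assumes the TCP transport is the one in use), a run restricted to the loopback interface would look something like:

    mpirun --mca btl_tcp_if_include lo -np 2 MPItest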