Tags: c++, c, ipc, mpi, openmpi

Huge difference in MPI_Wtime() after using MPI_Barrier()?


This is the relevant part of the code.

    if(rank==0) {   
        temp=10000; 
        var=new char[temp] ;
        MPI_Send(&temp,1,MPI_INT,1,tag,MPI_COMM_WORLD); 
        MPI_Send(var,temp,MPI_BYTE,1,tag,MPI_COMM_WORLD);
            //MPI_Wait(&req[0],&sta[1]);
    }
    if(rank==1) {
        MPI_Irecv(&temp,1,MPI_INT,0,tag,MPI_COMM_WORLD,&req[0]);
        MPI_Wait(&req[0],&sta[0]);
        var=new char[temp] ;
        MPI_Irecv(var,temp,MPI_BYTE,0,tag,MPI_COMM_WORLD,&req[1]);
        MPI_Wait(&req[0],&sta[0]);
    }
    //I am talking about this MPI_Barrier


    MPI_Barrier(MPI_COMM_WORLD);
    cout << MPI_Wtime()-t1 << endl ;
    cout << "hello " << rank  << " " << temp << endl ;
    MPI_Finalize();
}

1. When using MPI_Barrier() - as expected, all the processes take almost the same amount of time, on the order of 0.02 seconds.

2. When not using MPI_Barrier() - the root process (sending the message) waits for some extra time, the value of (MPI_Wtime() - t1) varies a lot, and the time taken by the root process is on the order of 2 seconds.

If I am not mistaken, MPI_Barrier is only used to bring all the running processes to the same point. So when I use MPI_Barrier(), why isn't the measured time 2 seconds (the time taken by the root process)? Please explain.


Solution

  • Thanks to Wesley Bland for noticing that you are waiting twice on the same request. Here is an explanation of what actually happens.

    There is something called progression of asynchronous (non-blocking) operations in MPI. That is when the actual transfer happens. Progression could happen in many different ways and at many different points within the MPI library. When you post an asynchronous operation, its progression could be deferred indefinitely, even until the point that one calls MPI_Wait, MPI_Test or some call that would result in new messages being pushed to or pulled from the transmit/receive queue. That's why it is very important to call MPI_Wait or MPI_Test as quickly as possible after the initiation of a non-blocking operation.
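    As an illustration of the point above, here is a minimal, hypothetical sketch (receive_while_working and the commented-out do_useful_work() are placeholders, not part of the question's code) of how a non-blocking receive can be polled with MPI_Test while other work is done, so that the library gets regular opportunities to progress the transfer:

        #include <mpi.h>

        // Poll the request with MPI_Test between chunks of work so the MPI
        // library gets regular chances to progress the pending transfer.
        void receive_while_working(char* buf, int count, int source, int tag)
        {
            MPI_Request req;
            MPI_Irecv(buf, count, MPI_BYTE, source, tag, MPI_COMM_WORLD, &req);

            int done = 0;
            while (!done) {
                // do_useful_work();  // placeholder for the application's computation
                MPI_Test(&req, &done, MPI_STATUS_IGNORE);  // also drives progression
            }
        }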

    Open MPI supports a background progression thread that takes care of progressing the operations even if the condition in the previous paragraph is not met, e.g. if MPI_Wait or MPI_Test is never called on the request handle. This has to be explicitly enabled when the library is built; it is not enabled by default since background progression increases the latency of the operations.

    What happens in your case is that you are waiting on the incorrect request the second time you call MPI_Wait in the receiver, therefore the progression of the second MPI_Irecv operation is postponed. The message is more than 40 KiB in size (10000 times 4 bytes + envelope overhead) which is above the default eager limit in Open MPI (32 KiB). Such messages are sent using the rendezvous protocol that requires both the send and the receive operations to be posted and progressed. The receive operation doesn't get progressed and hence the send operation in rank 0 blocks until at some point in time the clean-up routines that MPI_Finalize in rank 1 calls eventually progress the receive.
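    The immediate fix is therefore to wait on the second request. A minimal sketch of the corrected receiver, reusing the variables from the question:

        if (rank == 1) {
            MPI_Irecv(&temp, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &req[0]);
            MPI_Wait(&req[0], &sta[0]);
            var = new char[temp];
            MPI_Irecv(var, temp, MPI_BYTE, 0, tag, MPI_COMM_WORLD, &req[1]);
            MPI_Wait(&req[1], &sta[1]);  // wait on req[1], not req[0]
        }

    With the correct request handle the second receive is progressed immediately, the rendezvous handshake completes, and the send in rank 0 no longer blocks until MPI_Finalize.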

    When you put the call to MPI_Barrier, it leads to the progression of the outstanding receive, acting almost like an implicit call to MPI_Wait. That's why the send in rank 0 completes quickly and both processes move on in time.

    Note that MPI_Irecv immediately followed by MPI_Wait is equivalent to simply calling MPI_Recv. The latter is not only simpler, but also less prone to simple typos like the one you've made, as shown below.
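    For reference, the receiver written with blocking receives (again reusing the question's variables):

        if (rank == 1) {
            MPI_Recv(&temp, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &sta[0]);
            var = new char[temp];
            MPI_Recv(var, temp, MPI_BYTE, 0, tag, MPI_COMM_WORLD, &sta[1]);
        }

    There are no request handles to mix up, and the code expresses the same synchronous behaviour directly.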