c++ · ipc · mpi · clock · openmpi

Behavior of MPI_Barrier()?


As I understand it, MPI_Barrier() is used to bring all processes to the same point. I need to find the overall processing time of an OpenMPI program (the time at which all processes have finished), so I thought that putting an MPI_Barrier() at the end and then printing MPI_Wtime()-t would give the time at which all processes finished:

        MPI_stuff; // whatever I want my program to do
        MPI_Barrier(MPI_COMM_WORLD);
        cout << "final time ::: :: " << MPI_Wtime() - t << " " << rank << endl;
        MPI_Finalize();

But the time I get when I use MPI_Barrier() is very different from the individual per-process MPI_Wtime()-t values.


Solution

  • It is very easy for MPI processes to become desynchronised in time, especially if the algorithms involved in MPI_stuff are not globally synchronous. With most cluster MPI implementations it is very typical for processes to be quite desynchronised from the very beginning, due to different start-up times and the fact that MPI_Init() can take a varying amount of time. Another source of desynchronisation is OS noise, i.e. other processes occasionally sharing CPU time with some of the processes in the MPI job.

    That's why the correct way to measure the execution time of a parallel algorithm is to put a barrier before and after the measured block:

    MPI_Barrier(MPI_COMM_WORLD); // Bring all processes in sync
    t = -MPI_Wtime();
    MPI_stuff;
    MPI_Barrier(MPI_COMM_WORLD); // Wait for all processes to finish processing
    t += MPI_Wtime();
    
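    Put together, a minimal self-contained sketch of this pattern might look as follows. The MPI_stuff placeholder is replaced by a dummy, deliberately rank-dependent workload, and an MPI_Reduce (an addition, not part of the original snippet) reports the maximum elapsed time across all ranks:

    ```cpp
    #include <mpi.h>
    #include <iostream>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD); // bring all processes in sync
        double t = -MPI_Wtime();

        // MPI_stuff placeholder: dummy workload that differs per rank
        volatile double x = 0.0;
        for (long i = 0; i < 1000000L * (rank + 1); i++)
            x += 1.0 / (double)(i + 1);

        MPI_Barrier(MPI_COMM_WORLD); // wait for all processes to finish
        t += MPI_Wtime();

        // Reduce to the maximum elapsed time over all ranks; rank 0 prints it
        double tmax = 0.0;
        MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            std::cout << "total time: " << tmax << " s" << std::endl;

        MPI_Finalize();
        return 0;
    }
    ```

    With the two barriers in place, t is nearly identical on every rank anyway, so the reduction is mostly belt-and-braces; without the barriers it would be essential, since each rank would otherwise report only its own elapsed time.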

    If the first MPI_Barrier is missing and MPI_stuff does not synchronise the different processes, it could happen that some of them arrive at the next barrier very early while others arrive very late, and then the early ones have to wait for the late ones.

    Also note that MPI_Barrier gives no guarantee that all processes exit the barrier at the same time. It only guarantees that there is a point in time when the execution flow in all processes is inside the MPI_Barrier call. Everything else is implementation dependent. On some platforms, notably the IBM Blue Gene, global barriers are implemented using a special interrupt network and there MPI_Barrier achieves almost cycle-perfect synchronisation. On clusters barriers are implemented with message passing and therefore barrier exit times might vary a lot.
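    One way to observe this on a given system (a hypothetical measurement sketch, not part of the answer above) is to timestamp MPI_Wtime() on every rank immediately after a barrier and compare the extremes. Note that comparing timestamps across ranks only makes sense if the MPI_Wtime() clocks are globally synchronised (see the MPI_WTIME_IS_GLOBAL attribute), which is not guaranteed:

    ```cpp
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double texit = MPI_Wtime(); // timestamp right after leaving the barrier

        // The spread between the earliest and latest exit timestamp is the
        // barrier exit skew. Meaningful only if clocks are synchronised
        // across ranks (MPI_WTIME_IS_GLOBAL), which many platforms lack.
        double tmin = 0.0, tmax = 0.0;
        MPI_Reduce(&texit, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
        MPI_Reduce(&texit, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            std::printf("barrier exit skew: %g s\n", tmax - tmin);

        MPI_Finalize();
        return 0;
    }
    ```

    On an interrupt-driven barrier network the printed skew would be tiny; on a message-passing cluster it can be orders of magnitude larger.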