mpi time-complexity complexity-theory openmpi hpc

MPI: How to measure actual time correctly, with or without MPI_Barrier?

My MPI program to measure broadcast time:

MPI_Barrier(MPI_COMM_WORLD); 
total_mpi_bcast_time -= MPI_Wtime(); 
MPI_Bcast(data, num_elements, MPI_INT, 0, MPI_COMM_WORLD); 
MPI_Barrier(MPI_COMM_WORLD); 
total_mpi_bcast_time += MPI_Wtime();

We need MPI_Barrier to wait until all processes have finished their work (synchronization). But MPI_Barrier is itself a collective communication (all processes report to the root process before the program continues), so the measured time will be Barrier_time + Broadcast_time. How can I measure only the broadcast time correctly?
This is the result from Scalasca:

Estimated aggregate size of event trace:                   1165 bytes
Estimated requirements for largest trace buffer (max_buf): 292 bytes
Estimated memory requirements (SCOREP_TOTAL_MEMORY):       4097kB
(hint: When tracing set SCOREP_TOTAL_MEMORY=4097kB to avoid intermediate flushes
or reduce requirements using USR regions filters.)

flt     type max_buf[B] visits time[s] time[%] time/visit[us]  region
        ALL     291       32   0.38    100.0       11930.30  ALL
        MPI     267       28   0.38    100.0       13630.27  MPI
        COM     24        4    0.00     0.0          30.54  COM

        MPI     114       8    0.00     0.1          33.08  MPI_Barrier
        MPI     57        4    0.00     0.0          26.53  MPI_Bcast
        MPI     24        4    0.00     0.2         148.50  MPI_Finalize
        MPI     24        4    0.00     0.0           0.57  MPI_Comm_size
        MPI     24        4    0.00     0.0           1.61  MPI_Comm_rank
        MPI     24        4    0.38    99.7       95168.50  MPI_Init
        COM     24        4    0.00     0.0          30.54  main

But I don't know how they measure it. Even if I run it on a single machine, is the cost of MPI_Bcast really 0%?

Solution

From your example it seems that what you want to know is "the time from the first process entering the Bcast call until the time of the last process leaving the Bcast call". Note that not all of that time is actually spent inside MPI_Bcast. In fact, it is perfectly possible that some processes have left the Bcast call before others have even entered.

Anyway, probably the best way to go is to measure, on each process, the time from leaving the first barrier to leaving MPI_Bcast, and then use a reduction to find the maximum over all processes:

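double local_mpi_bcast_time = 0.0;

/* Line the processes up, then start the local clock. */
MPI_Barrier(MPI_COMM_WORLD);
local_mpi_bcast_time -= MPI_Wtime();

MPI_Bcast(data, num_elements, MPI_INT, 0, MPI_COMM_WORLD);
local_mpi_bcast_time += MPI_Wtime();

/* The slowest process determines the result; it is only valid on rank 0. */
MPI_Reduce(&local_mpi_bcast_time, &total_mpi_bcast_time, 1, MPI_DOUBLE,
           MPI_MAX, 0, MPI_COMM_WORLD);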

This is still not 100% accurate, because processes may leave the barrier at slightly different times, but it is about the best you can get using MPI alone.

I also suggest you take a look at the performance analysis tools suggested by Zulan, because they take care of all the multi-process communication idiosyncrasies.