My MPI program to measure broadcast time:
MPI_Barrier(MPI_COMM_WORLD);
total_mpi_bcast_time -= MPI_Wtime();
MPI_Bcast(data, num_elements, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
total_mpi_bcast_time += MPI_Wtime();
We need MPI_Barrier to wait until all processes have finished their work (synchronization). But MPI_Barrier is itself a collective communication (all processes report to the root process before the program continues), so the measured time will be Barrier_time + Broadcast_time.
So how can I measure only the broadcast time correctly?
This is the result from Scalasca:
Estimated aggregate size of event trace: 1165 bytes
Estimated requirements for largest trace buffer (max_buf): 292 bytes
Estimated memory requirements (SCOREP_TOTAL_MEMORY): 4097kB
(hint: When tracing set SCOREP_TOTAL_MEMORY=4097kB to avoid intermediate flushes
or reduce requirements using USR regions filters.)
flt  type  max_buf[B]  visits  time[s]  time[%]  time/visit[us]  region
     ALL          291      32     0.38    100.0        11930.30  ALL
     MPI          267      28     0.38    100.0        13630.27  MPI
     COM           24       4     0.00      0.0           30.54  COM
     MPI          114       8     0.00      0.1           33.08  MPI_Barrier
     MPI           57       4     0.00      0.0           26.53  MPI_Bcast
     MPI           24       4     0.00      0.2          148.50  MPI_Finalize
     MPI           24       4     0.00      0.0            0.57  MPI_Comm_size
     MPI           24       4     0.00      0.0            1.61  MPI_Comm_rank
     MPI           24       4     0.38     99.7        95168.50  MPI_Init
     COM           24       4     0.00      0.0           30.54  main
But I don't know how they measure it. Even when I run it on a single machine, does MPI_Bcast really cost 0%?
From your example it seems that what you want to know is "the time from the first process entering the Bcast call until the time of the last process leaving the Bcast call". Note that not all of that time is actually spent inside MPI_Bcast. In fact, it is perfectly possible that some processes have left the Bcast call before others have even entered.
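To make that distinction concrete, here is a small sketch in plain C (no MPI required) that computes both quantities from per-rank enter and exit timestamps. The `rank_times` struct and the three helper names are hypothetical, introduced only for illustration, and the sketch assumes all timestamps come from a common clock, which real machines do not guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-rank record: when the rank entered and left MPI_Bcast,
   in seconds on a common clock (an assumption, real clocks may differ). */
struct rank_times {
    double enter;
    double exit;
};

/* Time from the first rank entering the call until the last rank leaving. */
double bcast_span(const struct rank_times *t, size_t n)
{
    assert(n > 0);
    double first_enter = t[0].enter;
    double last_exit = t[0].exit;
    for (size_t i = 1; i < n; i++) {
        if (t[i].enter < first_enter) first_enter = t[i].enter;
        if (t[i].exit > last_exit) last_exit = t[i].exit;
    }
    return last_exit - first_enter;
}

/* Longest time that any single rank spent inside the call. */
double max_time_inside(const struct rank_times *t, size_t n)
{
    assert(n > 0);
    double longest = t[0].exit - t[0].enter;
    for (size_t i = 1; i < n; i++) {
        double d = t[i].exit - t[i].enter;
        if (d > longest) longest = d;
    }
    return longest;
}

/* Returns 1 if some rank left the call before another rank entered it. */
int some_rank_left_before_another_entered(const struct rank_times *t, size_t n)
{
    assert(n > 0);
    double earliest_exit = t[0].exit;
    double latest_enter = t[0].enter;
    for (size_t i = 1; i < n; i++) {
        if (t[i].exit < earliest_exit) earliest_exit = t[i].exit;
        if (t[i].enter > latest_enter) latest_enter = t[i].enter;
    }
    return earliest_exit < latest_enter;
}
```

With two made-up records, one rank inside the call from 0.0 to 1.0 and a late rank inside from 3.0 to 4.0, `bcast_span` gives 4.0 while `max_time_inside` gives only 1.0, and `some_rank_left_before_another_entered` returns 1. The two numbers agree only when all ranks enter at the same moment, which is what the barrier tries to approximate.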
Anyway, probably the best way to go is to measure the time between the first Barrier and the Bcast exit on each process, and use a Reduction to find the maximum:
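/* local_mpi_bcast_time must start at 0.0 on every rank */
MPI_Barrier(MPI_COMM_WORLD);            /* line up all ranks before timing */
local_mpi_bcast_time -= MPI_Wtime();
MPI_Bcast(data, num_elements, MPI_INT, 0, MPI_COMM_WORLD);
local_mpi_bcast_time += MPI_Wtime();    /* time this rank needed */
/* keep the slowest rank's time; the result is only valid on rank 0 */
MPI_Reduce(&local_mpi_bcast_time, &total_mpi_bcast_time, 1, MPI_DOUBLE,
           MPI_MAX, 0, MPI_COMM_WORLD);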
This is still not 100% accurate, because processes may leave the barrier at slightly different times, but it is about the best you can get with MPI alone.
I also suggest you take a look at the performance analysis tools suggested by Zulan, because they take care of all the multi-process communication idiosyncrasies.
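As for the 0.0% that Scalasca shows for MPI_Bcast in the question: judging from the table itself, that looks like a rounding effect rather than a free broadcast. The time[%] column is each region's share of the total run time, and MPI_Init dominates that total. Below is a minimal sketch in plain C that redoes the arithmetic from the visits and time/visit[us] columns; the function name `region_percent` is made up for illustration and is not part of Scalasca or Score-P.

```c
#include <assert.h>

/* Share of the total run time taken by one region, in percent.
   The inputs mirror the "visits" and "time/visit[us]" columns of the
   table; total_us is the total time of the ALL row in microseconds. */
double region_percent(long visits, double time_per_visit_us, double total_us)
{
    assert(total_us > 0.0);
    return 100.0 * (double)visits * time_per_visit_us / total_us;
}
```

Taking the ALL row as the total (32 visits at 11930.30 us each), `region_percent(4, 26.53, 32 * 11930.30)` gives roughly 0.03% for MPI_Bcast, which is displayed as 0.0 with one decimal place, while the same call with 95168.50 us per visit gives roughly 99.7% for MPI_Init. So the broadcast does cost about 26 us per call here; it is just tiny next to the start-up cost.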