In MPI, is MPI_Bcast purely a convenience function, or is there an efficiency advantage to using it instead of just looping over all ranks and sending the same message to all of them?
Rationale: MPI_Bcast's behavior of sending the message to everyone, including the root, is inconvenient for me, so I'd rather not use it unless there's a good reason, or unless it can be made to not send the message to the root.
Using MPI_Bcast will almost always be more efficient than rolling your own. A lot of work has gone into every MPI implementation to optimise collective operations based on factors such as the message size and the communication architecture.
For example, MPI_Bcast in MPICH2 uses a different algorithm depending on the size of the message. For short messages, a binomial tree is used to minimise processing load and latency. For long messages, it is implemented as a binomial-tree scatter followed by an allgather.
In addition, HPC vendors often provide MPI implementations that make efficient use of the underlying interconnects, especially for collective operations. For example, they can use a hardware-supported multicast scheme, or bespoke algorithms that take advantage of the specific interconnect.