
How much can MPI_Alltoall outperform MPI_Alltoallv?


What is the difference in running time between MPI_Alltoallv and MPI_Alltoall when the amount of transferred data is approximately the same? I couldn't find any such benchmark results. I am interested in large-scale instances, with tens of thousands, or better hundreds of thousands, of MPI processes, where these processes make up a substantial part of a given HPC system (ideally a modern one, such as BG/Q, Cray XC30, Cray XE6, ...).


Solution

  • Overview

    One of the big advantages of MPI_Alltoall is that protocol decisions can be made quickly because they depend on a handful of scalars. In contrast, if a library implementer wants to optimize MPI_Alltoallv, they have to scan four vectors to determine if, for example, the communication is nearly homogeneous, highly sparse, or some other pattern.

    The other issue is that MPI_Alltoall can easily use the output buffer as scratch space because every process provides and consumes the same amount of data. For MPI_Alltoallv, it's not practical to do all the bookkeeping, so any scratch space is going to be allocated. I can't remember the specifics of this issue, but I think I've read it somewhere in the MPI canon.
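    As a rough illustration of the scan cost mentioned above, here is a hypothetical pre-scan that an MPI_Alltoallv implementation (or the caller) would have to perform just to classify the communication pattern; the function name and the classification thresholds are made up for this sketch.

    ```c
    /* Hypothetical classification a vector-collective path might need before it
     * can choose an algorithm: one O(comm_size) pass over the send counts.
     * MPI_Alltoall never pays this cost, because a single scalar describes every
     * pair. The thresholds below are illustrative only. */
    typedef enum { PATTERN_HOMOGENEOUS, PATTERN_SPARSE, PATTERN_IRREGULAR } pattern_t;

    static pattern_t classify_counts(const int *sendcounts, int comm_size)
    {
        int zeros = 0, min = sendcounts[0], max = sendcounts[0];
        for (int i = 0; i < comm_size; ++i) {
            if (sendcounts[i] == 0) ++zeros;
            if (sendcounts[i] < min) min = sendcounts[i];
            if (sendcounts[i] > max) max = sendcounts[i];
        }
        if (zeros > comm_size / 2) return PATTERN_SPARSE;      /* mostly empty pairs */
        if (max - min <= 1)        return PATTERN_HOMOGENEOUS; /* nearly constant counts */
        return PATTERN_IRREGULAR;
    }
    ```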

    Implementation Skeletons

    There are at least two special cases of alltoallv for which one can optimize better than the MPI library can:

    1. Nearly homogeneous communication, i.e. the count vectors are nearly constant. This can happen when you have a distributed array that doesn't divide evenly across the process grid. In this case, you can:

      a. Pad your arrays and use MPI_Alltoall directly (see the first sketch after this list).

      b. Use MPI_Alltoall for the subset of processes that have homogeneous communication and either MPI_Alltoallv or a batch of Send-Recv for the remainder. This works best if you can cache the associated communicators. Using nonblocking communication should help too.

      c. Write your own implementation of Bruck's algorithm that handles the case where the count varies, which is likely at the end of your vector. Having not done this myself, I don't know how difficult or worthwhile it is.

    2. Sparse communication, i.e. the count vector contains a large number of zeros. For this case, just use a batch of nonblocking Send-Recv and Waitall (see the second sketch after this list), because that's likely the best the MPI library will ever do, and doing it yourself allows you to tune the batch size if you want.
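    For case 1a, a minimal sketch of the padding approach, assuming the data is MPI_DOUBLE and contiguous: every process rounds its per-peer count up to the global maximum, exchanges fixed-size slots with MPI_Alltoall, and then compacts the result using the true counts. The function and buffer names are illustrative, not a drop-in implementation, and the extra memory and copies only pay off when the counts are nearly constant.

    ```c
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* Case 1a sketch: pad a nearly homogeneous exchange so MPI_Alltoall can be
     * used instead of MPI_Alltoallv. sendcounts/recvcounts are the true
     * per-peer counts; sendbuf/recvbuf are packed contiguously in that order. */
    void padded_alltoall(const double *sendbuf, const int *sendcounts,
                         double *recvbuf, const int *recvcounts, MPI_Comm comm)
    {
        int size;
        MPI_Comm_size(comm, &size);

        /* Find the largest count any pair of processes exchanges. */
        int mymax = 0, pad;
        for (int i = 0; i < size; ++i)
            if (sendcounts[i] > mymax) mymax = sendcounts[i];
        MPI_Allreduce(&mymax, &pad, 1, MPI_INT, MPI_MAX, comm);

        double *stmp = calloc((size_t)size * pad, sizeof(double));
        double *rtmp = malloc((size_t)size * pad * sizeof(double));

        /* Scatter the real data into fixed-size slots (the remainder stays zero). */
        for (int i = 0, off = 0; i < size; off += sendcounts[i], ++i)
            memcpy(stmp + (size_t)i * pad, sendbuf + off, sendcounts[i] * sizeof(double));

        MPI_Alltoall(stmp, pad, MPI_DOUBLE, rtmp, pad, MPI_DOUBLE, comm);

        /* Compact the received slots back down to the true counts. */
        for (int i = 0, off = 0; i < size; off += recvcounts[i], ++i)
            memcpy(recvbuf + off, rtmp + (size_t)i * pad, recvcounts[i] * sizeof(double));

        free(stmp);
        free(rtmp);
    }
    ```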
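    For case 2, a minimal sketch of replacing a sparse MPI_Alltoallv with nonblocking point-to-point calls and a single MPI_Waitall. The displacements follow the usual alltoallv convention (element offsets into the packed buffers); batching of the requests, which is the main tuning knob mentioned above, is left out for brevity.

    ```c
    #include <mpi.h>
    #include <stdlib.h>

    /* Case 2 sketch: sparse exchange done with Irecv/Isend + Waitall.
     * Only nonzero counts generate messages. */
    void sparse_exchange(const double *sendbuf, const int *sendcounts, const int *sdispls,
                         double *recvbuf, const int *recvcounts, const int *rdispls,
                         MPI_Comm comm)
    {
        int size;
        MPI_Comm_size(comm, &size);

        MPI_Request *reqs = malloc(2 * (size_t)size * sizeof(MPI_Request));
        int nreq = 0;

        /* Post all receives first, then the sends, skipping empty pairs. */
        for (int i = 0; i < size; ++i)
            if (recvcounts[i] > 0)
                MPI_Irecv(recvbuf + rdispls[i], recvcounts[i], MPI_DOUBLE,
                          i, 0, comm, &reqs[nreq++]);
        for (int i = 0; i < size; ++i)
            if (sendcounts[i] > 0)
                MPI_Isend(sendbuf + sdispls[i], sendcounts[i], MPI_DOUBLE,
                          i, 0, comm, &reqs[nreq++]);

        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
    }
    ```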

    Papers

    MPI on a Million Processors describes the scalability issues associated with vector collectives. Granted, you may not see the cost of scanning the vector arguments on most CPUs, but it is an O(n) problem that motivates implementers not to touch the vector arguments more than necessary.

    HykSort: a new variant of hypercube quicksort on distributed memory architectures describes a custom implementation that performs much better than optimized libraries. Such an optimization is difficult to implement inside an MPI library because it is rather specialized. (This reference is targeted at Hristo's comment, not your question, by the way.)

    Code

    You can discover some interesting things by comparing the implementations of these operations in MPICH (https://github.com/pmodels/mpich/blob/main/src/mpi/coll/alltoall.c and https://github.com/pmodels/mpich/blob/main/src/mpi/coll/alltoallv.c). Only MPI_Alltoall uses Bruck's algorithm and pairwise exchange. Similar conclusions can be drawn from the available options for I_MPI_ADJUST_ALLTOALL and I_MPI_ADJUST_ALLTOALLV on https://software.intel.com/en-us/node/528906. Whether these limitations are fundamental or merely practical is left as an exercise for the reader.

    Practical Experience

    MPI_Alltoall on Blue Gene/P used DCMF_Alltoallv (source code), so there was no difference relative to MPI_Alltoallv, and the latter might even have been better since the application pre-populated the vector arguments.

    I wrote a version of all-to-all exchange for Blue Gene/Q that was as fast as MPI_Alltoall. My version was agnostic to constant versus vector arguments so this result implies that MPI_Alltoallv would perform similarly to MPI_Alltoall. However, I can't find the code now to be absolutely sure of the details.

    However, Blue Gene networks were rather special, particularly w.r.t. all-to-all, so the behavior on fat-tree or dragonfly networks, on systems where the CPU is much faster than the network, will be quite different.

    I suggest you write a benchmark and run it where you intend to run your application. Once you have some data, it will be much easier to figure out which optimizations are being missed.
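    A minimal benchmark skeleton along those lines, timing both collectives with identical per-pair message sizes, is shown below. The COUNT and NITER values and the reporting choices are placeholders to adapt to the system under test; reporting the maximum over ranks reflects what the application actually waits for.

    ```c
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Timing skeleton: push the same volume through MPI_Alltoall and MPI_Alltoallv.
     * COUNT and NITER are placeholders; tune them for the machine under test. */
    #define COUNT 1024
    #define NITER 100

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *sbuf = malloc((size_t)size * COUNT * sizeof(double));
        double *rbuf = malloc((size_t)size * COUNT * sizeof(double));
        int *counts = malloc(size * sizeof(int));
        int *displs = malloc(size * sizeof(int));
        for (int i = 0; i < size; ++i) { counts[i] = COUNT; displs[i] = i * COUNT; }
        for (int i = 0; i < size * COUNT; ++i) sbuf[i] = (double)rank;

        /* Warm up, then time MPI_Alltoall. */
        MPI_Alltoall(sbuf, COUNT, MPI_DOUBLE, rbuf, COUNT, MPI_DOUBLE, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int it = 0; it < NITER; ++it)
            MPI_Alltoall(sbuf, COUNT, MPI_DOUBLE, rbuf, COUNT, MPI_DOUBLE, MPI_COMM_WORLD);
        double t_a2a = (MPI_Wtime() - t0) / NITER;

        /* Warm up, then time MPI_Alltoallv with identical counts. */
        MPI_Alltoallv(sbuf, counts, displs, MPI_DOUBLE,
                      rbuf, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int it = 0; it < NITER; ++it)
            MPI_Alltoallv(sbuf, counts, displs, MPI_DOUBLE,
                          rbuf, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);
        double t_a2av = (MPI_Wtime() - t0) / NITER;

        /* Report the slowest rank for each collective. */
        double max_a2a, max_a2av;
        MPI_Reduce(&t_a2a, &max_a2a, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        MPI_Reduce(&t_a2av, &max_a2av, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("alltoall: %g s  alltoallv: %g s per call\n", max_a2a, max_a2av);

        free(sbuf); free(rbuf); free(counts); free(displs);
        MPI_Finalize();
        return 0;
    }
    ```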