I am currently working on a problem that involves calculating the Fourier Transform of a 2-dimensional Matrix using FFTW in conjunction with MPI.
According to the first section of this page of the FFTW documentation, computing the FT of a matrix requires a lot of communication among the launched processes when it is parallelized. For example, in the case of a matrix such as
1,2,3
4,5,6
7,8,9
the FFT of which we decided to compute with three processes, each process gets one chunk of the matrix, so that the data is stored in the local memory of each process as P_0 [1,2,3], P_1 [4,5,6], P_2 [7,8,9].
After computing the initial Fourier transform, a transposition is performed, which requires the computed data to be communicated among the three processes as follows:
P_0 -> P_1: 2
P_0 -> P_2: 3
P_1 -> P_0: 4
P_1 -> P_2: 6
P_2 -> P_0: 7
P_2 -> P_1: 8
Since this incurs a large overhead, the second section of the above-mentioned FFTW documentation page suggests calculating the transpose directly in Fourier space to drastically reduce the amount of cross-process communication required.
I am still unsure why the calculation of a transpose is necessary in the first place, and how the data computed from the different chunks of the matrix is merged in the end. Why is it enough to set the FFTW_MPI_TRANSPOSED_OUT and FFTW_MPI_TRANSPOSED_IN flags to no longer require cross-process communication, even though before the processes needed so much data from all the other processes?
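For context, my setup looks roughly like this (just a sketch: the 3x3 size and the FFTW_ESTIMATE flag are placeholders, and error handling is omitted):

#include <mpi.h>
#include <fftw3-mpi.h>

int main(int argc, char **argv)
{
    const ptrdiff_t N0 = 3, N1 = 3;      /* global matrix size (placeholder) */
    ptrdiff_t local_n0, local_0_start;

    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    /* FFTW tells each process how many contiguous rows it owns; with three
       processes and N0 = 3, each rank gets one row, as in the example above */
    ptrdiff_t alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
                                                   &local_n0, &local_0_start);
    fftw_complex *data = fftw_alloc_complex(alloc_local);

    /* plan a distributed 2D DFT over the whole N0 x N1 array */
    fftw_plan plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD,
                                          FFTW_FORWARD, FFTW_ESTIMATE);

    /* ... fill the local rows, fftw_execute(plan), use the result ... */

    fftw_destroy_plan(plan);
    fftw_free(data);
    MPI_Finalize();
    return 0;
}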
The 2D FFT is computed by first computing 1D FFTs over the rows of the array, then 1D FFTs over the columns (or the other way around, it doesn't matter). The first transpose mentioned in the documentation is required to bring the data for one column together in one process so that its 1D FFT can be computed. The second transpose is required to get the resulting array back in the right order.
step 1:
| a b c | -> 1D FFT -> | A B C |
| d e f | -> 1D FFT -> | D E F | (each row is data belonging to one process)
| g h i | -> 1D FFT -> | G H I |
step 2:
| A B C | -> transpose -> | A D G |
| D E F | -> transpose -> | B E H | (inter-process communication)
| G H I | -> transpose -> | C F I |
step 3:
| A D G | -> 1D FFT -> | J M P |
| B E H | -> 1D FFT -> | K N Q |
| C F I | -> 1D FFT -> | L O R |
step 4:
| J M P | -> transpose -> | J K L |
| K N Q | -> transpose -> | M N O | (the actual 2D FFT of the initial array)
| L O R | -> transpose -> | P Q R |
If you are happy using the result of step 3 instead of that of step 4, then you don't need to do step 4, which saves a lot of communication time. Whether the transposed result is directly usable or the final transpose is needed depends on the application, and this is why the library lets you make that choice.
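To make these steps concrete, here is a minimal single-process sketch in plain (serial) FFTW, with the 3x3 size as a placeholder; it only illustrates the algorithm, and in the MPI version it is the two transpose calls that require the communication shown above:

#include <string.h>
#include <fftw3.h>

#define N 3

/* in-place transpose of an N x N complex matrix (swap across the diagonal) */
static void transpose(fftw_complex *a)
{
    fftw_complex tmp;
    for (int i = 0; i < N; ++i)
        for (int j = i + 1; j < N; ++j) {
            memcpy(&tmp,        &a[i*N + j], sizeof tmp);
            memcpy(&a[i*N + j], &a[j*N + i], sizeof tmp);
            memcpy(&a[j*N + i], &tmp,        sizeof tmp);
        }
}

int main(void)
{
    fftw_complex *a = fftw_alloc_complex(N * N);
    int n = N;

    /* one plan that runs a 1D FFT over each of the N contiguous rows, in place */
    fftw_plan rows = fftw_plan_many_dft(1, &n, N,
                                        a, NULL, 1, N,
                                        a, NULL, 1, N,
                                        FFTW_FORWARD, FFTW_ESTIMATE);

    /* ... fill a[] with the input matrix ... */

    fftw_execute(rows);  /* step 1: FFT each row             */
    transpose(a);        /* step 2: columns become rows      */
    fftw_execute(rows);  /* step 3: FFT each (former) column */
    transpose(a);        /* step 4: back to the original layout; skip this
                            if the transposed result is good enough */

    fftw_destroy_plan(rows);
    fftw_free(a);
    return 0;
}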
Whether one specifies that the input or the output is transposed affects the order in which the array dimensions are passed to the planner. In both cases, the last transpose is skipped:
Note that FFTW_MPI_TRANSPOSED_IN is completely equivalent to performing FFTW_MPI_TRANSPOSED_OUT and passing the first two dimensions to the planner in reverse order, or vice versa. If you pass both the FFTW_MPI_TRANSPOSED_IN and FFTW_MPI_TRANSPOSED_OUT flags, it is equivalent to swapping the first two dimensions passed to the planner and passing neither flag.
It is not possible to skip both transposes.
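Expressed in code, and assuming a complex-to-complex transform planned with fftw_mpi_plan_dft_2d, the equivalence in that quote looks roughly like this (a sketch only; buffer allocation and MPI setup are omitted):

#include <fftw3-mpi.h>

/* The two plans below describe the same transform; N0, N1, in and out are
   placeholders supplied by the caller. */
void make_equivalent_plans(ptrdiff_t N0, ptrdiff_t N1,
                           fftw_complex *in, fftw_complex *out)
{
    /* "input is transposed" on an N0 x N1 transform ... */
    fftw_plan p_in  = fftw_mpi_plan_dft_2d(N0, N1, in, out, MPI_COMM_WORLD,
                                           FFTW_FORWARD,
                                           FFTW_ESTIMATE | FFTW_MPI_TRANSPOSED_IN);

    /* ... is equivalent to "output is transposed" with the dimensions swapped */
    fftw_plan p_out = fftw_mpi_plan_dft_2d(N1, N0, in, out, MPI_COMM_WORLD,
                                           FFTW_FORWARD,
                                           FFTW_ESTIMATE | FFTW_MPI_TRANSPOSED_OUT);

    fftw_destroy_plan(p_in);
    fftw_destroy_plan(p_out);
}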
For example, to compute a convolution through the FFT, one would apply the forward FFT to an array, multiply by the kernel in Fourier space, and then apply the inverse FFT. In this case, the multiplication works just as well on a transposed array, so we can use FFTW_MPI_TRANSPOSED_OUT for the forward FFT and FFTW_MPI_TRANSPOSED_IN for the inverse FFT. The final result is identical to what we'd get if we didn't pass any flags and let all the transpositions happen, but it is computed more efficiently.
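A sketch of that convolution pattern for a complex-to-complex, in-place transform; the kernel array is assumed to already hold the kernel's Fourier transform stored in the transposed layout, and N0, N1 are placeholders supplied by the caller:

#include <fftw3-mpi.h>

void convolve(ptrdiff_t N0, ptrdiff_t N1, MPI_Comm comm)
{
    ptrdiff_t local_n0, local_0_start, local_n1, local_1_start;

    /* local sizes for both the normal and the transposed layout */
    ptrdiff_t alloc_local = fftw_mpi_local_size_2d_transposed(
        N0, N1, comm, &local_n0, &local_0_start, &local_n1, &local_1_start);

    fftw_complex *data   = fftw_alloc_complex(alloc_local);
    fftw_complex *kernel = fftw_alloc_complex(alloc_local); /* kernel's transform,
                                                               stored transposed */

    /* forward: normal input layout (rows of the N0 x N1 array), transposed output */
    fftw_plan fwd = fftw_mpi_plan_dft_2d(N0, N1, data, data, comm, FFTW_FORWARD,
                                         FFTW_ESTIMATE | FFTW_MPI_TRANSPOSED_OUT);
    /* inverse: transposed input, normal output */
    fftw_plan inv = fftw_mpi_plan_dft_2d(N0, N1, data, data, comm, FFTW_BACKWARD,
                                         FFTW_ESTIMATE | FFTW_MPI_TRANSPOSED_IN);

    /* ... fill data with the local rows of the input and kernel with the
       (transposed) transform of the convolution kernel ... */

    fftw_execute(fwd);

    /* pointwise multiply in Fourier space; in the transposed layout this
       process holds local_n1 rows of length N0 */
    for (ptrdiff_t i = 0; i < local_n1 * N0; ++i) {
        double re = data[i][0]*kernel[i][0] - data[i][1]*kernel[i][1];
        double im = data[i][0]*kernel[i][1] + data[i][1]*kernel[i][0];
        data[i][0] = re / (double)(N0 * N1);   /* FFTW transforms are unnormalized */
        data[i][1] = im / (double)(N0 * N1);
    }

    fftw_execute(inv);   /* result: the convolution, back in the normal N0 x N1 layout */

    fftw_destroy_plan(fwd);
    fftw_destroy_plan(inv);
    fftw_free(data);
    fftw_free(kernel);
}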