Tags: fortran, mpi, fftw

Change locally allocated MPI memory in program


I have an M x N array 'A' that is to be distributed over 'np' processors using MPI along the second dimension (i.e. N is the direction that is scattered). Each processor is initially allocated M x N/np memory by fftw_mpi_local_size_2d (I have used this function from the FFTW MPI interface because, according to the FFTW3 manual, it gives allocations for which SIMD is efficient).

Initialisation:

    alloc_local = fftw_mpi_local_size_2d(M, N, MPI_COMM_WORLD, local_n, local_n_offset)
    pointer1 = fftw_alloc_real(alloc_local)
    call c_f_pointer(pointer1, A, [M, local_n])

At this point, each processor has a slab of A of size M x local_n = M x (N/np).
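
For reference, a minimal self-contained version of this initialisation, assuming the FFTW 2003 Fortran interface (fftw3-mpi.f03) and placeholder sizes, might look as follows; note that the FFTW manual passes the dimensions in reversed (row-major) order, so the scattered dimension N comes first:

    program init_slab
      use, intrinsic :: iso_c_binding
      use mpi
      implicit none
      include 'fftw3-mpi.f03'

      integer(C_INTPTR_T), parameter :: M = 64, N = 64   ! placeholder sizes
      type(C_PTR) :: pointer1
      real(C_DOUBLE), pointer :: A(:,:)
      integer(C_INTPTR_T) :: alloc_local, local_n, local_n_offset
      integer :: ierr

      call MPI_Init(ierr)
      call fftw_mpi_init()

      ! Dimensions in row-major (C) order: the first argument (N) is the
      ! one distributed across processes; local_n is this rank's share of N.
      alloc_local = fftw_mpi_local_size_2d(N, M, MPI_COMM_WORLD, &
                                           local_n, local_n_offset)

      pointer1 = fftw_alloc_real(alloc_local)
      call c_f_pointer(pointer1, A, [M, local_n])

      ! ... fill and transform A(1:M, 1:local_n) here ...

      call fftw_free(pointer1)
      call fftw_mpi_cleanup()
      call MPI_Finalize(ierr)
    end program init_slab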

While doing a Fourier transform A(x,y) -> A(x,ky), y is the vertical direction in the array A (not the MPI-partitioned axis). In Fourier space I have to store (M+2) x local_n elements: for a 1D real array of length M, the Fourier-space representation has M+2 elements when using FFTW3's dfftw_execute_dft_r2c.
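
As a 1D illustration of that size change (a sketch using the legacy FFTW Fortran interface; M = 8 is just a placeholder), a real array of length M transforms to M/2 + 1 complex coefficients, i.e. M + 2 real values when M is even:

    program r2c_size
      implicit none
      include 'fftw3.f'

      integer, parameter :: M = 8
      double precision :: in(M)
      double complex   :: out(M/2 + 1)   ! M/2+1 complex values = M+2 reals (even M)
      integer*8 :: plan
      integer :: i

      do i = 1, M
         in(i) = dble(i)
      end do

      call dfftw_plan_dft_r2c_1d(plan, M, in, out, FFTW_ESTIMATE)
      call dfftw_execute_dft_r2c(plan, in, out)
      call dfftw_destroy_plan(plan)

      print *, 'real input length:', M
      print *, 'complex outputs  :', M/2 + 1, '(=', 2*(M/2 + 1), 'reals)'
    end program r2c_size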

These Fourier-space operations I can do in dummy matrices on every processor independently.

There is one operation where I have to apply a y Fourier transform and an x Fourier cosine transform consecutively. To parallelise the operations in all-Fourier space, I want to gather my y-transformed arrays, each of size (M+2) x local_n, into a bigger (M+2) x N array, and scatter them back again after a transpose so that the y direction is the partitioned one, i.e. (N x (M+2)) ----scatter---> (N x (M+2)/np). But each processor has been allocated only M x local_n addresses initially.
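
A sketch of this gather / transpose / scatter step is below; the names (Ayf, slabT) are hypothetical, it funnels everything through rank 0, and it assumes np divides both N and M+2:

    ! Ayf(M+2, local_n) is this rank's y-transformed slab (hypothetical name).
    subroutine gather_transpose_scatter(Ayf, slabT, M, N, local_n, np)
      use mpi
      implicit none
      integer, intent(in) :: M, N, local_n, np
      double precision, intent(in)  :: Ayf(M+2, local_n)
      double precision, intent(out) :: slabT(N, (M+2)/np)
      double precision, allocatable :: big(:,:), bigT(:,:)
      integer :: ierr, rank

      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      if (rank == 0) then
         allocate(big(M+2, N), bigT(N, M+2))
      else
         allocate(big(1,1), bigT(1,1))   ! dummies on non-root ranks
      end if

      ! 1) Gather the (M+2) x local_n slabs into an (M+2) x N array on root.
      !    Column-major storage makes each slab a contiguous block of columns.
      call MPI_Gather(Ayf, (M+2)*local_n, MPI_DOUBLE_PRECISION, &
                      big, (M+2)*local_n, MPI_DOUBLE_PRECISION, &
                      0, MPI_COMM_WORLD, ierr)

      ! 2) Transpose on root so that y becomes the partitioned direction.
      if (rank == 0) bigT = transpose(big)

      ! 3) Scatter N x (M+2)/np column blocks of the transposed array.
      call MPI_Scatter(bigT, N*(M+2)/np, MPI_DOUBLE_PRECISION, &
                       slabT, N*(M+2)/np, MPI_DOUBLE_PRECISION, &
                       0, MPI_COMM_WORLD, ierr)

      deallocate(big, bigT)
    end subroutine gather_transpose_scatter

(Funnelling all data through one rank does not scale well; an MPI_Alltoall exchange, or FFTW's own distributed transpose plans, fftw_mpi_plan_transpose in the C API, would avoid that bottleneck, but that is a separate concern.)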

Even if M = N, each processor still needs N x (local_n + 2/np) elements, which is more than the N x local_n it was given. I could resolve this by increasing the memory allocated per processor.

I don't want to start out with (N+2, N) and (N+2, local_n) arrays, because this would increase the memory requirement for a lot of arrays, while the gymnastics above has to be done only once per iteration.

[Schematic of the steps described above.]


Solution

  • No, you cannot easily change the allocated size of a Fortran array (MPI does not play any role here). What you can do is use a different array for the receive buffer, deallocate the array and reallocate it with the new size, or allocate it with a large enough size in the first place; a rough sketch of these options follows the note below. Different choices are appropriate in different situations. Without seeing your code I would go for the first option, but the last cannot be excluded.

    Note that FFTW3 has parallel transforms built in (with 1D MPI decomposition, which is what you use), including multidimensional transforms.
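
    As a rough sketch of the first and last options (array names and the use of plain allocatables are assumptions; if the receive buffer is itself fed to FFTW transforms, fftw_alloc_real may be preferable for alignment):

        program buffer_options
          implicit none
          integer, parameter :: M = 64, N = 64, np = 2      ! placeholder sizes
          integer, parameter :: local_n = N / np
          double precision, allocatable :: A(:,:), A_transposed(:,:)

          ! Option 1: keep the original M x local_n slab and use a separate
          ! receive buffer sized for the transposed, scattered data.
          allocate(A(M, local_n))
          allocate(A_transposed(N, (M + 2) / np))

          ! ... y-transform in A, gather/transpose/scatter into A_transposed ...

          ! Option 3: alternatively, allocate one buffer large enough for both
          ! uses from the start, e.g. max(M*local_n, N*(M+2)/np) elements.

          deallocate(A, A_transposed)
        end program buffer_options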