CUDA cufft library 2D FFT only the left half plane correct

I am doing 2D FFT on 128 images of size 128 x 128 using CUFFT library. The way I used the library is the following:

unsigned int nx = 128; unsigned int ny = 128; unsigned int nz = 128;
// Make 2D fft batch plan
int n[2] = {nx, ny};
int inembed[] = {nx, ny};
int onembed[] = {nx, ny};

cufftPlanMany(&plan,
            2, // rank
            n, // dimension
            inembed,
            1, // istride
            nx * ny, // idist
            onembed,
            1, //ostride
            nx * ny, // odist
            CUFFT_D2Z,
            nz);
cufftSetCompatibilityMode(plan,CUFFT_COMPATIBILITY_NATIVE)

// Create output array
complex<double>* out_complex = new complex<double>[nx * ny * nz];
// Initialize output array
for (unsigned int i = 0; i < nx * ny * nz; i++) {
   out_complex[i].real(0);
   out_complex[i].imag(0);
}
cudaMalloc( (void**)&idata, sizeof(cufftDoubleReal) * nx * ny * nz );
cudaMalloc( (void**)&odata, sizeof(cufftDoubleComplex) * nx * ny * nz );
cudaMemcpy( idata, in_real, nx * ny  * nz * sizeof(cufftDoubleReal), 
                                  cudaMemcpyHostToDevice )  );
cudaMemcpy( odata, out_complex, nx * ny * nz *  sizeof(cufftDoubleComplex), 
                                  cudaMemcpyHostToDevice )  );

cufftExecD2Z( plan, idata, odata );

cudaMemcpy( out_complex, odata, nx * ny * nz * sizeof(cufftDoubleComplex),
                                  cudaMemcpyDeviceToHost ) );

The input in_real on the host is a big array holding the 3D images, which is a double array. I guess there should be no problem converting to/from double from/to cufftDoubleReal and complex from/to cufftDoubleComplex? I am a little suspicious about the way the plan was made and the parameters, which I tried to find some example online but they are not that helpful nor consistent. Then I just set the parameters via the programming guide using my own understanding.

As indicate by the title, the output is partially correct (the left half plane), with the right half plane zeros, which makes me so confused. I tried to set different types of compatibility mode but it was not that helpful. The version that I am comparing to is the MATLAB fft2().

Solution

You need to (re)read the documentation for real to complex transforms. Quoting:

In many practical applications the input vector is real-valued. It can be easily shown that in this case the output satisfies Hermitian symmetry ( X k = X N − k ∗ , where the star denotes complex conjugation). The converse is also true: for complex-Hermitian input the inverse transform will be purely real-valued. cuFFT takes advantage of this redundancy and works only on the first half of the Hermitian vector

i.e. the output of a real to complex transform is symmetrical, and cuFFT exploits this by not calculating the redundant (symmetrical) coefficients. Thus, it is normal to only get "half" the output of transform, because the other "half" is identical. This is not unique to cuFFT, FFTW and most other high performance FFT libraries work this way for real to complex transforms and complex to real inverse transforms.