cuFFT R2C batch output size doesn't match input size

I'm experimenting with batched transforms in cuFFT, but I don't think I'm getting the right output.

int NX = 16;    // length of each array
int BATCH = 16; // number of batches

I'm allocating two arrays on the GPU:

float *src;
cufftComplex *dst;
cudaMalloc((void**)&src, sizeof(float)*NX*BATCH);
cudaMalloc((void**)&dst, sizeof(cufftComplex)*NX*BATCH);

I'm initializing the source array with a simple kernel like this:

__global__ void initFloatArray(float *data, const int size) {
  const int i = (blockIdx.x * blockDim.x) + threadIdx.x;
  if (i < size) {
    data[i] = i % NX;
  }
}

So basically, each array holds the values 0 to 15, and this pattern is repeated 16 times.

I create my plan like this:

cufftPlanMany(&plan, 1, &NX, nullptr, 1, NX, nullptr, 1, NX, CUFFT_R2C, BATCH);

and then I'm executing my plan:

cufftExecR2C(plan, src, dst);

Finally, I transfer the content of dst back to the host. But when I print out the values, I'm getting this:

 BATCH 0:
  <120, 0>.length = 120
  <-8, 40.2187>.length = 41.0066
  <-8, 19.3137>.length = 20.905
  <-8, 11.9728>.length = 14.3996
  <-8, 8>.length = 11.3137
  <-8, 5.34543>.length = 9.62152
  <-8, 3.31371>.length = 8.65914
  <-8, 1.5913>.length = 8.15673
  <-8, 0>.length = 8
  <120, 0>.length = 120
  <-8, 40.2187>.length = 41.0066
  <-8, 19.3137>.length = 20.905
  <-8, 11.9728>.length = 14.3996
  <-8, 8>.length = 11.3137
  <-8, 5.34543>.length = 9.62152
  <-8, 3.31371>.length = 8.65914
 BATCH 1:
  <-8, 1.5913>.length = 8.15673
  <-8, 0>.length = 8
  <120, 0>.length = 120
  <-8, 40.2187>.length = 41.0066
  <-8, 19.3137>.length = 20.905
  <-8, 11.9728>.length = 14.3996
  ...

I was expecting a repetitive output, but it repeats every 9 numbers instead of every 16 like it should.

Am I doing something wrong? Or is there something I'm not understanding?

Solution

The DFT of a real-valued signal exhibits Hermitian symmetry (see real-input DFT on Wikipedia): output N-k is the complex conjugate of output k. As a result, the full N complex output values of an N-point DFT can be constructed from only the first N/2+1 output values (i.e. the other outputs are redundant).
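To make the symmetry concrete, here is a minimal host-side sketch in plain C++ (no cuFFT involved). It computes a naive O(N^2) DFT of the same 0..15 ramp used in the question and rebuilds the full spectrum from the first N/2+1 bins. The helper names naive_dft, expand_half_spectrum and ramp are made up for this illustration and are not part of any library.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Hypothetical helper: naive O(N^2) forward DFT of a real signal,
// X[k] = sum_j x[j] * exp(-2*pi*i*k*j/N).
std::vector<cplx> naive_dft(const std::vector<double>& x) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    std::vector<cplx> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cplx sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            const double angle =
                -2.0 * pi * static_cast<double>(k * j) / static_cast<double>(n);
            sum += x[j] * cplx(std::cos(angle), std::sin(angle));
        }
        out[k] = sum;
    }
    return out;
}

// Hypothetical helper: rebuild all N bins from the first N/2+1 bins,
// using the Hermitian symmetry X[N-k] = conj(X[k]).
std::vector<cplx> expand_half_spectrum(const std::vector<cplx>& half,
                                       std::size_t n) {
    assert(half.size() == n / 2 + 1);
    std::vector<cplx> full(n);
    for (std::size_t k = 0; k < n; ++k) {
        full[k] = (k <= n / 2) ? half[k] : std::conj(half[n - k]);
    }
    return full;
}

// Hypothetical helper: the ramp 0, 1, ..., n-1 from the question.
std::vector<double> ramp(std::size_t n) {
    std::vector<double> x(n);
    for (std::size_t j = 0; j < n; ++j) {
        x[j] = static_cast<double>(j);
    }
    return x;
}
```

Calling naive_dft(ramp(16)) should give bins 0 through 8 matching the nine values printed in the question (120, then -8 + 40.2187i, and so on down to -8 + 0i), and passing those nine bins to expand_half_spectrum with n = 16 should reproduce bins 9 through 15 as their conjugates.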

Correspondingly, and as with many FFT implementations for real-valued inputs, cuFFT does not return the redundant upper portion of the spectrum (as indicated in section 2.4 of the cuFFT library user's guide). In your case, with a 16-point FFT, you thus get 16/2 + 1 = 9 non-redundant outputs per transform. Those 9 values per FFT are packed back-to-back in your final dst buffer, so a new FFT result starts every 9 complex numbers, not every 16.
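The packing described above can be sketched as a few index helpers. This assumes the tightly packed layout observed in the question's output (each batch occupying NX/2+1 consecutive complex values); the names r2c_bins, r2c_index and r2c_total are hypothetical and only serve to illustrate the arithmetic.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of complex values stored per batch by a
// real-to-complex transform of length nx.
constexpr std::size_t r2c_bins(std::size_t nx) {
    return nx / 2 + 1;
}

// Hypothetical helper: offset, in complex elements, of bin k of batch b,
// assuming the batches are packed back-to-back.
constexpr std::size_t r2c_index(std::size_t nx, std::size_t b, std::size_t k) {
    return b * r2c_bins(nx) + k;
}

// Hypothetical helper: total number of complex elements the packed
// output occupies.
constexpr std::size_t r2c_total(std::size_t nx, std::size_t batch) {
    return batch * r2c_bins(nx);
}

// For the question's sizes: 9 bins per batch, so batch 1 starts at element 9.
static_assert(r2c_bins(16) == 9, "16-point R2C keeps 9 bins");
static_assert(r2c_index(16, 1, 0) == 9, "batch 1 starts at offset 9");
```

With these, the print loop on the host would read bin k of batch b from dst[r2c_index(NX, b, k)] with k running from 0 to NX/2, rather than stepping through 16 values per batch. Under the same assumption, the output only needs r2c_total(NX, BATCH) = 144 complex elements; the NX*BATCH elements allocated in the question are more than required, which is harmless but leaves the tail of the buffer unwritten.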