I am using a cufft store callback in a complex-to-complex, out-of-place, 1D, batch FFT (i.e. I am doing many 1D FFTs of the same size). From Section 2.9.4 of the documentation, I expect this callback to be called exactly once for each output. See in particular this quote, taken verbatim from the link:
cuFFT will call the load callback routine, for each point in the input, once and only once. Similarly it will call the store callback routine, for each point in the output, once and only once.
Nevertheless, I seem to have an example that contradicts this. In the code below, I expect to see each of the numbers 0-19 appear exactly once, corresponding to the store callback being called exactly once for each output sample. However, when I do 504 1D FFTs of size 32, the store callback gets called twice for each output!
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cufft.h>
#include <cufftXt.h>
// Very simple store callback: prints the index and does the store
static __device__ void stor_cb(void *a, size_t index, cufftComplex z,
                               void *cb_info, void *sharedmem) {
    // Print the index. Each index should appear exactly once.
    if (index < 20) printf("%8llu\n", index);

    // Do the store
    ((cufftComplex *)a)[index] = z;
}
__device__ cufftCallbackStoreC stor_cb_ptr_d = stor_cb;
int main() {
    size_t work_size;

    // With these parameters, the store callback is
    // called twice for each output
    int fft_sz = 32;     // Size of each FFT
    int num_ffts = 504;  // How many FFTs to do

    // With these parameters, the store callback is
    // called once for each output
    // int fft_sz = 1024;  // Size of each FFT
    // int num_ffts = 20;  // How many FFTs to do

    // Buffers
    cufftComplex *in_buf_h, *in_buf_d, *out_buf_d;

    // Allocate buffers on host and device
    in_buf_h = new cufftComplex[fft_sz*num_ffts];
    cudaMalloc(&in_buf_d, fft_sz*num_ffts*sizeof(cufftComplex));
    cudaMalloc(&out_buf_d, fft_sz*num_ffts*sizeof(cufftComplex));

    // Fill input buffer with zeros and copy to device
    memset(in_buf_h, 0, fft_sz*num_ffts*sizeof(cufftComplex));
    cudaMemcpy(in_buf_d, in_buf_h, fft_sz*num_ffts*sizeof(cufftComplex), cudaMemcpyHostToDevice);

    // Plan num_ffts of size fft_sz
    cufftHandle plan;
    cufftCreate(&plan);
    cufftMakePlan1d(plan, fft_sz, CUFFT_C2C, num_ffts, &work_size);

    // Associate the store callback with the plan
    cufftCallbackStoreC stor_cb_ptr_h;
    cudaMemcpyFromSymbol(&stor_cb_ptr_h, stor_cb_ptr_d, sizeof(stor_cb_ptr_h));
    cufftXtSetCallback(plan, (void **)&stor_cb_ptr_h, CUFFT_CB_ST_COMPLEX, 0);

    // Execute the plan. We don't actually care about values. The idea
    // is that the store callback should be called exactly once for
    // each of the fft_sz*num_ffts samples.
    cufftExecC2C(plan, in_buf_d, out_buf_d, CUFFT_FORWARD);

    // Sync the device to flush the output
    cudaDeviceSynchronize();

    return 0;
}
Example output for fft_sz=32, num_ffts=504:
$ stor_cb_tst
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
In contrast, if I do 20 FFTs of size 1024, then I get the expected behavior: the store callback is called exactly once for each output. Example output for fft_sz=1024, num_ffts=20:
$ stor_cb_tst
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Am I misunderstanding something, do I have a bug, or is this a problem with cufft?
I am running this on Linux Mint, with cuda V8.0.61, g++ 5.4.0, on a GeForce GTX 1080:
$ uname -a
Linux orpheus 4.4.0-53-generic #74-Ubuntu SMP Fri Dec 2 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
$ g++ --version
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1080"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 8114 MBytes (8507752448 bytes)
(20) Multiprocessors, (128) CUDA Cores/MP: 2560 CUDA Cores
GPU Max Clock rate: 1848 MHz (1.85 GHz)
Memory Clock rate: 5005 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1080
Result = PASS
Here is my compile command:
$ nvcc -ccbin g++ -dc -m64 -o stor_cb_tst.o -c stor_cb_tst.cu
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
$ nvcc -ccbin g++ -m64 -o stor_cb_tst stor_cb_tst.o -lcufft_static -lculibos
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
$ ./stor_cb_tst
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
I was able to reproduce the observation on CUDA 8, but not on CUDA 9. However, I don't think there is anything wrong here, even with CUDA 8. Let's start by reviewing the documentation more closely:
From CUFFT doc section 2.9.4:
Similarly it will call the store callback routine, for each point in the output, once and only once.
What you've assumed is that each point in the output has a corresponding unique value of index passed to the store callback routine; however, we shall see shortly that this is not the case.
it will only call the store callback routine from the last phase kernel(s).
So we see that the store callback routine may be called from multiple independent kernels (note the use of "kernel(s)") in the last phase of the transformation.
For some configurations, threads may load or store inputs or outputs in any order, and cuFFT does not guarantee that the inputs or outputs handled by a given thread will be contiguous. These characteristics may vary with transform size, transform type (e.g. C2C vs C2R), number of dimensions, and GPU architecture. These variations may also change from one library version to the next.
This gives some additional clues that we ought not to expect a nice contiguous treatment of all the output data, in every case. And the indicated variability may depend on exact transform parameters, as well as CUFFT library version.
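As an aside, one way to sanity-check the "once and only once" guarantee independently of the printed index values is to count total store-callback invocations with a device-side atomic counter and compare the total against fft_sz*num_ffts. Here is a minimal sketch of that idea; the counter stor_cb_count, the callback name, and the host-side readback are my own additions, not part of the code above:

// Device-side invocation counter (zero at module load; reset it with
// cudaMemcpyToSymbol before re-running the transform).
__device__ unsigned long long stor_cb_count = 0;

static __device__ void stor_cb_counting(void *a, size_t index, cufftComplex z,
                                        void *cb_info, void *sharedmem) {
    atomicAdd(&stor_cb_count, 1ULL);  // count every invocation
    ((cufftComplex *)a)[index] = z;   // do the store as usual
}
__device__ cufftCallbackStoreC stor_cb_counting_ptr_d = stor_cb_counting;

// Host side, after cufftExecC2C(...) and cudaDeviceSynchronize():
//   unsigned long long n = 0;
//   cudaMemcpyFromSymbol(&n, stor_cb_count, sizeof(n));
//   printf("store callback invocations: %llu (expected %d)\n",
//          n, fft_sz * num_ffts);

If the documentation's guarantee holds, this count should equal fft_sz*num_ffts regardless of how many kernels the last phase uses.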
So let's get down to brass tacks. Is CUFFT calling the store callback more than once per output point? It is not. To prove this, let's modify your store callback as follows:
static __device__ void stor_cb(void *a, size_t index, cufftComplex z,
                               void *cb_info, void *sharedmem) {
    // Print the index. Each index should appear exactly once.
    //if (index < 20) printf("%8llu, %p, \n", index, a);
    cufftComplex temp = ((cufftComplex *)a)[index];
    temp.x++;
    ((cufftComplex *)a)[index] = temp;
    // Do the store
    //((cufftComplex *)a)[index] = z;
    if (index < 20) printf("%8llu, %p, %f\n", index, a, temp.x);
}
Instead of writing the expected output, this store callback will merely increment the output point it is given by 1. Furthermore, instead of just printing out the index value and possibly making incorrect assumptions, we will print out the index, plus the base address a, plus the actual value that we incremented. To make this all work, we need to pre-initialize the entire output array to zero:
cudaMalloc(&out_buf_d, fft_sz*num_ffts*sizeof(cufftComplex));
cudaMemset(out_buf_d, 0, fft_sz*num_ffts*sizeof(cufftComplex)); // add this
When I compile and run that modified code on CUDA 8, Linux, and a cc3.5 device (Tesla K20x), here is the output:
$ nvcc -arch=sm_35 -o t20 t20.cu -rdc=true -lcufft_static -lcudadevrt -lculibos
$ ./t20
0, 0x2305b5f800, 1.000000
1, 0x2305b5f800, 1.000000
2, 0x2305b5f800, 1.000000
3, 0x2305b5f800, 1.000000
4, 0x2305b5f800, 1.000000
5, 0x2305b5f800, 1.000000
6, 0x2305b5f800, 1.000000
7, 0x2305b5f800, 1.000000
8, 0x2305b5f800, 1.000000
9, 0x2305b5f800, 1.000000
10, 0x2305b5f800, 1.000000
11, 0x2305b5f800, 1.000000
12, 0x2305b5f800, 1.000000
13, 0x2305b5f800, 1.000000
14, 0x2305b5f800, 1.000000
15, 0x2305b5f800, 1.000000
16, 0x2305b5f800, 1.000000
17, 0x2305b5f800, 1.000000
18, 0x2305b5f800, 1.000000
19, 0x2305b5f800, 1.000000
0, 0x2305b7d800, 1.000000
1, 0x2305b7d800, 1.000000
2, 0x2305b7d800, 1.000000
3, 0x2305b7d800, 1.000000
4, 0x2305b7d800, 1.000000
5, 0x2305b7d800, 1.000000
6, 0x2305b7d800, 1.000000
7, 0x2305b7d800, 1.000000
8, 0x2305b7d800, 1.000000
9, 0x2305b7d800, 1.000000
10, 0x2305b7d800, 1.000000
11, 0x2305b7d800, 1.000000
12, 0x2305b7d800, 1.000000
13, 0x2305b7d800, 1.000000
14, 0x2305b7d800, 1.000000
15, 0x2305b7d800, 1.000000
16, 0x2305b7d800, 1.000000
17, 0x2305b7d800, 1.000000
18, 0x2305b7d800, 1.000000
19, 0x2305b7d800, 1.000000
$
What we see is: the index values are repeated, but the base address (pointer) is different for each repeated case. Therefore, even though an index value is repeated, each output point is only written to once. I think it's likely that this particular output pattern arises from 2 separate kernel calls in the final phase of the transformation. Some further evidence of this could possibly be gained from the profiler.
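If you want further numerical confirmation that each output element is touched exactly once, one option (again my own sketch, not anything prescribed by the CUFFT documentation) is to hand the base pointer of the full output buffer, out_buf_d, to the callback through the callerInfo argument of cufftXtSetCallback, and reconstruct the absolute element offset inside the callback:

static __device__ void stor_cb_global(void *a, size_t index, cufftComplex z,
                                      void *cb_info, void *sharedmem) {
    // cb_info is assumed to carry the base pointer of the whole output buffer
    // (out_buf_d), registered on the host roughly like this:
    //   void *caller_info[] = { (void *)out_buf_d };
    //   cufftXtSetCallback(plan, (void **)&stor_cb_ptr_h,
    //                      CUFFT_CB_ST_COMPLEX, caller_info);
    cufftComplex *out_base = (cufftComplex *)cb_info;

    // Absolute position of this output element within out_buf_d. This should
    // be unique across all invocations, even when 'index' itself repeats,
    // because each repeated 'index' arrives with a different base address 'a'.
    size_t global_idx = ((cufftComplex *)a - out_base) + index;

    if (global_idx < 20) printf("%8llu\n", (unsigned long long)global_idx);
    ((cufftComplex *)a)[index] = z;
}

Consistent with that, the two base addresses in the output above differ by 0x1e000 bytes, i.e. 15360 cufftComplex elements, so index 0 from the second kernel refers to global element 15360, not to a second write of element 0.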
As I mentioned at the beginning, I see different behavior when using CUDA 9 instead of CUDA 8 for this test case (only one set of output indices, from 0 to 19, is printed out). However, this possibility (variation in behavior from library version to library version) is also accounted for in the documentation, as previously cited.
Anticipating a follow-up question: "But if the index value is not unique, and I want to apply some transformation to the output that varies based on index, what should I do?"
I think the assumption here is that any transformation you would intend to apply to the output of a batched transform should only depend on the index position within the batch. Under that assumption, my expectation is that the multi-kernel duplication of indices will always occur on a batch boundary, and that an appropriate transform could therefore be applied by doing a modulo-batch-size operation on the index value passed to the callback routine (see the sketch below).
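For example (a sketch of my expectation above, not a documented CUFFT recipe), a store callback that scales each output sample by a factor depending only on its position within its own FFT could receive fft_sz through the callerInfo pointer and take the index modulo fft_sz:

static __device__ void stor_cb_scale(void *a, size_t index, cufftComplex z,
                                     void *cb_info, void *sharedmem) {
    // fft_sz is assumed to live in device memory and to have been registered
    // as the callerInfo pointer when calling cufftXtSetCallback.
    int fft_sz = *(const int *)cb_info;

    // Position of this sample within its own FFT. Per the expectation above,
    // this is well defined even if the multi-kernel split repeats 'index',
    // provided the split happens on batch boundaries.
    size_t pos_in_batch = index % fft_sz;

    // Example per-position transformation: scale by 1/(pos_in_batch + 1).
    float scale = 1.0f / (float)(pos_in_batch + 1);
    z.x *= scale;
    z.y *= scale;

    ((cufftComplex *)a)[index] = z;
}

Passing fft_sz through callerInfo (rather than hard-coding it) keeps such a callback usable across different transform sizes.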
I advance this expectation without proof, nor have I attempted to confirm it against the documentation, but it is the only implementation that makes sense to me, given the observations already covered. A takeaway is that if you had a transformation you wished to apply that varied from batch to batch, a callback may not be the way to do it. However, as I mentioned already, things seem to have changed in CUDA 9. If any of this is a concern to you, feel free to file an RFE (bug report) with the desired/expected behavior (and/or a documentation update request) at http://developer.nvidia.com, keeping in mind that your expected behavior may already be implemented in CUDA 9.