I am using a cufft store callback in a complex-to-complex, out-of-place, 1D, batch FFT (i.e. I am doing many 1D FFTs of the same size). From Section 2.9.4 of the documentation, I expect this callback to be called exactly once for each output. See in particular this quote, taken verbatim from the link:
cuFFT will call the load callback routine, for each point in the input, once and only once. Similarly it will call the store callback routine, for each point in the output, once and only once.
Nevertheless, I seem to have an example that contradicts this. In the code below, I expect to see each of the numbers 0-19 appear exactly once, corresponding to the store callback being called exactly once for each output sample. However, when I do 504 1D FFTs of size 32, the store callback gets called twice for each output!
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cufft.h>
#include <cufftXt.h>
// Very simple store callback: prints the index and does the store
static __device__ void stor_cb(void *a, size_t index, cufftComplex z,
                               void *cb_info, void *sharedmem) {
    // Print the index. Each index should appear exactly once.
    if (index < 20) printf("%8llu\n", index);

    // Do the store
    ((cufftComplex *)a)[index] = z;
}
__device__ cufftCallbackStoreC stor_cb_ptr_d = stor_cb;
int main() {
    size_t work_size;

    // With these parameters, the store callback is
    // called twice for each output
    int fft_sz = 32;     // Size of each FFT
    int num_ffts = 504;  // How many FFTs to do

    // With these parameters, the store callback is
    // called once for each output
    // int fft_sz = 1024;  // Size of each FFT
    // int num_ffts = 20;  // How many FFTs to do

    // Buffers
    cufftComplex *in_buf_h, *in_buf_d, *out_buf_d;

    // Allocate buffers on host and device
    in_buf_h = new cufftComplex[fft_sz*num_ffts];
    cudaMalloc(&in_buf_d, fft_sz*num_ffts*sizeof(cufftComplex));
    cudaMalloc(&out_buf_d, fft_sz*num_ffts*sizeof(cufftComplex));

    // Fill input buffer with zeros and copy to device
    memset(in_buf_h, 0, fft_sz*num_ffts*sizeof(cufftComplex));
    cudaMemcpy(in_buf_d, in_buf_h, fft_sz*num_ffts*sizeof(cufftComplex), cudaMemcpyHostToDevice);

    // Plan num_ffts of size fft_sz
    cufftHandle plan;
    cufftCreate(&plan);
    cufftMakePlan1d(plan, fft_sz, CUFFT_C2C, num_ffts, &work_size);

    // Associate the store callback with the plan
    cufftCallbackStoreC stor_cb_ptr_h;
    cudaMemcpyFromSymbol(&stor_cb_ptr_h, stor_cb_ptr_d, sizeof(stor_cb_ptr_h));
    cufftXtSetCallback(plan, (void **)&stor_cb_ptr_h, CUFFT_CB_ST_COMPLEX, 0);

    // Execute the plan. We don't actually care about values. The idea
    // is that the store callback should be called exactly once for
    // each of the fft_sz*num_ffts samples.
    cufftExecC2C(plan, in_buf_d, out_buf_d, CUFFT_FORWARD);

    // Sync the device to flush the output
    cudaDeviceSynchronize();

    return 0;
}
Example output for fft_sz=32, num_ffts=504:
$ stor_cb_tst
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
In contrast, if I do 20 FFTs of size 1024, then I get the expected behavior: the store callback is called exactly once for each output. Example output for fft_sz=1024, num_ffts=20:
$ stor_cb_tst
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Am I misunderstanding something, do I have a bug, or is this a problem with cufft?
I am running this on Linux Mint, with cuda V8.0.61, g++ 5.4.0, on a GeForce GTX 1080:
$ uname -a
Linux orpheus 4.4.0-53-generic #74-Ubuntu SMP Fri Dec 2 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
$ g++ --version
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1080"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 8114 MBytes (8507752448 bytes)
(20) Multiprocessors, (128) CUDA Cores/MP: 2560 CUDA Cores
GPU Max Clock rate: 1848 MHz (1.85 GHz)
Memory Clock rate: 5005 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1080
Result = PASS
Here is my compile command:
$ nvcc -ccbin g++ -dc -m64 -o stor_cb_tst.o -c stor_cb_tst.cu
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
$ nvcc -ccbin g++ -m64 -o stor_cb_tst stor_cb_tst.o -lcufft_static -lculibos
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
$ ./stor_cb_tst
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
I was able to reproduce the observation on CUDA 8, but not on CUDA 9. However, I don't think there is anything wrong here, even with CUDA 8. Let's start by reviewing the documentation more closely:
From CUFFT doc section 2.9.4:
Similarly it will call the store callback routine, for each point in the output, once and only once.
What you've assumed is that each point in the output has a corresponding unique value of index passed to the store callback routine; however, we shall see shortly that this is not the case.
it will only call the store callback routine from the last phase kernel(s).
So we see that the store callback routine may be called from multiple independent kernels (note the use of "kernel(s)") in the last phase of the transformation.
For some configurations, threads may load or store inputs or outputs in any order, and cuFFT does not guarantee that the inputs or outputs handled by a given thread will be contiguous. These characteristics may vary with transform size, transform type (e.g. C2C vs C2R), number of dimensions, and GPU architecture. These variations may also change from one library version to the next.
This gives some additional clues that we ought not to expect a nice contiguous treatment of all the output data, in every case. And the indicated variability may depend on exact transform parameters, as well as CUFFT library version.
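As an aside, one way to sanity-check the "once and only once" guarantee independently of the printed index values is to count total store-callback invocations with a device-side atomic counter and compare the total against fft_sz*num_ffts. Here is a minimal sketch of that idea; the counter stor_cb_count, the callback name, and the host-side readback are my own additions, not part of the code above:

// Device-side invocation counter (zero at module load; reset it with
// cudaMemcpyToSymbol before re-running the transform).
__device__ unsigned long long stor_cb_count = 0;

static __device__ void stor_cb_counting(void *a, size_t index, cufftComplex z,
                                        void *cb_info, void *sharedmem) {
    atomicAdd(&stor_cb_count, 1ULL);  // count every invocation
    ((cufftComplex *)a)[index] = z;   // do the store as usual
}
__device__ cufftCallbackStoreC stor_cb_counting_ptr_d = stor_cb_counting;

// Host side, after cufftExecC2C(...) and cudaDeviceSynchronize():
//   unsigned long long n = 0;
//   cudaMemcpyFromSymbol(&n, stor_cb_count, sizeof(n));
//   printf("store callback invocations: %llu (expected %d)\n",
//          n, fft_sz * num_ffts);

If the documentation's guarantee holds, this count should equal fft_sz*num_ffts regardless of how many kernels the last phase uses.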
So let's get down to brass tacks. Is CUFFT calling the store callback more than once per output point? It is not. To prove this, let's modify your store callback as follows:
static __device__ void stor_cb(void *a, size_t index, cufftComplex z,
                               void *cb_info, void *sharedmem) {
    // Print the index. Each index should appear exactly once.
    //if (index < 20) printf("%8llu, %p, \n", index, a);
    cufftComplex temp = ((cufftComplex *)a)[index];
    temp.x++;
    ((cufftComplex *)a)[index] = temp;
    // Do the store
    //((cufftComplex *)a)[index] = z;
    if (index < 20) printf("%8llu, %p, %f\n", index, a, temp.x);
}
Instead of writing the expected output, this store callback will merely increment the output point it is given by 1. Furthermore, instead of just printing out the index value and possibly making incorrect assumptions, we will print out the index, plus the base address a, plus the actual value that we incremented. To make this all work, we need to pre-initialize the entire output array to zero:
cudaMalloc(&out_buf_d, fft_sz*num_ffts*sizeof(cufftComplex));
cudaMemset(out_buf_d, 0, fft_sz*num_ffts*sizeof(cufftComplex)); // add this
When I compile and run that modified code on CUDA 8, Linux, and a cc3.5 device (Tesla K20x), here is the output:
$ nvcc -arch=sm_35 -o t20 t20.cu -rdc=true -lcufft_static -lcudadevrt -lculibos
$ ./t20
0, 0x2305b5f800, 1.000000
1, 0x2305b5f800, 1.000000
2, 0x2305b5f800, 1.000000
3, 0x2305b5f800, 1.000000
4, 0x2305b5f800, 1.000000
5, 0x2305b5f800, 1.000000
6, 0x2305b5f800, 1.000000
7, 0x2305b5f800, 1.000000
8, 0x2305b5f800, 1.000000
9, 0x2305b5f800, 1.000000
10, 0x2305b5f800, 1.000000
11, 0x2305b5f800, 1.000000
12, 0x2305b5f800, 1.000000
13, 0x2305b5f800, 1.000000
14, 0x2305b5f800, 1.000000
15, 0x2305b5f800, 1.000000
16, 0x2305b5f800, 1.000000
17, 0x2305b5f800, 1.000000
18, 0x2305b5f800, 1.000000
19, 0x2305b5f800, 1.000000
0, 0x2305b7d800, 1.000000
1, 0x2305b7d800, 1.000000
2, 0x2305b7d800, 1.000000
3, 0x2305b7d800, 1.000000
4, 0x2305b7d800, 1.000000
5, 0x2305b7d800, 1.000000
6, 0x2305b7d800, 1.000000
7, 0x2305b7d800, 1.000000
8, 0x2305b7d800, 1.000000
9, 0x2305b7d800, 1.000000
10, 0x2305b7d800, 1.000000
11, 0x2305b7d800, 1.000000
12, 0x2305b7d800, 1.000000
13, 0x2305b7d800, 1.000000
14, 0x2305b7d800, 1.000000
15, 0x2305b7d800, 1.000000
16, 0x2305b7d800, 1.000000
17, 0x2305b7d800, 1.000000
18, 0x2305b7d800, 1.000000
19, 0x2305b7d800, 1.000000
$
What we see is: the index values are repeated, but the base address (pointer) is different for each repeated case. Therefore, even though an index value is repeated, each output point is only written to once. I think it's likely that this particular output pattern arises from 2 separate kernel calls in the final phase of the transformation. Some further evidence of this could possibly be gained from the profiler.
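If you want further numerical confirmation that each output element is touched exactly once, one option (again my own sketch, not anything prescribed by the CUFFT documentation) is to hand the base pointer of the full output buffer, out_buf_d, to the callback through the callerInfo argument of cufftXtSetCallback, and reconstruct the absolute element offset inside the callback:

static __device__ void stor_cb_global(void *a, size_t index, cufftComplex z,
                                      void *cb_info, void *sharedmem) {
    // cb_info is assumed to carry the base pointer of the whole output buffer
    // (out_buf_d), registered on the host roughly like this:
    //   void *caller_info[] = { (void *)out_buf_d };
    //   cufftXtSetCallback(plan, (void **)&stor_cb_ptr_h,
    //                      CUFFT_CB_ST_COMPLEX, caller_info);
    cufftComplex *out_base = (cufftComplex *)cb_info;

    // Absolute position of this output element within out_buf_d. This should
    // be unique across all invocations, even when 'index' itself repeats,
    // because each repeated 'index' arrives with a different base address 'a'.
    size_t global_idx = ((cufftComplex *)a - out_base) + index;

    if (global_idx < 20) printf("%8llu\n", (unsigned long long)global_idx);
    ((cufftComplex *)a)[index] = z;
}

Consistent with that, the two base addresses in the output above differ by 0x1e000 bytes, i.e. 15360 cufftComplex elements, so index 0 from the second kernel refers to global element 15360, not to a second write of element 0.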
As I mentioned at the beginning, I see different behavior when using CUDA 9 instead of CUDA 8 for this test case (only one set of output indices, from 0 to 19, is printed out). However, this possibility (variation in behavior from library version to library version) is also accounted for in the documentation, as previously cited.
Anticipating a follow-up question: "But if the index value is not unique, and I want to apply some transformation to the output that varies based on index, what should I do?"
I think the assumption here is that any transformation you would intend to apply to the output of a batched transform should only depend on the index position within the batch. Under that assumption, my expectation is that the multi-kernel duplication of indices will always occur on a batch boundary, and that an appropriate transform could therefore be applied by doing a modulo-batch-size operation on the index value passed to the callback routine (see the sketch below).
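For example (a sketch of my expectation above, not a documented CUFFT recipe), a store callback that scales each output sample by a factor depending only on its position within its own FFT could receive fft_sz through the callerInfo pointer and take the index modulo fft_sz:

static __device__ void stor_cb_scale(void *a, size_t index, cufftComplex z,
                                     void *cb_info, void *sharedmem) {
    // fft_sz is assumed to live in device memory and to have been registered
    // as the callerInfo pointer when calling cufftXtSetCallback.
    int fft_sz = *(const int *)cb_info;

    // Position of this sample within its own FFT. Per the expectation above,
    // this is well defined even if the multi-kernel split repeats 'index',
    // provided the split happens on batch boundaries.
    size_t pos_in_batch = index % fft_sz;

    // Example per-position transformation: scale by 1/(pos_in_batch + 1).
    float scale = 1.0f / (float)(pos_in_batch + 1);
    z.x *= scale;
    z.y *= scale;

    ((cufftComplex *)a)[index] = z;
}

Passing fft_sz through callerInfo (rather than hard-coding it) keeps such a callback usable across different transform sizes.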
I advance this expectation without proof, nor have I attempted to confirm it against the documentation, but it is the only implementation that makes sense to me, given the observations already covered. A takeaway is that if you had a transformation you wished to apply that varied from batch to batch, a callback may not be the way to do it. However, as I mentioned already, things seem to have changed in CUDA 9. If any of this is a concern to you, feel free to file an RFE (bug report) with the desired/expected behavior (and/or a documentation update request) at http://developer.nvidia.com, keeping in mind that your expected behavior may already be implemented in CUDA 9.