It looks like my application is becoming (i)FFT-bound: it performs a lot of 2D correlations on rectangles with an average size of about 500x200 (width and height are always even). The scenario is the usual one: two FFTs (one per field), a multiplication of the complex fields, then one iFFT.
On the CPU (Intel Q6600, JTransforms library) the FFT transforms take about 70% of the time according to the profiler; on the GPU (GTX670, cuFFT library) they take about 50% (so CUDA gives some performance increase, but not as much as I want). I realize the GPU may not be fully saturated (bandwidth limited), but on the other hand, doing the calculations in batches would significantly increase application complexity.
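For reference, the scenario above can be sketched in plain Python. This is only an illustration under assumptions: a naive DFT stands in for JTransforms/cuFFT, the inputs are taken to be real, the correlation is circular, and the complex conjugate on the first spectrum is my assumption about what "multiply complex fields" means for a correlation. The helper names (`dft2`, `correlate2d_fft`, `correlate2d_direct`) are made up for this sketch.

```python
import cmath

def dft2(a, inverse=False):
    """Naive 2D DFT of a list of lists (illustration only, not an FFT)."""
    rows, cols = len(a), len(a[0])
    sign = 1 if inverse else -1

    def dft1(v):
        n = len(v)
        return [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
                    for k in range(n))
                for m in range(n)]

    tmp = [dft1(row) for row in a]  # transform every row
    # then transform every column of the row-transformed data
    col_t = [dft1([tmp[r][c] for r in range(rows)]) for c in range(cols)]
    out = [[col_t[c][r] for c in range(cols)] for r in range(rows)]
    if inverse:
        # Normalize here; cuFFT itself leaves its transforms unnormalized.
        out = [[x / (rows * cols) for x in row] for row in out]
    return out

def correlate2d_fft(f, g):
    """Circular cross-correlation of real fields: iDFT(conj(DFT(f)) * DFT(g))."""
    F, G = dft2(f), dft2(g)  # two forward transforms, one per field
    prod = [[F[r][c].conjugate() * G[r][c] for c in range(len(f[0]))]
            for r in range(len(f))]  # pointwise multiply of the spectra
    return [[x.real for x in row] for row in dft2(prod, inverse=True)]

def correlate2d_direct(f, g):
    """Same quantity computed by the definition, used as a cross-check."""
    rows, cols = len(f), len(f[0])
    return [[sum(f[r][c] * g[(r + dr) % rows][(c + dc) % cols]
                 for r in range(rows) for c in range(cols))
             for dc in range(cols)]
            for dr in range(rows)]
```

The direct version is there only to check the transform-based one on small inputs; it is far too slow for 500x200 rectangles.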
Questions:
I'm answering your first question: what can I do further to decrease the time spent by cuFFT?
Quoting the CUFFT LIBRARY USER'S GUIDE:
- Restrict the size along all dimensions to be representable as 2^a*3^b*5^c*7^d. The CUFFT library has highly optimized kernels for transforms whose dimensions have these prime factors.
- Restrict the size along each dimension to use fewer distinct prime factors. For example, a transform of size 3^n will usually be faster than one of size 2^i*3^j even if the latter is slightly smaller.
- Restrict the power-of-two factorization term of the x dimension to be a multiple of either 256 for single-precision transforms or 64 for double-precision transforms. This further aids with memory coalescing.
- Restrict the x dimension of single-precision transforms to be strictly a power of two either between 2 and 8192 for Fermi-class, Kepler-class, and more recent GPUs or between 2 and 2048 for earlier architectures. These transforms are implemented as specialized hand-coded kernels that keep all intermediate results in shared memory.
- Use native compatibility mode for in-place complex-to-real or real-to-complex transforms. This scheme reduces the write/read of padding bytes hence helping with coalescing of the data.
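To make the size guidelines above concrete, here is a small sketch in plain Python that checks whether a dimension has only the prime factors 2, 3, 5 and 7, and finds the next such size (or the next power of two) to pad to. The function names are hypothetical helpers, not part of cuFFT, and whether zero-padding is acceptable for your correlations is something only you can decide.

```python
def is_smooth(n, primes=(2, 3, 5, 7)):
    """True if n factors completely into the given primes."""
    if n < 1:
        return False
    for p in primes:
        while n % p == 0:
            n //= p
    return n == 1

def next_smooth(n, primes=(2, 3, 5, 7)):
    """Smallest m >= n of the form 2^a*3^b*5^c*7^d (for the default primes)."""
    m = max(n, 1)
    while not is_smooth(m, primes):
        m += 1
    return m

def next_pow2(n):
    """Smallest power of two >= n."""
    p = 1
    while p < n:
        p *= 2
    return p

# Candidate padded sizes for an average 500x200 rectangle.
width, height = 500, 200
smooth_size = (next_smooth(width), next_smooth(height))
pow2_size = (next_pow2(width), next_pow2(height))
```

Since 500 = 2^2*5^3 and 200 = 2^3*5^2, the average size already satisfies the first guideline; the power-of-two candidates would be 512 and 256. Which of the two is faster on a given GPU is something to measure rather than assume.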
Starting with version 3.1 of the CUFFT Library, the conjugate symmetry property of real-to-complex output data arrays and complex-to-real input data arrays is exploited when the power-of-two factorization term of the x dimension is at least a multiple of 4. Large 1D sizes (powers-of-two larger than 65,536), 2D, and 3D transforms benefit the most from the performance optimizations in the implementation of real-to-complex or complex-to-real transforms.
Other things you can do (quoting Robert Crovella's answer to "running FFTW on GPU vs using CUFFT"):
cuFFT routines can be called by multiple host threads, so it is possible to make multiple calls into cufft for multiple independent transforms. It's unlikely you would see much speedup from this if the individual transforms are large enough to utilize the machine.
cufft also supports batched plans which is another way to execute multiple transforms "at once".
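A batched plan needs transforms of identical size, so one possible way to keep the added complexity small is to pad rectangles to a few common sizes and group them before calling into cuFFT. The sketch below shows only that grouping step in plain Python; `group_for_batching` and the power-of-two padding policy are assumptions for illustration, and the actual batched plan (for example one created with `cufftPlanMany`) is not shown.

```python
from collections import defaultdict

def next_pow2(n):
    """Smallest power of two >= n (one possible padding policy)."""
    p = 1
    while p < n:
        p *= 2
    return p

def group_for_batching(shapes, pad=next_pow2):
    """Map each padded (width, height) to the indices of rectangles sharing it.

    Each resulting group could then be processed with a single batched plan
    instead of one plan per rectangle.
    """
    batches = defaultdict(list)
    for idx, (w, h) in enumerate(shapes):
        batches[(pad(w), pad(h))].append(idx)
    return dict(batches)

# Hypothetical rectangle sizes, for illustration only.
shapes = [(500, 200), (120, 100), (480, 220), (100, 128)]
groups = group_for_batching(shapes)
```

With these example shapes, rectangles 0 and 2 land in a 512x256 group and rectangles 1 and 3 in a 128x128 group, so two batched plans would cover all four transforms.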
Please, note that: