speed up 2D correlation

It looks like my application starting to be (i)FFT-bounded, it doing a lot of 2D correlations for rectangles with average sizes about 500x200 (width and height always even). Scenario is as usual - do two FFT (one per field), multiply complex fields, then one iFFT.

So, on CPU (Intel Q6600, with JTransforms libraly) FFT-transformations eating about 70% of time according to profiler, on GPU (GTX670, cuFFT library) - about 50% (so, there is some performance increase on CUDA, but not what I want). I realize, that it's may be the case that GPU not fully saturated (bandwith limited), but from other case - doing calculation in batches will significantly increase application complexity.

Questions:

what I can do further to decrease time spent on FFT at least several times?
should I try FFTW library (at this moment I am not sure that it will give significant gain comparing to JTransforms) ?
are there any specialized hardware which can be plugged to PC for FFT-conversions ?

Solution

I'm answering your first question: what I can do further to decrease time spent by cuFFT?

Quoting the CUFFT LIBRARY USER'S GUIDE

Restrict the size along all dimensions to be representable as 2^a*3^b*5^c*7^d. The CUFFT library has highly optimized kernels for transforms whose dimensions have these prime factors.

Restrict the size along each dimension to use fewer distinct prime factors. For example, a transform of size 3^n will usually be faster than one of size 2^i*3^j even if the latter is slightly smaller.

Restrict the power-of-two factorization term of the x dimension to be a multiple of either 256 for single-precision transforms or 64 for double-precision transforms. This further aids with memory coalescing.

Restrict the x dimension of single-precision transforms to be strictly a power of two either between 2 and 8192 for Fermi-class, Kepler-class, and more recent GPUs or between 2 and 2048 for earlier architectures. These transforms are implemented as specialized hand-coded kernels that keep all intermediate results in shared memory.

Use native compatibility mode for in-place complex-to-real or real-to-complex transforms. This scheme reduces the write/read of padding bytes hence helping with coalescing of the data.

Starting with version 3.1 of the CUFFT Library, the conjugate symmetry property of real-to-complex output data arrays and complex-to-real input data arrays is exploited when the power-of-two factorization term of the x dimension is at least a multiple of 4. Large 1D sizes (powers-of-two larger than 65,536), 2D, and 3D transforms benefit the most from the performance optimizations in the implementation of real-to-complex or complex-to-real transforms.

Other things you can do are (Quoting Robert Crovella's answer to running FFTW on GPU vs using CUFFT):

cuFFT routines can be called by multiple host threads, so it is possible to make multiple calls into cufft for multiple independent transforms. It's unlikely you would see much speedup from this if the individual transforms are large enough to utilize the machine.
cufft also supports batched plans which is another way to execute multiple transforms "at once".

Please, note that:

cuFFT may be not be convenient as compared to an optimized sequential or multicore FFT if the dimensions of the transform are not enough large;
You can get a rough idea on the performance of cuFFT as compared to Intel MKL from CUDA Toolkit 4.0 Performance Report.