c++ performance image-processing interpolation bicubic

How can I best improve the execution time of a bicubic interpolation algorithm?

I'm developing image processing software in C++ on Intel hardware that has to run a bicubic interpolation algorithm on small (about 1 kpx) images over and over again. This takes a lot of time, and I'm aiming to speed it up. So far I have three versions: a basic implementation based on the literature; a somewhat faster version that avoids matrix multiplication by using pre-calculated formulas for parts of the interpolating polynomial; and a fixed-point version of the matrix-multiplying code (which actually runs slower). I also have an external library with an optimized implementation, but it's still too slow for my needs. What I was considering next is:

vectorization using MMX/SSE stream processing, on both the floating-point and fixed-point versions
doing the interpolation in the Fourier domain using convolution
shifting the work onto a GPU using OpenCL or similar
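
To make the first option concrete, here is a minimal sketch of what an SSE version of the inner step might look like. It assumes single-channel float pixels, a 4x4 neighbourhood that is already in bounds, and weights computed elsewhere; the names `hsum` and `bicubic4x4SSE` are made up for illustration and do not come from any library.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Horizontal sum of the four lanes of v, using SSE1 instructions only.
inline float hsum(__m128 v) {
    __m128 hi = _mm_movehl_ps(v, v);       // (v2, v3, v2, v3)
    __m128 pair = _mm_add_ps(v, hi);       // (v0+v2, v1+v3, ...)
    __m128 lane1 = _mm_shuffle_ps(pair, pair, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_add_ss(pair, lane1));
}

// Interpolates one sample from a 4x4 neighbourhood.
// rows[j] points at 4 consecutive floats of neighbourhood row j;
// wx and wy hold the 4 horizontal and 4 vertical cubic weights.
inline float bicubic4x4SSE(const float* const rows[4],
                           const float wx[4], const float wy[4]) {
    assert(rows && wx && wy);
    __m128 acc = _mm_setzero_ps();
    for (int j = 0; j < 4; ++j) {
        // Blend the four rows vertically, four columns at a time.
        __m128 px = _mm_loadu_ps(rows[j]);
        acc = _mm_add_ps(acc, _mm_mul_ps(px, _mm_set1_ps(wy[j])));
    }
    // Apply the horizontal weights and reduce to a single value.
    return hsum(_mm_mul_ps(acc, _mm_loadu_ps(wx)));
}
```

This replaces 16 scalar multiply-adds with 5 vector multiplies, 4 vector adds, and a short horizontal reduction, so the achievable gain depends heavily on how much time the rest of the loop spends on address and weight computation.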

Which of these approaches could yield the greatest performance gains? Could you suggest another? Thanks.

Solution

I think a GPU is the way to go. This is probably one of the most natural tasks for that type of hardware. I would start by looking into CUDA or OpenCL. Older techniques such as plain DirectX/OpenGL pixel (fragment) shaders should work just fine as well.
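
The reason it maps so well is that every output pixel is computed independently from a 4x4 neighbourhood of the source. A rough CPU-side sketch of that structure is below; the body of `sampleBicubic` is roughly what would become the kernel or shader body, with one GPU thread per output pixel. Everything here is an assumption for illustration: a Catmull-Rom kernel (a = -0.5), single-channel float pixels, clamp-to-edge borders, a pixel-centre coordinate mapping, and made-up function names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Catmull-Rom weights (assumed kernel) for taps at offsets -1, 0, +1, +2;
// t is the fractional position in [0, 1).
inline void cubicWeights(float t, float w[4]) {
    const float t2 = t * t;
    const float t3 = t2 * t;
    w[0] = 0.5f * (-t3 + 2.0f * t2 - t);
    w[1] = 0.5f * (3.0f * t3 - 5.0f * t2 + 2.0f);
    w[2] = 0.5f * (-3.0f * t3 + 4.0f * t2 + t);
    w[3] = 0.5f * (t3 - t2);
}

inline int clampIndex(int i, int n) { return std::max(0, std::min(i, n - 1)); }

// Per-pixel work: depends only on the source image and (x, y).
float sampleBicubic(const std::vector<float>& src, int w, int h, float x, float y) {
    assert(w > 0 && h > 0 && src.size() == static_cast<std::size_t>(w) * h);
    const int ix = static_cast<int>(std::floor(x));
    const int iy = static_cast<int>(std::floor(y));
    float wx[4], wy[4];
    cubicWeights(x - ix, wx);
    cubicWeights(y - iy, wy);
    float sum = 0.0f;
    for (int j = 0; j < 4; ++j) {
        const int yy = clampIndex(iy - 1 + j, h);
        float row = 0.0f;
        for (int i = 0; i < 4; ++i)
            row += wx[i] * src[yy * w + clampIndex(ix - 1 + i, w)];
        sum += wy[j] * row;
    }
    return sum;
}

// On a GPU these two loops disappear: (ox, oy) comes from the thread index.
std::vector<float> resizeBicubic(const std::vector<float>& src,
                                 int sw, int sh, int dw, int dh) {
    assert(dw > 0 && dh > 0);
    std::vector<float> dst(static_cast<std::size_t>(dw) * dh);
    const float sx = static_cast<float>(sw) / dw;
    const float sy = static_cast<float>(sh) / dh;
    for (int oy = 0; oy < dh; ++oy)
        for (int ox = 0; ox < dw; ++ox)
            dst[oy * dw + ox] = sampleBicubic(src, sw, sh,
                                              (ox + 0.5f) * sx - 0.5f,
                                              (oy + 0.5f) * sy - 0.5f);
    return dst;
}
```

One caveat: for images of only about 1 kpx, the cost of transferring data to and from the GPU can outweigh the interpolation itself, so batching many images per transfer is likely to matter.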

Some links I found that may help: