
Which method is Matlab/Octave using for conv2 function?


Is it:

  • Fast Fourier Transform
  • 3 or more for-loops, i.e. classical iteration
  • Other?

I'm looking for the fastest method to do conv2. I'm going to write C code for it.


Solution

  • conv2 implements the direct method (i.e. 4 loops). This is not O(n^4); it is O(nk), with n the number of pixels in the image and k the number of non-zero pixels in the kernel.

    The FFT implementation is O(n log n), independent of k, but the constant is quite large, so you need a large k to make this more efficient than the direct implementation.

    You can simply test this by comparing the time it takes to compute fft2(img) vs conv2(img,ones(3,3)):

    img = ones(1024,1024);
    k = ones(3,3);
    timeit(@()fft2(img))
    timeit(@()conv2(img,k,'same'))
    
    ans =
        0.0059
    
    ans =
        0.0015
    

    Computing the convolution through the FFT requires three fft2 calls (two forward, one inverse), plus a complex-valued multiplication, so it would be more than 12 times slower than the direct implementation in this case.

    You can also see that conv2 time increases as the kernel size increases:

    >> k = ones(7,7); timeit(@()conv2(img,k,'same'))
    ans =
        0.0066
    
    >> k = ones(19,19); timeit(@()conv2(img,k,'same'))
    ans =
        0.0152
    

    This would not be the case if it were to use the FFT convolution.


    Note that MATLAB's implementation of the convolution is extremely efficient; it is hard to match it with your own C++ code. Still, even with a naive implementation, you don't want to use the FFT unless your kernel is quite large. If you do want to implement this yourself, see this other answer of mine for some tips on how to efficiently implement convolution.
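    As a starting point for such a C implementation, here is a minimal sketch of the direct method (four loops, 'same' output, zeros assumed outside the image, and the kernel flipped as in true convolution). The function name and the row-major float layout are my own choices, not taken from any MATLAB or Octave source:

```c
/* Sketch of the direct method: four nested loops, 'same' output,
 * zeros assumed outside the image (as conv2 does).
 * Names and layout (row-major float) are illustrative only. */
#include <stddef.h>

void conv2_direct(const float *img, size_t h, size_t w,
                  const float *ker, size_t kh, size_t kw,
                  float *out /* h x w, 'same' output */)
{
    ptrdiff_t ch = kh / 2, cw = kw / 2; /* kernel center */
    for (size_t y = 0; y < h; ++y) {
        for (size_t x = 0; x < w; ++x) {
            float acc = 0.0f;
            for (size_t ky = 0; ky < kh; ++ky) {
                for (size_t kx = 0; kx < kw; ++kx) {
                    /* true convolution: kernel is flipped */
                    ptrdiff_t iy = (ptrdiff_t)y + ch - (ptrdiff_t)ky;
                    ptrdiff_t ix = (ptrdiff_t)x + cw - (ptrdiff_t)kx;
                    if (iy >= 0 && iy < (ptrdiff_t)h &&
                        ix >= 0 && ix < (ptrdiff_t)w)
                        acc += img[iy * w + ix] * ker[ky * kw + kx];
                }
            }
            out[y * w + x] = acc;
        }
    }
}
```

    The inner boundary test is what a naive implementation pays for the zero padding; a more careful version would split the image into interior and border regions so the hot loop has no branches.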


    If you use the FFT to implement the convolution, you need to either accept the periodic boundary condition or pad the image sufficiently to avoid boundary effects. Note that you need this padding in the direct method as well, unless you are willing to assume zeros outside the image, as conv2 does (which is rarely useful).


    Alternatives

    For some kernels, there are much more efficient implementations than either the FFT or the conv2 method:

    • If the kernel is separable, you can apply 1D convolutions along the image lines, then along the image columns, to get the final result with strongly reduced computational effort (if the kernel is n-by-m, you'd do n+m instead of n*m multiplications and additions per pixel).

    • If the kernel has uniform weights (as in the example I used above) then you can apply it with a cost independent of the number of pixels in the kernel. As you shift the kernel over, you subtract pixels exiting the kernel on the left, and add pixels entering the kernel on the right.

    • If the kernel is a Gaussian or a Gabor kernel, you can also use well-known IIR implementations (which have a cost independent of the size of the kernel). IIR = infinite-impulse response.
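    The separable case can be sketched in C as two 1D passes. Here a 3x3 box kernel is factored into a column vector times a row vector; the helper names and the zero boundary condition are my own assumptions:

```c
/* Sketch of separable convolution: an n-by-m kernel that factors into
 * a column vector times a row vector is applied as two 1D passes.
 * Zero boundary, 'same' output; names are illustrative only. */
#include <stddef.h>

/* horizontal pass: 1D convolution along each image line */
static void conv1d_rows(const float *in, size_t h, size_t w,
                        const float *k, size_t klen, float *out)
{
    ptrdiff_t c = klen / 2;
    for (size_t y = 0; y < h; ++y)
        for (size_t x = 0; x < w; ++x) {
            float acc = 0.0f;
            for (size_t i = 0; i < klen; ++i) {
                ptrdiff_t ix = (ptrdiff_t)x + c - (ptrdiff_t)i;
                if (ix >= 0 && ix < (ptrdiff_t)w)
                    acc += in[y * w + ix] * k[i];
            }
            out[y * w + x] = acc;
        }
}

/* vertical pass: the same 1D convolution along each image column */
static void conv1d_cols(const float *in, size_t h, size_t w,
                        const float *k, size_t klen, float *out)
{
    ptrdiff_t c = klen / 2;
    for (size_t y = 0; y < h; ++y)
        for (size_t x = 0; x < w; ++x) {
            float acc = 0.0f;
            for (size_t i = 0; i < klen; ++i) {
                ptrdiff_t iy = (ptrdiff_t)y + c - (ptrdiff_t)i;
                if (iy >= 0 && iy < (ptrdiff_t)h)
                    acc += in[iy * w + x] * k[i];
            }
            out[y * w + x] = acc;
        }
}
```

    Chaining the two passes (rows into a temporary buffer, then columns) gives the same result as the full 2D convolution with the outer-product kernel.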
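    The uniform-weight (running sum) trick can be sketched along one row like this; the per-pixel cost is constant regardless of kernel length. The function name and the zero boundary condition are my own assumptions:

```c
/* Sketch of the running-sum trick for a uniform (box) kernel along
 * one row: each output is the previous output minus the pixel leaving
 * the window plus the pixel entering it, so the cost per pixel is O(1)
 * regardless of kernel length. Zero boundary, 'same' output. */
#include <stddef.h>

void box1d_running(const float *in, size_t n, size_t klen, float *out)
{
    ptrdiff_t c = klen / 2;
    float sum = 0.0f;
    /* initial window for output index 0 covers in[-c .. klen-1-c] */
    for (ptrdiff_t i = -c; i < (ptrdiff_t)klen - c; ++i)
        if (i >= 0 && i < (ptrdiff_t)n) sum += in[i];
    out[0] = sum;
    for (size_t x = 1; x < n; ++x) {
        ptrdiff_t leave = (ptrdiff_t)x - 1 - c;                      /* exits on the left */
        ptrdiff_t enter = (ptrdiff_t)x + ((ptrdiff_t)klen - 1 - c);  /* enters on the right */
        if (leave >= 0 && leave < (ptrdiff_t)n) sum -= in[leave];
        if (enter >= 0 && enter < (ptrdiff_t)n) sum += in[enter];
        out[x] = sum;
    }
}
```

    For a 2D box filter you would run this along the rows and then along the columns, exactly as in the separable case.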


    Gaussian filters

    DIPlib has three implementations of the Gaussian filter: FIR (the direct method, as a series of 1D filters along each axis), IIR, and FT (using the FFT). Here's a comparison of their timing for a 2048x2048 image, for different sigma (using the DIPimage toolbox that directly calls the DIPlib functionality):

    img = randn(2048, 2048, 'single');
    sigma = [1, 2, 3, 5, 7, 10, 15, 20];
    t = zeros(numel(sigma), 3);
    for ii = 1:numel(sigma)
        t(ii, 1) = timeit(@()gaussf(img, sigma(ii), 'fir'));
        t(ii, 2) = timeit(@()gaussf(img, sigma(ii), 'iir'));
        t(ii, 3) = timeit(@()gaussf(img, sigma(ii), 'ft'));
    end
    
    clf
    set(gca, 'FontSize', 12)
    plot(sigma, t * 1000, '.-', 'LineWidth', 1)
    legend({'FIR', 'IIR', 'FT'})
    xticks(sigma)
    xlabel('Sigma of Gaussian (kernel is 2\lceil 3\sigma\rceil + 1)')
    ylabel('Time (ms)')
    

    [Plot: timing comparisons for the three Gaussian implementations in DIPlib]

    • FIR:
      • The 1D kernel size is 2*ceil(3*sigma)+1.
      • This implementation takes advantage of the fact that the kernel is symmetric: instead of doing k multiplications and additions, it does k additions and k/2 multiplications. Other than that, the code is fairly straightforward (it's standard-compliant C++; no SIMD instructions are explicitly used, though the compiler should use those where appropriate).
      • The timing is not linear in the kernel size, as you might expect, because when applying the filter along the columns the cache is used less optimally. For larger sigma, memory access becomes a relatively larger part of the cost.
    • IIR:
      • This implementation is quite old, from 1998, plain C.
      • The IIR filter is a 1D filter applied along each axis in turn, like the FIR filter.
    • FT:
      • For the FFT this used PocketFFT (I can build DIPlib with FFTW, but on my machine, an M1 Mac, that doesn't make much of a difference because FFTW doesn't have code specific to the M1 chip).
      • It uses only two FFTs: one forward for the image and one backward for the result. The kernel itself is not transformed but generated directly in the frequency domain. Since the Gaussian is separable, you can generate the 1D kernel and multiply it along both the columns and the rows of the FFT.
      • This method is slower for smaller sigma because the kernel is generated in the frequency domain, where a larger spatial sigma corresponds to a smaller kernel (more values that are effectively zero).
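    The symmetry trick mentioned for the FIR implementation can be sketched in C like this (interior pixels only, boundary handling omitted; this is my own illustration, not DIPlib's actual code):

```c
/* Sketch of the symmetric-kernel trick: for a 1D kernel with
 * k[i] == k[-i], pairs of input samples are added first and
 * multiplied once, roughly halving the multiplications.
 * Interior pixels only; boundary handling omitted. */
#include <stddef.h>

void fir1d_symmetric(const float *in, size_t n,
                     const float *k, size_t r, /* k[0..r], k[0] is the center tap */
                     float *out)
{
    for (size_t x = r; x + r < n; ++x) {
        float acc = k[0] * in[x];
        for (size_t i = 1; i <= r; ++i)
            acc += k[i] * (in[x - i] + in[x + i]); /* one multiply per pair */
        out[x] = acc;
    }
}
```

    With 2r+1 taps this does 2r additions in the pairing plus r+1 multiplications, instead of 2r+1 of each.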

    Sobel filters

    The Sobel filter is 3x3 but has only 6 non-zero weights. Six multiplications and additions are very cheap to apply; it is never, ever cheaper to compute an FFT, let alone three. Note that traversing the image is the most expensive part here (fetching the data from memory takes more time than the 6 multiplications and additions). If you want to compute both derivatives, it is worthwhile to combine them into the same loop over the image, to make best use of the cache.
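    That combined loop can be sketched in C as follows (interior pixels only; the function name is my own, and the zero-weight taps are simply skipped so each derivative costs 6 multiply-adds):

```c
/* Sketch of computing both Sobel derivatives in one pass over the
 * image, so the memory-bound traversal cost is paid once.
 * Interior pixels only; row-major float layout assumed. */
#include <stddef.h>

void sobel_both(const float *img, size_t h, size_t w,
                float *gx, float *gy)
{
    for (size_t y = 1; y + 1 < h; ++y) {
        for (size_t x = 1; x + 1 < w; ++x) {
            const float *r0 = img + (y - 1) * w + x; /* row above */
            const float *r1 = img + y * w + x;       /* current row */
            const float *r2 = img + (y + 1) * w + x; /* row below */
            /* horizontal derivative: [-1 0 1; -2 0 2; -1 0 1] */
            gx[y * w + x] = (r0[1] - r0[-1]) + 2.0f * (r1[1] - r1[-1])
                          + (r2[1] - r2[-1]);
            /* vertical derivative: [-1 -2 -1; 0 0 0; 1 2 1] */
            gy[y * w + x] = (r2[-1] - r0[-1]) + 2.0f * (r2[0] - r0[0])
                          + (r2[1] - r0[1]);
        }
    }
}
```

    Each input pixel is fetched once per row triple and used for both outputs, rather than being read again in a second pass.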