Which method is MATLAB/Octave using for the conv2 function? Is it the direct method, or an FFT-based one?
I'm looking for the fastest method to do conv2, since I'm going to write C code for it.
conv2 implements the direct method (i.e. 4 nested loops). This is not O(n^4); it is O(nk), with n the number of pixels in the image and k the number of non-zero pixels in the kernel.
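Since you mention writing C code: below is a minimal C sketch of what the direct method looks like ('same' output size, zeros assumed outside the image, as conv2 does). The function name and layout (row-major, doubles) are my own choices, not how MATLAB actually implements it.

```c
#include <assert.h>

/* Direct 2D convolution sketch: 4 nested loops.
   Each output pixel visits the k kernel weights once -> O(n*k) total,
   with n = rows*cols image pixels. Odd kernel sizes assumed. */
static void conv2_direct(const double *img, int rows, int cols,
                         const double *ker, int krows, int kcols,
                         double *out) {
    int kr0 = krows / 2, kc0 = kcols / 2;  /* kernel center */
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            double acc = 0.0;
            for (int i = 0; i < krows; ++i) {
                for (int j = 0; j < kcols; ++j) {
                    /* true convolution: the kernel is flipped */
                    int rr = r + kr0 - i, cc = c + kc0 - j;
                    if (rr >= 0 && rr < rows && cc >= 0 && cc < cols)
                        acc += img[rr * cols + cc] * ker[i * kcols + j];
                }
            }
            out[r * cols + c] = acc;
        }
    }
}
```

A real implementation would handle the image borders in separate loops so the hot inner loop carries no bounds check, and would skip zero kernel weights.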
The FFT implementation is O(n log n), independent of k, but the constant is quite large, so you need a large k to make this more efficient than the direct implementation.
You can simply test this by comparing the time it takes to compute fft2(img) vs conv2(img,ones(3,3)):
img = ones(1024,1024);
k = ones(3,3);
timeit(@()fft2(img))
timeit(@()conv2(img,k,'same'))
ans =
0.0059
ans =
0.0015
Computing the convolution through the FFT requires three FFTs (two forward fft2 calls and one inverse ifft2), plus a complex-valued multiplication, so it would be more than 12 times slower than the direct implementation in this case.
You can also see that the conv2 time increases as the kernel size increases:
>> k = ones(7,7); timeit(@()conv2(img,k,'same'))
ans =
0.0066
>> k = ones(19,19); timeit(@()conv2(img,k,'same'))
ans =
0.0152
This would not be the case if it were to use the FFT convolution.
Note that MATLAB's implementation of the convolution is extremely efficient; it is hard to match it with your own C++ code. Still, even with a naive implementation, you don't want to use the FFT unless your kernel is quite large. If you do want to implement this yourself, see this other answer of mine for some tips on how to implement convolution efficiently.
If you use the FFT to implement the convolution, you need to either accept the periodic boundary condition or pad the image sufficiently to avoid boundary effects. Though you need this padding in the direct method as well, if you don't want to assume zeros outside the image, as conv2 does (which is not at all useful).
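As a sketch of that padding step in C (my own function name and conventions): zero-pad the image up to the 'full' convolution size, (rows+krows-1)-by-(cols+kcols-1), so the periodic boundary imposed by the FFT never wraps real image data into itself.

```c
#include <stdlib.h>
#include <string.h>

/* Zero-pad an image to the 'full' convolution size before taking the
   FFT. The image goes into the top-left corner; the rest stays zero.
   Returns a malloc'd buffer (caller frees), or NULL on failure. */
static double *pad_for_fft_conv(const double *img, int rows, int cols,
                                int krows, int kcols,
                                int *prows, int *pcols) {
    *prows = rows + krows - 1;
    *pcols = cols + kcols - 1;
    double *buf = calloc((size_t)(*prows) * (size_t)(*pcols), sizeof(double));
    if (!buf) return NULL;
    for (int r = 0; r < rows; ++r)  /* copy image row by row */
        memcpy(buf + (size_t)r * (size_t)(*pcols),
               img + (size_t)r * (size_t)cols,
               sizeof(double) * (size_t)cols);
    return buf;
}
```

In practice you'd also round the padded sizes up to something your FFT library transforms quickly (e.g. products of small primes), which I've left out here.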
For some kernels, there are much more efficient implementations than either the FFT or the conv2 method:
If the kernel is separable, you can apply 1D convolutions along the image lines, then along the image columns, to get to the final result with strongly reduced computational effort (if the kernel is n-by-m, you do n+m instead of n*m multiplications and additions per pixel).
If the kernel has uniform weights (as in the example I used above) then you can apply it with a cost independent of the number of pixels in the kernel. As you shift the kernel over, you subtract pixels exiting the kernel on the left, and add pixels entering the kernel on the right.
If the kernel is a Gaussian or a Gabor kernel, you can also use well-known IIR implementations (which have a cost independent of the size of the kernel). IIR = infinite-impulse response.
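To illustrate the uniform-weights trick from the list above (combined with separability, since a uniform 2D kernel is separable into two uniform 1D kernels), here is a hedged C sketch; the function names and zero-padding choice are mine:

```c
#include <stdlib.h>

/* Running-sum box filter along one line ('same' size, zeros outside).
   As the window shifts, one sample enters and one leaves, so the cost
   per output sample is ~2 additions, independent of the size k (odd). */
static void box1d(const double *in, int n, int k, double *out) {
    int h = k / 2;
    double sum = 0.0;
    for (int i = 0; i < h && i < n; ++i) sum += in[i]; /* prime window */
    for (int i = 0; i < n; ++i) {
        if (i + h < n) sum += in[i + h];  /* sample entering on the right */
        out[i] = sum;
        if (i - h >= 0) sum -= in[i - h]; /* sample leaving on the left */
    }
}

/* 2D uniform (box) filter: rows first, then columns.
   Equivalent to conv2(img, ones(k,k), 'same'), at ~4 additions per
   pixel regardless of k. Returns 0 on success, -1 on allocation failure. */
static int box2d(const double *img, int rows, int cols, int k, double *out) {
    double *tmp  = malloc(sizeof(double) * (size_t)rows * (size_t)cols);
    double *line = malloc(sizeof(double) * (size_t)rows);
    double *lout = malloc(sizeof(double) * (size_t)rows);
    if (!tmp || !line || !lout) { free(tmp); free(line); free(lout); return -1; }
    for (int r = 0; r < rows; ++r)            /* filter each row */
        box1d(img + (size_t)r * cols, cols, k, tmp + (size_t)r * cols);
    for (int c = 0; c < cols; ++c) {          /* then each column */
        for (int r = 0; r < rows; ++r) line[r] = tmp[(size_t)r * cols + c];
        box1d(line, rows, k, lout);
        for (int r = 0; r < rows; ++r) out[(size_t)r * cols + c] = lout[r];
    }
    free(tmp); free(line); free(lout);
    return 0;
}
```

A cache-friendlier version would filter the columns in strided fashion rather than copying each one out, but the arithmetic savings are the point here.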
DIPlib has three implementations of the Gaussian filter: FIR (the direct method, as a series of 1D filters along each axis), IIR, and FT (using the FFT). Here's a comparison of their timing for a 2048x2048 image, for different sigma (using the DIPimage toolbox that directly calls the DIPlib functionality):
img = randn(2048, 2048, 'single');
sigma = [1, 2, 3, 5, 7, 10, 15, 20];
t = zeros(numel(sigma), 3);
for ii = 1:numel(sigma)
t(ii, 1) = timeit(@()gaussf(img, sigma(ii), 'fir'));
t(ii, 2) = timeit(@()gaussf(img, sigma(ii), 'iir'));
t(ii, 3) = timeit(@()gaussf(img, sigma(ii), 'ft'));
end
clf
set(gca, 'FontSize', 12)
plot(sigma, t * 1000, '.-', 'LineWidth', 1)
legend({'FIR', 'IIR', 'FT'})
xticks(sigma)
xlabel('Sigma of Gaussian (kernel is 2\lceil 3\sigma\rceil + 1)')
ylabel('Time (ms)')
[Plot: time (ms) vs sigma for the FIR, IIR and FT implementations]
The kernel size for the FIR implementation is k = 2*ceil(3*sigma)+1. Note that this implementation exploits the symmetry of the Gaussian kernel: instead of k multiplications and additions per output sample, it does k additions and k/2 multiplications. But other than that the code is fairly straightforward (e.g. it's standard-compliant C++, no SIMD instructions are explicitly used, though the compiler should be using those when appropriate).

The Sobel filter is 3x3 but has only 6 non-zero weights. 6 multiplications and additions are very cheap to apply; it is never, ever cheaper to compute an FFT, let alone three. Note that traversing the image is the most expensive part here (it requires fetching the data from memory, which takes more time than 6 multiplications and additions). If you want to compute both derivatives, it is worthwhile to combine them into the same loop over the image, to best use the cache.