Tags: c++, computer-vision, sse, edge-detection, canny-operator

Non-maximum suppression in Canny's algorithm: optimize with SSE


I have the "honor" of improving the runtime of someone else's code (it's the non-maximum suppression step of the Canny algorithm). My first thought was to use SSE intrinsics, but I'm very new to this area, so my question is:

Is there any chance to do this? And if so, can someone give me a few hints?

void vNonMaximumSupression(
    float* fpDst,
    float const* const fpMagnitude,
    unsigned char const* const ucpGradient,   ///< [in] 0 -> 0°, 1 -> 45°, 2 -> 90°, 3 -> 135°
    int iXCount,     // image width in pixels
    int iXOffset,    // row stride in elements
    int iYCount,     // image height in pixels
    int ignoreX,
    int ignoreY)
{
    memset(fpDst, 0, sizeof(fpDst[0]) * iYCount * iXOffset);

    for (int y = ignoreY; y < iYCount - ignoreY; ++y)
    {
        for (int x = ignoreX; x < iXCount - ignoreX; ++x)
        {
            int idx = iXOffset * y + x;
            unsigned char dir = ucpGradient[idx];
            float fMag = fpMagnitude[idx];

            if ((dir == 0 && fpMagnitude[idx - 1]            < fMag && fMag > fpMagnitude[idx + 1]) ||
                (dir == 1 && fpMagnitude[idx - iXOffset + 1] < fMag && fMag > fpMagnitude[idx + iXOffset - 1]) ||
                (dir == 2 && fpMagnitude[idx - iXOffset]     < fMag && fMag > fpMagnitude[idx + iXOffset]) ||
                (dir == 3 && fpMagnitude[idx - iXOffset - 1] < fMag && fMag > fpMagnitude[idx + iXOffset + 1]))
                fpDst[idx] = fMag;
            else
                fpDst[idx] = 0;
        }
    }
}

Solution

  • Discussion

    As @harold noted, the main problem for vectorization here is that the algorithm uses a different offset for each pixel (specified by the direction matrix). I can think of several potential ways of vectorizing it:

    1. @nikie: evaluate all branches at once, i.e. compare each pixel with all of its neighbors. The results of these comparisons are blended according to the direction values.
    2. @PeterCordes: load a lot of pixels into SSE registers, then use _mm_shuffle_epi8 to choose only the neighbors in the given direction. Then perform two vectorized comparisons.
    3. (me): use scalar instructions to load the proper two neighboring pixels along the direction. Then combine these values for four pixels into SSE registers. Finally, do two comparisons in SSE.

    The second approach is very hard to implement efficiently, because for a pack of 4 pixels there are 18 neighboring pixels to choose from. That would require too many shuffles, I think.

    The first approach looks nice, but it would perform four times more operations per pixel. I suspect the speedup from vector instructions would be outweighed by the extra computation.

    I suggest using the third approach. Below you can see hints on improving performance.

    Hybrid approach

    First of all, we want to make the scalar code as fast as possible. The code you presented contains too many branches. Most of them are not predictable, for instance the branch on direction.

    In order to remove branches, I suggest creating an array delta = {1, stride - 1, stride, stride + 1}, which gives the index offset for each direction. Using this array, you can find the indices of the neighboring pixels to compare with (without branches). Then you do the two comparisons. Finally, you can write a ternary operator like res = (isMax ? curr : 0);, hoping that the compiler generates branchless code for it.
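A minimal sketch of that branchless scalar version (the function name and parameter names are my own; the delta table is exactly the one described above):

```cpp
#include <cassert>
#include <cstring>

// Branchless scalar NMS: the delta table maps direction -> index offset,
// so the two neighbors along the gradient are found without a switch.
void NonMaxSuppressBranchless(float* dst, const float* mag,
                              const unsigned char* dir,
                              int width, int stride, int height,
                              int ignoreX, int ignoreY)
{
    std::memset(dst, 0, sizeof(float) * stride * height);
    const int delta[4] = { 1, stride - 1, stride, stride + 1 };
    for (int y = ignoreY; y < height - ignoreY; ++y)
    {
        for (int x = ignoreX; x < width - ignoreX; ++x)
        {
            int idx = stride * y + x;
            int off = delta[dir[idx]];      // offset along the gradient
            float m = mag[idx];
            bool isMax = (mag[idx - off] < m) && (m > mag[idx + off]);
            dst[idx] = isMax ? m : 0.0f;    // hopefully compiled to a cmov
        }
    }
}
```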

    Unfortunately, the compiler (at least MSVC 2013) is not smart enough to avoid the branch on isMax. That's why we can benefit from rewriting the inner loop with scalar SSE intrinsics. Look up the Intel Intrinsics Guide for reference. You mostly need intrinsics ending with _ss, since the code is still completely scalar.
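A sketch of the per-pixel step with scalar SSE intrinsics (the helper name is hypothetical; the trick is that the comparison result is an all-ones/all-zeros mask, so ANDing it with the magnitude yields either the magnitude or zero, with no branch at all):

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE

// One pixel with scalar (_ss) SSE intrinsics, fully branchless.
// off is the direction offset taken from the delta table.
inline void NmsPixelScalarSse(float* dst, const float* mag, int idx, int off)
{
    __m128 curr = _mm_load_ss(mag + idx);
    __m128 back = _mm_load_ss(mag + idx - off);
    __m128 forw = _mm_load_ss(mag + idx + off);
    // mask = (curr > back) & (curr > forw): all-ones when a local maximum
    __m128 mask = _mm_and_ps(_mm_cmpgt_ss(curr, back),
                             _mm_cmpgt_ss(curr, forw));
    // mask & curr is either curr (local maximum) or 0.0f
    _mm_store_ss(dst + idx, _mm_and_ps(mask, curr));
}
```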

    Finally, we can vectorize everything except for loading the neighboring pixels. In order to load the neighbors, we can use the _mm_setr_ps intrinsic with scalar arguments, asking the compiler to generate good code for us =)

    __m128 forw = _mm_setr_ps(src[idx+0 + offset0], src[idx+1 + offset1], src[idx+2 + offset2], src[idx+3 + offset3]);
    __m128 back = _mm_setr_ps(src[idx+0 - offset0], src[idx+1 - offset1], src[idx+2 - offset2], src[idx+3 - offset3]);
    
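Assuming `forw` and `back` are gathered as above, one four-pixel step of the hybrid version might look like this (a sketch with my own names; only the loads stay scalar, the comparisons and the blend-with-zero are vectorized):

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE

// Process four consecutive pixels at once. off0..off3 are the per-pixel
// direction offsets taken from the delta table.
inline void NmsStep4(float* dst, const float* mag, int idx,
                     int off0, int off1, int off2, int off3)
{
    __m128 curr = _mm_loadu_ps(mag + idx);
    __m128 forw = _mm_setr_ps(mag[idx + 0 + off0], mag[idx + 1 + off1],
                              mag[idx + 2 + off2], mag[idx + 3 + off3]);
    __m128 back = _mm_setr_ps(mag[idx + 0 - off0], mag[idx + 1 - off1],
                              mag[idx + 2 - off2], mag[idx + 3 - off3]);
    // per-lane mask: all-ones where the pixel beats both of its neighbors
    __m128 mask = _mm_and_ps(_mm_cmpgt_ps(curr, back),
                             _mm_cmpgt_ps(curr, forw));
    // keep the magnitude where the mask is set, zero elsewhere
    _mm_storeu_ps(dst + idx, _mm_and_ps(mask, curr));
}
```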

    I have just implemented it myself. Tested in a single thread on an Ivy Bridge at 3.4 GHz. A random image of 1024 x 1024 resolution was used as the source. The results (in milliseconds) are:

    original: 13.078     //your code
    branchless: 8.556    //'branchless' code
    scalarsse: 2.151     //after rewriting to sse intrinsics
    hybrid: 1.159        //partially vectorized code
    

    They confirm performance improvements on each step. The final code needs a bit more than one millisecond to process a one-megapixel image. The total speedup is about 11.3 times. Indeed, you can get better performance on GPU =)

    I hope the presented information would be enough for you to reproduce the steps. If you are seeking terrible spoilers, look here for my implementations of all these stages.