AVX2 Winner-Take-All Disparity Search

I am optimizing the "winner-take-all" portion of a disparity estimation algorithm using AVX2. My scalar routine is accurate, but at QVGA resolution and 48 disparities the runtime is disappointingly slow at ~14 ms on my laptop. I create both LR and RL disparity images, but for simplicity here I will only include code for the RL search.

My scalar routine:

int MAXCOST = 32000;
for (int i = maskRadius; i < rstep-maskRadius; i++) {

    // WTA "RL" Search:
    for (int j = maskRadius; j+maskRadius < cstep; j++) {
        int minCost = MAXCOST;
        int minDisp = 0;
        for (int d = 0; d < numDisp && j+d < cstep; d++) {
            if (asPtr[(i*numDisp*cstep)+(d*cstep)+j] < minCost) {
                minCost = asPtr[(i*numDisp*cstep)+(d*cstep)+j];
                minDisp = d;
            }
        }
        dRPtr[(i*cstep)+j] = minDisp;
    }
}

My attempt at using AVX2:

int MAXCOST = 32000;
int* dispVals = (int*) _mm_malloc( sizeof(int32_t)*16, 32 );

for (int i = maskRadius; i < rstep-maskRadius; i++) {

    // WTA "RL" Search AVX2:
    for( int j = 0; j < cstep-16; j+=16) {

        __m256i minCosts = _mm256_set1_epi16( MAXCOST );
        __m128i loMask   = _mm_setzero_si128();
        __m128i hiMask   = _mm_setzero_si128();

        for (int d = 0; d < numDisp && j+d < cstep; d++) {
            // Grab 16 costs to compare
            __m256i costs = _mm256_loadu_si256((__m256i*) (asPtr[(i*numDisp*cstep)+(d*cstep)+j]));

            // Get the new minimums
            __m256i newMinCosts = _mm256_min_epu16( minCosts, costs );

            // Compare new mins to old to build mask to store minDisps
            __m256i mask   = _mm256_cmpgt_epi16( minCosts, newMinCosts );
            __m128i loMask = _mm256_extracti128_si256( mask, 0 );
            __m128i hiMask = _mm256_extracti128_si256( mask, 1 );
            // Sign extend to 32bits
            __m256i loMask32 = _mm256_cvtepi16_epi32( loMask );
            __m256i hiMask32 = _mm256_cvtepi16_epi32( hiMask );

            __m256i currentDisp = _mm256_set1_epi32( d );
            // store min disps with mask
            _mm256_maskstore_epi32( dispVals, loMask32, currentDisp );    // RT error, why?
            _mm256_maskstore_epi32( dispVals+8, hiMask32, currentDisp );  // RT error, why?

            // Set minCosts to newMinCosts
            minCosts = newMinCosts;
        }

        // Write the WTA minimums one-by-one to the RL disparity image
        int index = (i*cstep)+j;
        for( int k = 0; k < 16; k++ ) {
            dRPtr[index+k] = dispVals[k];
        }
    }
}
_mm_free( dispVals );

The Disparity Space Image (DSI) is of size HxWxD (320x240x48), which I lay out horizontally for better memory accesses, such that each row is of size WxD.

The Disparity Space Image has per-pixel matching costs. This aggregated with a simple box filter to make another image of the exact same size, but with costs summed over, say, a 3x3 or 5x5 window. This smoothing makes the result more 'robust'. When I am accessing with asPtr, I am indexing into this aggregated costs image.

Also, in an effort to save on unnecessary computation, I have been starting and ending on rows offset by a mask radius. This mask radius is the radius of my census mask. I could be doing some fancy border reflection, but it is simpler and faster just to not bother with the disparity for this border. This of course applies to the beginning and ending cols too, but messing with indexing here is not good when I am forcing my entire algorithm to run only on images whose columns are a multiple of 16 (ex. QVGA: 320x240) so that I can index simply and hit everything with SIMD (no residual scalar processing).

Also, if you think my code is a mess, I encourage you to check out the the highly optimized OpenCV stereo algorithms. I find them impossible and have been able to make little to no use of them.

My code compiles but fails at runtime. I am using VS 2012 Express Update 4. When I run with the debugger I am unable to gain any insights. I am relatively new to using intrinsics and so I am not sure what information I should expect to see when debugging, number of registers, whether __m256i variables should be visible, etc.

Heeding comment advice below, I improved the scalar time from ~14 to ~8 by using smarter indexing. My CPU is an i7-4980HQ and I successfully use AVX2 intrinsics elsewhere in the same file.

Info Image

Solution

I still haven't found the problem, but I did see some things you might want to change. You're not checking the return value of _mm_malloc, though. If it's failing, that would explain it. (Maybe it doesn't like allocating 32-byte aligned memory?)

If you're running your code under a memory checker or something, then maybe it doesn't like reading from uninitialized memory for dispVals. (_mm256_maskstore_epi32 may count as a read-modify-write even if the mask is all-ones.)

Run your code under a debugger and find out what's going wrong. "runtime error" is not very meaningful.

_mm_set1* functions are slow-ish. VPBROADCASTD needs its source in memory or a vector reg, not a GP reg, so the compiler can either movd from a GP reg to a vector reg and then broadcast, or store to memory and then broadcast. Anyway, it would be faster to do

const __m256i add1 = _mm256_set1_epi32( 1 );
__m256i dvec = _mm256_setzero_si256();
for (d;d...;d++) {
    dvec = _mm256_add_epi32(dvec, add1);
}

Other stuff:
This will probably run faster if you aren't storing to memory every iteration of the inner loop. Use a blend instruction (_mm256_blendv_epi8), or something like that, to update the vector(s) of displacements that go with the min costs. Blend = masked move with a register destination.

Also, your displacement values should fit in 16b integers, so don't sign-extend them to 32b until AFTER you're done finding them. Intel CPUs can sign-extend a 16b memory location into gp register on the fly with no speed penalty (movsz is as fast as mov), so prob. just declare your dRPtr array as uint16_t. Then you don't need the sign-extending stuff in your vector code at all (let alone in your inner loop!). Hopefully _mm256_extracti128_si256( mask, 0 ) compiles to nothing, since the 128 you want is already the low128, so just use the reg as the src for vmovsx, but still.

You can also save an instruction (and a fused-domain uop) by not loading first. (unless the compiler is smart enough not to elide the vmovdqu and use vpminuw with a memory operand, even though you used the load intrinsic).

So I'm thinking something like this:

// totally untested, didn't even check that this compiles.
for(i) { for(j) {
// inner loop, compiler can hoist these constants.
const __m256i add1 = _mm256_set1_epi16( 1 );
__m256i dvec = _mm256_setzero_si256();
__m256i minCosts = _mm256_set1_epi16( MAXCOST );
__m256i minDisps = _mm256_setzero_si256();

for (int d=0 ; d < numDisp && j+d < cstep ;
     d++, dvec = _mm256_add_epi16(dvec, add1))
{
    __m256i newMinCosts = _mm256_min_epu16( minCosts, asPtr[(i*numDisp*cstep)+(d*cstep)+j]) );
    __m256i mask   = _mm256_cmpgt_epi16( minCosts, newMinCosts );
    minDisps = _mm256_blendv_epi8(minDisps, dvec, mask); // 2 uops, latency=2
    minCosts = newMinCosts;
}

// put sign extension here if making dRPtr uint16_t isn't an option.
int index = (i*cstep)+j;
_mm256_storeu_si256 (dRPtr + index, __m256i minDisps);
}}

You might get better performance having two parallel dependency chains: minCosts0 / minDisps0, and minCosts1 / minDisps1, and then combining them at the end. minDisps is a loop-carried dependency, but the loop only has 5 instructions (including the vpadd, which looks like loop overhead but can't be reduced by unrolling). They decode to 6 uops (blendv is 2), plus loop overhead. It should run in 1.5cycles / iteration (not counting loop overhead) on haswell, but the dep chain would limit it to one iteration per 2 cycles. (Assuming unrolling to get rid of loop overhead). Doing two dep chains in parallel fixes this, and has the same effect as unrolling the loop: less loop overhead.

Hmm, actually on Haswell,

pminuw can run on p1/p5. (and the load part on p2/p3)
pcmpgtw can run on p1/p5
vpblendvb is 2 uops for p5.
padduw can run on p1/p5
movdqa reg,reg can run on p0/p1/p5 (and may not need an execution unit at all). Unrolling should get rid of any overhead for minCosts = newMinCosts, since the compiler can just end up with newMinCosts from the last unrolled loop body in the right register for the first loop body of the next iteration.
fused sub / jge (loop counter) can run on p6. (using PTEST + jcc on dvec would be slower). add/sub can run on p0/p1/p5/p6 when not fused with a jcc.

Ok, so actually the loop will take 2.5 cycles per iteration, limited by instructions that can only run on p1/p5. Unrolling by 2 or 4 will reduce the loop / movdqa overhead. Since Haswell can issue 4 uops per clock, it can then more efficiently queue up uops for out-of-order execution, since the loop won't have a super-high number of iterations. (48 was your example.) Having lots of uops queued up will give the CPU something to do after leaving the loop, and hide any latencies from cache misses, etc.

_mm256_min_epu16 (PMINUW) is another loop-carried dependency chain. Using it with a memory operand makes it a 3 or 4-cycle latency. However, the load part of the instruction can start as soon as the address is known, so folding a load into a modify op to take advantage of micro-fusion doesn't make dep chains any longer or shorter than using a separate load.

Sometimes you need to use a separate load, for unaligned data (AVX removed the alignment requirement for memory operands). We're limited more by execution units than the 4 uop / clock issue limit, so it's probably fine to use a dedicated load instruction.

source for insn ports / latencies.