c++ visual-studio-2012 assembly code-generation pipelining

Why does Visual Studio increment the loop pointer before dereferencing it?

I looked at the assembly that Visual Studio 2012 generates for the following SIMD code:

    float *end = arr + sz;
    float *b = other.arr;
    for (float *a = arr; a < end; a += 4, b += 4)
    {
        __m128 ax = _mm_load_ps(a);
        __m128 bx = _mm_load_ps(b);
        ax = _mm_add_ps(ax, bx);
        _mm_store_ps(a, ax);
    }

The loop body is:

    $LL11@main:
        movaps  xmm1, XMMWORD PTR [eax+ecx]
        addps   xmm1, XMMWORD PTR [ecx]
        add     ecx, 16                 ; 00000010H
        movaps  XMMWORD PTR [ecx-16], xmm1
        cmp     ecx, edx
        jb      SHORT $LL11@main

Why increment ecx by 16, only to subtract 16 again in the store on the very next line?

Solution

There are basically two ways to order the end of the loop.

Option 1 (what the compiler emitted):

    add ecx, 16
    movaps XMMWORD PTR [ecx-16], xmm1 ; stall for ecx?
    cmp ecx, edx
    jb loop

Option 2 (store first, then increment):

    movaps XMMWORD PTR [ecx], xmm1
    add ecx, 16
    cmp ecx, edx ; stall for ecx?
    jb loop

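In option 1 there is a potential stall between add and movaps, because the store address depends on the new value of ecx. In option 2 there is a potential stall between add and cmp instead. Either way something has to wait for the add, so the dependency alone does not favour one ordering. Note that option 1 does not execute an extra subtraction: the -16 is a displacement encoded in the addressing mode of the movaps, not a separate instruction.

There is also the question of which execution unit is used. add and cmp (effectively a sub that only sets flags) use the ALU, while the [ecx-16] address is computed by the AGU (Address Generation Unit), I believe. So I suspect there is a slight win in option 1, because ALU use is interleaved with AGU use.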
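If you want to see which ordering your own compiler picks, here is a minimal sketch that wraps the loop from the question in a standalone function. The name add_in_place and its parameters are my own, not from the question. Keeping the loop in its own function stops the compiler from folding it away with constant inputs.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_load_ps, _mm_add_ps, _mm_store_ps
#include <cassert>
#include <cstddef>

// Hypothetical helper (the name is an assumption, not from the question):
// adds src into dst in place, four floats per iteration.
// Preconditions: both pointers are 16-byte aligned (required by the aligned
// load/store intrinsics) and n is a multiple of 4.
void add_in_place(float* dst, const float* src, std::size_t n)
{
    assert(n % 4 == 0);

    float* end = dst + n;
    const float* b = src;
    for (float* a = dst; a < end; a += 4, b += 4)
    {
        __m128 ax = _mm_load_ps(a);   // aligned load of 4 floats from dst
        __m128 bx = _mm_load_ps(b);   // aligned load of 4 floats from src
        ax = _mm_add_ps(ax, bx);      // element-wise add
        _mm_store_ps(a, ax);          // aligned store back into dst
    }
}
```

With MSVC, compiling with something like cl /O2 /c /FA file.cpp should write an assembly listing next to the object file. With GCC or Clang, -O2 -S does the same job. The exact instruction order will probably differ between compilers and versions, so treat the listing in the question as one possible outcome rather than the only one.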