c++ visual-studio-2012 assembly code-generation pipelining

Why does Visual Studio increment the loop pointer before dereferencing it?


I checked out Visual Studio 2012's assembly output from the following SIMD code:

    float *end = arr + sz;
    float *b = other.arr;
    for (float *a = arr; a < end; a += 4, b += 4)
    {
        __m128 ax = _mm_load_ps(a);
        __m128 bx = _mm_load_ps(b);
        ax = _mm_add_ps(ax, bx);
        _mm_store_ps(a, ax);
    }

The loop body is:

    $LL11@main:
        movaps  xmm1, XMMWORD PTR [eax+ecx]
        addps   xmm1, XMMWORD PTR [ecx]
        add     ecx, 16                 ; 00000010H
        movaps  XMMWORD PTR [ecx-16], xmm1
        cmp     ecx, edx
        jb      SHORT $LL11@main

Why does the compiler increment ecx by 16, only to subtract 16 again in the store address on the very next line?


Solution

  • There are basically two ways to order the end of the loop body.

     add ecx, 16
     movaps XMMWORD PTR [ecx-16], xmm1 ; stall for ecx?
     cmp ecx, edx
     jb loop
    

    or

     movaps XMMWORD PTR [ecx], xmm1
     add ecx, 16
     cmp ecx, edx ; stall for ecx?
     jb loop
    

    In option 1 there is a potential stall between add and movaps, because the store address depends on the ecx that add has just written. In option 2 there is a potential stall between add and cmp, for the same reason. However, there is also the question of which execution unit each instruction uses. add and cmp (a sub that only sets flags) use the ALU, while computing [ecx-16] uses the AGU (Address Generation Unit), I believe. So I suspect there might be a slight win in option 1, because ALU use is interleaved with AGU use.
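    To make the two orderings concrete at the source level, here is a minimal C++ sketch. It is an illustration only, not what the compiler literally works from: the function names are hypothetical, plain scalar loops over four floats stand in for the movaps/addps pair, and b is kept as a separate pointer instead of the [eax+ecx] offset form in the listing. It shows only that storing at [ecx-16] after the add writes to the same address as storing at [ecx] before it. It says nothing about which ordering is faster.

```cpp
#include <cassert>
#include <cstddef>

// Option 2 ordering: store through the pointer, then advance it.
// Roughly: movaps [ecx], xmm1 / add ecx, 16
void add_store_then_advance(float* a, const float* b, std::size_t n) {
    assert(n % 4 == 0);  // assumption: length is a multiple of 4, as in the SIMD loop
    float* end = a + n;
    while (a < end) {
        float sum[4];
        for (int i = 0; i < 4; ++i) sum[i] = a[i] + b[i];
        for (int i = 0; i < 4; ++i) a[i] = sum[i];      // store at [a]
        a += 4;
        b += 4;
    }
}

// Option 1 ordering: advance the pointer first, then store at a negative offset.
// Roughly: add ecx, 16 / movaps [ecx-16], xmm1
void add_advance_then_store(float* a, const float* b, std::size_t n) {
    assert(n % 4 == 0);  // same assumption as above
    float* end = a + n;
    while (a < end) {
        float sum[4];
        for (int i = 0; i < 4; ++i) sum[i] = a[i] + b[i];
        a += 4;
        b += 4;
        for (int i = 0; i < 4; ++i) a[i - 4] = sum[i];  // store at [a - 4], the old position
    }
}
```

    Both functions leave identical contents in a, so the compiler is free to pick whichever ordering it expects to schedule better.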