Search code examples
c++simdloop-unrollingauto-vectorization

Unrolling pointer increment loop for auto vectorization


I was wondering if unrolling this loop:

for (int i = 0; i < n; i++) {
    *p = a[i]*b[i];
    p++;
}

into

for (int i = 0; i < n; i+=4) {
    *(p + 0) = a[i + 0]*b[i + 0];
    *(p + 1) = a[i + 1]*b[i + 1];
    *(p + 2) = a[i + 2]*b[i + 2];
    *(p + 3) = a[i + 3]*b[i + 3];
    p+=4;
}

would help the compiler in terms of auto-vectorization.


I can imagine that it will vectorize the first loop anyway. But does being explicit help?


Solution

  • For successful auto-vectorization, your compiler needs to be able to determine that the involved variables do not alias, which means the compiler needs certainty that a, b and p never overlap, e.g.:

    void somefunction()
    {
        int a[12] = { ... };
        int b[12] = { ... }
        int p[12];
    
        /* Compiler knows: a, b and p do not overlap */
    }
    
    void multiply(int n, int* p, int* a, int* b)
    {
        /* Compiler unsure: a, b and p could overlap, e.g.:
             multiply(8, array1, array1, array1);
           or worse:
             multiply(8, array1 + 1, array1, array1 + 2);
        */
    }
    

    If they do overlap, the first iteration could influence the next one and therefore they cannot be performed in parallel.

    For a function, you can actually promise the compiler that arguments will not overlap by using the restrict keyword. Unfortunately only officially supported in the C standard and not yet C++. However, a lot of C++ compilers support a similar keyword, e.g. __restrict__ for gcc and clang, and __restrict for MSVC. For example for gcc:

    void multiply(int n, int* __restrict__ p, int* __restrict__ a, int* __restrict__ b)
    
    {
        for (int i = 0; i < n; i++) {
           p[i] = a[i]*b[i];
        }
    }
    

    The resulting code (using gcc -O2 -ftree-vectorize) seems pretty decent:

    multiply(int, int*, int*, int*):
            test    edi, edi
            jle     .L1
            lea     r8d, [rdi-4]
            lea     r9d, [rdi-1]
            shr     r8d, 2
            add     r8d, 1
            cmp     r9d, 2
            lea     eax, [0+r8*4]
            jbe     .L9
            xor     r9d, r9d
            xor     r10d, r10d
    .L5:
            movdqu  xmm0, XMMWORD PTR [rdx+r9]
            add     r10d, 1
            movdqu  xmm2, XMMWORD PTR [rcx+r9]
            movdqa  xmm1, xmm0
            psrlq   xmm0, 32
            pmuludq xmm1, xmm2
            psrlq   xmm2, 32
            pshufd  xmm1, xmm1, 8
            pmuludq xmm0, xmm2
            pshufd  xmm0, xmm0, 8
            punpckldq       xmm1, xmm0
            movups  XMMWORD PTR [rsi+r9], xmm1
            add     r9, 16
            cmp     r10d, r8d
            jb      .L5
            cmp     eax, edi
            je      .L12
    .L3:
            cdqe
    .L7:
            mov     r8d, DWORD PTR [rdx+rax*4]
            imul    r8d, DWORD PTR [rcx+rax*4]
            mov     DWORD PTR [rsi+rax*4], r8d
            add     rax, 1
            cmp     edi, eax
            jg      .L7
            rep ret
    .L1:
            rep ret
    .L12:
            rep ret
    .L9:
            xor     eax, eax
            jmp     .L3
    

    Update: Without the restrict keyword, gcc apparently generates a function that detects aliasing and generates code for both scenarios.

    By the way, your unrolled version does not account for the situation where n is not a multiple of 4 and is therefore functionally different!