I was wondering if unrolling this loop:
for (int i = 0; i < n; i++) {
*p = a[i]*b[i];
p++;
}
into
for (int i = 0; i < n; i+=4) {
*(p + 0) = a[i + 0]*b[i + 0];
*(p + 1) = a[i + 1]*b[i + 1];
*(p + 2) = a[i + 2]*b[i + 2];
*(p + 3) = a[i + 3]*b[i + 3];
p+=4;
}
would help the compiler in terms of auto-vectorization.
I can imagine that it will vectorize the first loop anyway. But does being explicit help?
For successful auto-vectorization, your compiler needs to be able to determine that the involved variables do not alias, which means the compiler needs certainty that a, b and p never overlap, e.g.:
void somefunction()
{
int a[12] = { ... };
int b[12] = { ... }
int p[12];
/* Compiler knows: a, b and p do not overlap */
}
void multiply(int n, int* p, int* a, int* b)
{
/* Compiler unsure: a, b and p could overlap, e.g.:
multiply(8, array1, array1, array1);
or worse:
multiply(8, array1 + 1, array1, array1 + 2);
*/
}
If they do overlap, the first iteration could influence the next one and therefore they cannot be performed in parallel.
For a function, you can actually promise the compiler that arguments will not overlap by using the restrict keyword. Unfortunately only officially supported in the C standard and not yet C++. However, a lot of C++ compilers support a similar keyword, e.g. __restrict__
for gcc and clang, and __restrict
for MSVC. For example for gcc:
void multiply(int n, int* __restrict__ p, int* __restrict__ a, int* __restrict__ b)
{
for (int i = 0; i < n; i++) {
p[i] = a[i]*b[i];
}
}
The resulting code (using gcc -O2 -ftree-vectorize
) seems pretty decent:
multiply(int, int*, int*, int*):
test edi, edi
jle .L1
lea r8d, [rdi-4]
lea r9d, [rdi-1]
shr r8d, 2
add r8d, 1
cmp r9d, 2
lea eax, [0+r8*4]
jbe .L9
xor r9d, r9d
xor r10d, r10d
.L5:
movdqu xmm0, XMMWORD PTR [rdx+r9]
add r10d, 1
movdqu xmm2, XMMWORD PTR [rcx+r9]
movdqa xmm1, xmm0
psrlq xmm0, 32
pmuludq xmm1, xmm2
psrlq xmm2, 32
pshufd xmm1, xmm1, 8
pmuludq xmm0, xmm2
pshufd xmm0, xmm0, 8
punpckldq xmm1, xmm0
movups XMMWORD PTR [rsi+r9], xmm1
add r9, 16
cmp r10d, r8d
jb .L5
cmp eax, edi
je .L12
.L3:
cdqe
.L7:
mov r8d, DWORD PTR [rdx+rax*4]
imul r8d, DWORD PTR [rcx+rax*4]
mov DWORD PTR [rsi+rax*4], r8d
add rax, 1
cmp edi, eax
jg .L7
rep ret
.L1:
rep ret
.L12:
rep ret
.L9:
xor eax, eax
jmp .L3
Update: Without the restrict keyword, gcc apparently generates a function that detects aliasing and generates code for both scenarios.
By the way, your unrolled version does not account for the situation where n is not a multiple of 4 and is therefore functionally different!