I would like to have a general understanding of when I can expect a compiler to vectorize a loop, and when it is worth unrolling the loop myself to help it decide to use vectorization.
I understand the details matter a lot (which compiler, which compilation options, which architecture, how the code in the loop is written, etc.), but I wonder if there are some general guidelines for modern compilers.
To be more specific, here is an example with a simple loop (the code is not supposed to compute anything useful):
double *A, *B; // two arrays
int delay = something;
[...]
double numer = 0, denomB = 0, denomA = 0;
for (int idxA = 0; idxA < Asize; idxA++)
{
    int idxB = idxA + (Bsize-Asize)/2 + delay;
    numer  += A[idxA] * B[idxB];
    denomA += A[idxA] * A[idxA];
    denomB += B[idxB] * B[idxB];
}
Can I expect a compiler to vectorize the loop, or would it be useful to rewrite the code like the following?
for (int idxA = 0; idxA < Asize; idxA += 4)
{
    int idxB = idxA + (Bsize-Asize)/2 + delay;

    numer  += A[idxA] * B[idxB];
    denomA += A[idxA] * A[idxA];
    denomB += B[idxB] * B[idxB];

    numer  += A[idxA+1] * B[idxB+1];
    denomA += A[idxA+1] * A[idxA+1];
    denomB += B[idxB+1] * B[idxB+1];

    numer  += A[idxA+2] * B[idxB+2];
    denomA += A[idxA+2] * A[idxA+2];
    denomB += B[idxB+2] * B[idxB+2];

    numer  += A[idxA+3] * B[idxB+3];
    denomA += A[idxA+3] * A[idxA+3];
    denomB += B[idxB+3] * B[idxB+3];
}
Short answer, as others have said: there are no general guidelines if you do not specify the compiler and target architecture.
As a remark, these days it is generally better to let the compiler optimize the code, because it "knows" the capabilities of the architecture better. There are cases where unrolling the loops by hand will not be faster.
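In that spirit, here is a sketch of a compiler-friendly version of your loop (the `restrict` qualifiers, the hoisted `offset`, and the function name `correlate` are my additions, not from your code). Two things usually matter more than manual unrolling: telling the compiler the arrays do not alias, and allowing it to reorder the floating-point reductions (GCC only vectorizes such reductions with something like `-ffast-math` or `-fassociative-math`):

```c
#include <stddef.h>

/* Same reductions as in the question, written to be vectorizer-
 * friendly: "restrict" promises A and B do not alias, and the
 * loop-invariant part of idxB is hoisted out of the loop.
 * With GCC, try -O3 -ffast-math and use -fopt-info-vec to see
 * which loops were actually vectorized. */
static void correlate(const double *restrict A, size_t Asize,
                      const double *restrict B, size_t Bsize,
                      ptrdiff_t delay, double *numer,
                      double *denomA, double *denomB)
{
    ptrdiff_t offset = ((ptrdiff_t)Bsize - (ptrdiff_t)Asize) / 2 + delay;
    double n = 0.0, dA = 0.0, dB = 0.0;
    for (size_t idxA = 0; idxA < Asize; idxA++) {
        double a = A[idxA];
        double b = B[idxA + offset];
        n  += a * b;
        dA += a * a;
        dB += b * b;
    }
    *numer  = n;
    *denomA = dA;
    *denomB = dB;
}
```

For example, with A = {1, 2, 3}, B = {0, 1, 2, 3, 4} and delay 0, the offset is 1 and all three sums come out to 14.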
If someone sees this and needs it: GCC has the -funroll-loops flag for this.