c++ clang llvm vectorization auto-vectorization

Why is vectorization not beneficial in this for loop?


I am trying to vectorize this for loop. When I compile with Clang's -Rpass flags, I get the remarks shown below for it:

int someOuterVariable = 0;

for (unsigned int i = 7; i != -1; i--)
{
  array[someOuterVariable + i] -= 0.3 * anotherArray[i];
}

Remarks:
the cost-model indicates that vectorization is not beneficial
the cost-model indicates that interleaving is not beneficial

I want to understand what this means. Does "interleaving is not beneficial" mean that the array indexing is not proper?


Solution

  • It's hard to answer without more details about your types. In general, though, starting a loop has a cost, and vectorising has costs of its own (such as moving data to and from SIMD registers and ensuring the data is properly aligned).

    I'm guessing here that the compiler is telling you that the cost of vectorising is higher than the cost of simply running the 8 iterations without it, so it doesn't vectorise.

    Try increasing the number of iterations, or help the compiler reason about alignment, for example.

    Typically, unless the array's elements happen to have exactly the alignment a SIMD vector needs, accessing an array from an "unknown" offset (what you've called someOuterVariable) prevents the compiler from generating efficient vectorised code.
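
    As a rough illustration of the advice above, here is a minimal sketch. It assumes the arrays hold float (the question doesn't say) and that the two arrays never overlap. The function name scaledSubtract and its parameters are made up for this example. __restrict is a compiler extension (accepted by Clang, GCC and MSVC), not standard C++. The loop counts upward with a std::size_t index, takes the trip count as a parameter so it can be larger than 8, and uses the float constant 0.3f so no conversion to double is needed. Whether the vectoriser actually fires still depends on the target and the trip count.

    ```cpp
    #include <cassert>
    #include <cstddef>

    // Hypothetical helper: dst[offset + i] -= 0.3f * src[i] for i in [0, n).
    // Assumption: dst and src never overlap; __restrict is the promise we
    // make to the compiler about that, it is not checked at run time.
    void scaledSubtract(float* __restrict dst, const float* __restrict src,
                        std::size_t offset, std::size_t n)
    {
        assert(dst != nullptr && src != nullptr);

        // Apply the "unknown" offset once, outside the loop.
        float* out = dst + offset;

        // Simple upward-counting loop with an obvious trip count.
        for (std::size_t i = 0; i < n; ++i)
        {
            out[i] -= 0.3f * src[i];
        }
    }
    ```

    Compiling this with something like clang++ -O2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize should print what the vectoriser decided for the loop.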

    EDIT: About the "interleaving" question, it's hard to guess without knowing your tool. In general, though, interleaving means mixing 2 streams of computation so that the compute units of the CPU are all busy. For example, if you have 2 ALUs in your CPU, and the program is doing:

    c = a + b;
    d = e * f;
    

    The compiler can interleave the computations so that the addition and the multiplication happen at the same time (provided 2 ALUs are available). Typically, this means that the multiplication, which takes a bit longer to compute (for example 6 cycles), will be started before the addition (for example 3 cycles). You'll then get the result of both operations after only 6 cycles, instead of the 9 it would take if the compiler serialized the computations. This is only possible if there are no dependencies between the computations (if d required c, it would not work). A compiler is very cautious about this and, in your example, will not apply this optimization if it can't prove that array and anotherArray don't alias.
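
    For Clang/LLVM specifically, the loop vectoriser uses "interleaving" for the same idea applied across loop iterations: the loop is unrolled so that several independent iterations are in flight at once. Below is a minimal hand-written sketch of that transformation. The function sumInterleaved and the choice of float data are assumptions made for illustration, not something taken from the question.

    ```cpp
    #include <cassert>
    #include <cstddef>

    // Hypothetical example: sum an array using two independent accumulators,
    // i.e. the loop is manually interleaved by a factor of 2.
    float sumInterleaved(const float* data, std::size_t n)
    {
        assert(data != nullptr || n == 0);

        float acc0 = 0.0f;  // running sum of even-indexed elements
        float acc1 = 0.0f;  // running sum of odd-indexed elements

        std::size_t i = 0;
        for (; i + 1 < n; i += 2)
        {
            acc0 += data[i];      // does not depend on acc1
            acc1 += data[i + 1];  // does not depend on acc0
        }

        // Handle the leftover element when n is odd.
        if (i < n)
        {
            acc0 += data[i];
        }

        return acc0 + acc1;
    }
    ```

    The two accumulators form independent dependency chains, so a CPU with enough execution units can overlap the additions. The additions happen in a different order than in a plain loop, so floating-point results can differ slightly, which is one reason a compiler won't do this to a float reduction on its own unless you allow it (for example with -ffast-math).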