Search code examples
c++optimizationvectorizationmulticoresimd

How to optimize a simple loop?


The loop is simple

void loop(int n, double* a, double const* b)
{
#pragma ivdep
    for (int i = 0; i < n; ++i, ++a, ++b)
        *a *= *b;
}

I am using intel c++ compiler and using #pragma ivdep for optimization currently. Any way to make it perform better like using multicore and vectorization together, or other techniques?


Solution

    1. This loop is absolutely vectorizable by compiler. But make sure that loop was actually vectorized (using Compiler' -qopt-report5, assembly output, Intel (vectorization) Advisor, whatever other techniques). One more overkill way to do that is creating performance baseline using -no-vec option (which will disable ivdep-driven and auto-vectorization) and then compare execution time against it. This is not good way for checking vectorization presence, but it's useful for general performance analysis for next bullets.

    If loop hasn't been actually vectorized, make sure you push compiler to auto-vectorize it. In order to push compiler see next bullet. Note that next bullet could be useful even if loop was succesfully auto-vectorized.

    1. To push compiler to vectorize it use: (a) restrict keyword to "disambiguate" a and b pointers (someone has already suggested it to you). (b) #pragma omp simd (which has extra bonus of being more portable and much more flexible than ivdep, but also has a drawback of being unsupported in old compilers before intel compiler version 14 and for other loops is more "dangerous"). To re-emphasize: given bullet may seem to do the same thing as ivdep, but depending on various circumstances it could be better and more powerful option.

    2. Given loop has fine-grain iterations (too small amount of computations per single iteration) and overall is not purely compute-bound (so effort/cycles spent by CPU to load/store data from/to cache/memory is comparable if not bigger to effort/cycles spent to perform multiplication). Unrolling is often good way to slightly mitigate such disadvantages. But I would recommend to explicitly ask compiler to unroll it, by using #pragma unroll. In fact, for certain compiler versions the unrolling will happen automatically. Again, you can check whenever compiler did it by using -qopt-report5, loop assembly or Intel (Vectorization) Advisor: enter image description here

    3. In given loop you deal with "streaming" access pattern. I.e. you are contiguously loading/store data from/to memory (and cache sub-system will not help a lot for big "n" values). So, depending on target hardware, usage of multi-threading (atop of SIMD), etc, your loop will likely become memory bandwidth bound in the end. Once you become memory bandwidth bound, you could use techniques like loop blocking, non-temporal stores, aggressive prefetching. All of these techniques worth separate article, although for prefetching/NT-stores you have some pragmas in Intel Compiler to play with.

    4. If n is huge, and you already got prepared to memory bandwidth troubles, you could use things like #pragma omp parallel for simd, which will simulteneously thread-parallelize and vectorize the loop. However quality of this feature has been made decent only in very fresh compiler versions AFAIK, so maybe you'd prefer to split n semi-manually. I.e. n=n1xn2xn3, where n1 - is number of iterations to distribute among threads, n2 - for cache blocking, n3 - for vectorization. Rewrite given loop to make it loopnest of 3 nested loops, where outer loop has n1 iterations (and #pragma omp parallel for is applied), next level loop has n2 iterations, n3 - is innermost (where #pragma omp simd is applied).

    Some up to date links with syntax examples and more info:

    Note1: I apologize that I don't provide various code snippets here. There are at least 2 justifiable reasons for not providing them here: 1. My 5 bullets are pretty much applicable to very many kernels, not just to yours. 2. On the other hand specific combination of pragmas/manual rewriting techniques and corresponding performance results will vary depending on target platform, ISA and Compiler version.

    Note2: Last comment regarding your GPU question. Think of your loop vs. simple industry benchmarks like LINPACK or STREAM. In fact your loop could become somewhat very similar to some of them in the end. Now think of x86 CPUs and especially Intel Xeon Phi platform characteristics for LINPACK/STREAM. They are very good indeed and will become even better with High Bandwidth Memory platforms (like Xeon Phi 2nd gen). So theoretically there is no any single reason to think that your given loop is not well mapped to at least some variants of x86 hardware (note that I didn't say similar thing for arbitrary kernel in universe).