Tags: gcc, vectorization, openmp, compiler-optimization, icc

Unable to vectorize using only OpenMP


I'm trying to understand some basics on how to vectorize my code for performance.

Question: With -O0 I tried to use the OpenMP SIMD directive as follows:

    struct aligned_free
    {
        inline void operator()(double* ptr)
        {
            if (ptr != nullptr)
            {
                std::free(ptr);
            }
        }
    };
    using unique_ptr_aligned_double = std::unique_ptr<double, aligned_free>;
    auto result = unique_ptr_aligned_double(static_cast<double*>(std::aligned_alloc(64, n * sizeof(double))), aligned_free());
    const auto list_a = unique_ptr_aligned_double(static_cast<double*>(std::aligned_alloc(64, n * sizeof(double))), aligned_free());
    const auto list_b = unique_ptr_aligned_double(static_cast<double*>(std::aligned_alloc(64, n * sizeof(double))), aligned_free());
    const auto list_c = unique_ptr_aligned_double(static_cast<double*>(std::aligned_alloc(64, n * sizeof(double))), aligned_free());
    const auto list_d = unique_ptr_aligned_double(static_cast<double*>(std::aligned_alloc(64, n * sizeof(double))), aligned_free());

    auto* r = result.get();
    const auto* a = list_a.get();
    const auto* b = list_b.get();
    const auto* c = list_c.get();
    const auto* d = list_d.get();

    auto k_index = std::size_t{};
    #pragma omp simd safelen(4) linear(k_index:1)
    for (; k_index < n; ++k_index)
    {
        r[k_index] = a[k_index] * b[k_index];
    }

To my disappointment, the above code was not vectorized. What am I missing here? Should I unroll the loop manually or help the compiler in any other way? Or should I just start using compiler intrinsics instead of OpenMP?

Side-topic question: While doing this, I wondered: is it better to use the compiler flags -O2/-O3, or rather to hand-pick the optimizations I want, e.g. -xHost, -faggressive-loop-optimizations, etc.? It seems to my (uninformed) self that it would perhaps be a better idea to know exactly what my compiler is doing.


Solution

  • On GCC, ICC and Clang, omp simd influences the auto-vectorization optimization step (by attaching metadata to loops). However, that step only runs when optimizations are enabled, so the pragma annotation is simply ignored at -O0 by all three compilers. This is expected behaviour. Here is the result you can get from the three compilers.

    Some compilers enable auto-vectorization at -O2 (ICC), while others do so at -O3 (GCC, and probably Clang as well). This is because -On (with n an integer) is just a well-defined set of optimizations, and that set varies from one compiler to another. You can instead specify the individual optimization flags required to vectorize the loop (e.g. -ftree-vectorize for GCC). While this tends to be better when you target one specific compiler (more deterministic, finer-grained control), it is not great for portability (the options are not the same across compilers and may change between versions).
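As an illustration of the flags discussed above, the invocations might look like the following (mul.cpp is a placeholder file name, not from the original post; these are command-line sketches, not a tested build recipe):

```shell
# GCC: the auto-vectorizer is part of -O3, or can be requested explicitly at -O2
g++ -O3 -fopenmp-simd mul.cpp -o mul
g++ -O2 -ftree-vectorize -fopenmp-simd mul.cpp -o mul

# ICC: auto-vectorization is already enabled at -O2
icc -O2 -qopenmp-simd mul.cpp -o mul
```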

    Moreover, do not forget the -fopenmp-simd flag for GCC/Clang and -qopenmp-simd for ICC. It is especially important for Clang. Note also that the k_index = 0 initialization needs to be inside the for statement of the loop.

    Finally, compilers tend not to use AVX, AVX2 or AVX-512 instructions on x86/x86-64 platforms by default, because they are not available on all processors (the older SSE instructions are used instead). Using -mavx/-mavx2, for example, lets GCC/Clang generate wider SIMD instructions (which are often faster). Using -march=native is better if you do not plan to distribute the generated binaries or to execute them on another machine (otherwise the generated binary can simply crash if an instruction is unsupported on the target machine). Alternatively, you can specify a particular architecture, e.g. -march=skylake. ICC has similar options/flags.
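For instance (again with the hypothetical mul.cpp, as command-line sketches only):

```shell
# Target the full instruction set of the build machine
# (the binary may crash on CPUs lacking those instructions):
g++ -O3 -fopenmp-simd -march=native mul.cpp -o mul

# Or pin an explicit baseline for distributable binaries:
g++ -O3 -fopenmp-simd -mavx2 mul.cpp -o mul
g++ -O3 -fopenmp-simd -march=skylake mul.cpp -o mul
```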

    With all of that, Clang, GCC and ICC are able to generate a proper SIMD implementation (see here for the generated code).