Tags: c++, gcc, openmp, vectorization, simd

OpenMP vectorised code runs way slower than O3 optimized code


I have a minimal reproducible example, which is as follows:

#include <iostream>
#include <chrono>
#include <immintrin.h>
#include <vector>
#include <numeric>
#include <cstdlib>   // aligned_alloc



template<typename type>
void AddMatrixOpenMP(type* matA, type* matB, type* result, size_t size){
        for(size_t i=0; i < size * size; i++){
            result[i] = matA[i] + matB[i];
        }
}


int main(){
    size_t size = 8192;

    //std::cout<<sizeof(double) * 8<<std::endl;
    

    auto matA = (float*) aligned_alloc(sizeof(float), size * size * sizeof(float));
    auto matB = (float*) aligned_alloc(sizeof(float), size * size * sizeof(float));
    auto result = (float*) aligned_alloc(sizeof(float), size * size * sizeof(float));


    for(int i = 0; i < size * size; i++){
        *(matA + i) = i;
        *(matB + i) = i;
    }

    auto start = std::chrono::high_resolution_clock::now();

    for(int j = 0; j < 500; j++){
        AddMatrixOpenMP<float>(matA, matB, result, size);
    }

    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();

    std::cout<<"Average Time is = "<<duration/500<<std::endl;
    std::cout<<*(result + 100)<<"  "<<*(result + 1343)<<std::endl;

}

I experiment as follows: I time the code with a #pragma omp for simd directive on the loop in the AddMatrixOpenMP function, and then time it without the directive. I compile the code with g++ -O3 -fopenmp example.cpp.
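
For reference, the listing above is the variant without the directive; the OpenMP-vectorized variant I time differs only in the pragma on the loop, roughly like this:

template<typename type>
void AddMatrixOpenMP(type* matA, type* matB, type* result, size_t size){
        #pragma omp for simd    // the directive described above; not present in the listing
        for(size_t i=0; i < size * size; i++){
            result[i] = matA[i] + matB[i];
        }
}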

Upon inspecting the assembly, both variants generate vector instructions, but when the OpenMP pragma is explicitly specified, the code runs about 3 times slower.
I am not able to understand why.

Edit - I am running GCC 9.3 and OpenMP 4.5, on an i7-9750H (6C/12T) under Ubuntu 20.04. I ensured no major processes were running in the background. The CPU frequency held more or less constant during the run for both versions (minor variations from 4.0 to 4.1 GHz).

TIA


Solution

  • The non-OpenMP vectorizer is defeating your benchmark with loop inversion.
    Make your function __attribute__((noinline, noclone)) to stop GCC from inlining it into the repeat loop. For cases like this, where the function is large enough that call/ret overhead is minor and constant propagation isn't important, this is a pretty good way to make sure the compiler doesn't hoist work out of the loop.
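
    A minimal sketch of that change applied to the function from the question (GCC attribute syntax; the function body itself is unchanged):

    template<typename type>
    __attribute__((noinline, noclone))   // keep the call inside the repeat loop; GCC may not inline or clone it
    void AddMatrixOpenMP(type* matA, type* matB, type* result, size_t size){
        for(size_t i = 0; i < size * size; i++){
            result[i] = matA[i] + matB[i];
        }
    }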

    And in future, check the asm, and/or make sure the benchmark time scales linearly with the iteration count. e.g. increasing 500 up to 1000 should give the same average time in a benchmark that's working properly, but it won't with -O3. (Although it's surprisingly close here, so that smell test doesn't definitively detect the problem!)


    After adding the missing #pragma omp simd to the code, yeah, I can reproduce this. On an i7-6700K Skylake (3.9 GHz, with DDR4-2666) with GCC 10.2 -O3 (without -march=native or -fopenmp), I get an average of 18266 us; with -O3 -fopenmp I get an average of 39772 us.

    With the OpenMP vectorized version, if I look at top while it runs, memory usage (RSS) is steady at 771 MiB. (As expected: init code faults in the two inputs, and the first iteration of the timed region writes to result, triggering page-faults for it, too.)

    But with the "normal" vectorizer (not OpenMP), I see the memory usage climb from ~500 MiB until it exits just as it reaches the max 770MiB.

    So it looks like gcc -O3 performed some kind of loop inversion after inlining and defeated the memory-bandwidth-intensive aspect of your benchmark loop, only touching each array element once.

    The asm shows the evidence: GCC 9.3 -O3 on Godbolt doesn't vectorize, and it leaves an empty inner loop instead of repeating the work.

    .L4:                    # outer loop
            movss   xmm0, DWORD PTR [rbx+rdx*4]
            addss   xmm0, DWORD PTR [r13+0+rdx*4]        # one scalar operation
            mov     eax, 500
    .L3:                             # do {
            sub     eax, 1                   # empty inner loop after inversion
            jne     .L3              # }while(--i);
    
            add     rdx, 1
            movss   DWORD PTR [rcx], xmm0
            add     rcx, 4
            cmp     rdx, 67108864
            jne     .L4
    

    This is only 2 or 3x faster than fully doing the work. Probably because it's not vectorized, and it's effectively running a delay loop instead of optimizing away the empty inner loop entirely. And because modern desktops have very good single-threaded memory bandwidth.
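
    In C++ terms, the transformed code behaves roughly like this hand-written sketch (my paraphrase of the structure shown in the asm above, not actual compiler output):

    for(size_t i = 0; i < size * size; i++){             // outer loop: once per element
        float sum = matA[i] + matB[i];                    // the real work, done only once
        for(int j = 0; j < 500; j++){ /* empty */ }       // what's left of the repeat loop: a pure delay loop
        result[i] = sum;                                  // stored after the delay loop, as in the asm
    }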

    Bumping up the repeat count from 500 to 1000 only improved the computed "average" from 18266 to 17821 us per iter. An empty loop still takes 1 iteration per clock. Normally scaling linearly with the repeat count is a good litmus test for broken benchmarks, but this is close enough to be believable.
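
    A rough sanity check with the numbers above (assuming the empty loop really does run at ~1 iteration per clock at ~3.9 GHz):

        67,108,864 elements * 500 empty iterations  ~= 3.4e10 cycles
        3.4e10 cycles / 3.9e9 Hz                    ~= 8.6 s spent in the delay loop alone
        vs. 18266 us * 500 reps                     ~= 9.1 s total measured

    So nearly all of the measured time is the leftover delay loop, not the actual adds.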

    There's also the overhead of page faults inside the timed region, but the whole thing runs for multiple seconds so that's minor.
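
    If you wanted to take even that out of the timed region, one option (my addition, not something the original code does) is to fault in the result pages before starting the clock, e.g. just before the timing in main:

    #include <cstring>
    // ...
    std::memset(result, 0, size * size * sizeof(float));    // touch every page of result once, outside the timed region
    auto start = std::chrono::high_resolution_clock::now();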


    The OpenMP-vectorized version does respect your benchmark repeat-loop. (Or to put it another way, it doesn't manage to find the huge optimization that's possible in this code.)


    Looking at memory bandwidth while the benchmark is running:

    Running intel_gpu_top -l while a properly-behaving version of the benchmark is running (the OpenMP build, or the plain build with __attribute__((noinline, noclone))) shows the following. IMC is the Integrated Memory Controller on the CPU die, shared by the IA cores and the GPU via the ring bus; that's why a GPU-monitoring program is useful here.

    $ intel_gpu_top -l
     Freq MHz      IRQ RC6 Power     IMC MiB/s           RCS/0           BCS/0           VCS/0          VECS/0 
     req  act       /s   %     W     rd     wr       %  se  wa       %  se  wa       %  se  wa       %  se  wa 
       0    0        0  97  0.00  20421   7482    0.00   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
       3    4       14  99  0.02  19627   6505    0.47   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
       7    7       20  98  0.02  19625   6516    0.67   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
      11   10       22  98  0.03  19632   6516    0.65   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
       3    4       13  99  0.02  19609   6505    0.46   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
    

    Note the ~19.6 GB/s read / ~6.5 GB/s write. Read is roughly 3x write because the stores to result aren't NT stores, so each destination cache line is first read into cache (read-for-ownership) before being written: main memory sees three read streams (matA, matB, and the RFO of result) but only one write-back stream.
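
    (As an aside, not something the original code does: if you did want to avoid that extra read of the destination, the add loop could use NT stores via intrinsics. A sketch, assuming AVX, size*size divisible by 8, and buffers allocated with at least 32-byte alignment, unlike the aligned_alloc(sizeof(float), ...) calls in the question:)

    #include <immintrin.h>
    #include <cstddef>

    void AddMatrixNT(const float* matA, const float* matB, float* result, size_t size){
        for(size_t i = 0; i < size * size; i += 8){
            __m256 a = _mm256_load_ps(matA + i);               // 32-byte-aligned loads
            __m256 b = _mm256_load_ps(matB + i);
            _mm256_stream_ps(result + i, _mm256_add_ps(a, b)); // NT store: no RFO of the destination line
        }
        _mm_sfence();   // order the NT stores before any later loads of result
    }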

    But with -O3 defeating the benchmark, with a 1000 repeat count, we see only near-idle levels of main-memory bandwidth.

     Freq MHz      IRQ RC6 Power     IMC MiB/s           RCS/0           BCS/0           VCS/0          VECS/0 
     req  act       /s   %     W     rd     wr       %  se  wa       %  se  wa       %  se  wa       %  se  wa 
    ...
       8    8       17  99  0.03    365     85    0.62   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
       9    9       17  99  0.02    349     90    0.62   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
       4    4        5 100  0.01    303     63    0.25   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
       7    7       15 100  0.02    345     69    0.43   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
      10   10       21  99  0.03    350     74    0.64   0   0    0.00   0   0    0.00   0   0    0.00   0   0 
    

    vs. a baseline of 150 to 180 MB/s read, 35 to 50MB/s write when the benchmark isn't running at all. (I have some programs running that don't totally sleep even when I'm not touching the mouse / keyboard.)