Consider the following code:
#include <stdio.h>
#include <omp.h>

#define ARRAY_SIZE (1024)

float A[ARRAY_SIZE];
float B[ARRAY_SIZE];
float C[ARRAY_SIZE];

int main(void)
{
    for (int i = 0; i < ARRAY_SIZE; i++)
    {
        A[i] = i * 2.3;
        B[i] = i + 4.6;
    }

    double start = omp_get_wtime();
    for (int loop = 0; loop < 1000000; loop++)
    {
        #pragma omp simd
        for (int i = 0; i < ARRAY_SIZE; i++)
        {
            C[i] = A[i] * B[i];
        }
    }
    double end = omp_get_wtime();

    printf("Work consumed %f seconds\n", end - start);
    return 0;
}
I build and run it on my machine, and it outputs:
$ gcc -fopenmp parallel.c
$ ./a.out
Work consumed 2.084107 seconds
If I comment out "#pragma omp simd", then build and run it again:
$ gcc -fopenmp parallel.c
$ ./a.out
Work consumed 2.112724 seconds
We can see "#pragma omp simd
" doesn't get big performance gain. But if I add -O2
option, no "#pragma omp simd
":
$ gcc -O2 -fopenmp parallel.c
$ ./a.out
Work consumed 0.446662 seconds
With "#pragma omp simd
":
$ gcc -O2 -fopenmp parallel.c
$ ./a.out
Work consumed 0.126799 seconds
We can see a big improvement. But if I use -O3, without "#pragma omp simd":
$ gcc -O3 -fopenmp parallel.c
$ ./a.out
Work consumed 0.127563 seconds
with "#pragma omp simd
":
$ gcc -O3 -fopenmp parallel.c
$ ./a.out
Work consumed 0.126727 seconds
We can see the results are similar again.
Why does "#pragma omp simd
" only take big performance improvement in -O2
under gcc
compiler?
Forget about timing with -O0; it's a total waste of time.
gcc -O3 attempts to auto-vectorize all loops, so using OpenMP pragmas only helps for loops that would otherwise need -ffast-math, restrict qualifiers, or similar help to clear the obstacles to correctness under all possible circumstances that the compiler has to satisfy when auto-vectorizing pure C. (Apparently there are no obstacles here: it's not a reduction, the operations are purely vertical, and you're operating on static arrays, so the compiler can see they don't overlap.)
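To make the aliasing obstacle concrete, here's a minimal sketch (this function is mine, not from the question's code):

/* With plain float* parameters, gcc can't prove that c doesn't
 * overlap a or b, so at -O3 it must emit a runtime overlap check
 * (or keep the loop scalar).  The restrict qualifiers promise no
 * overlap; #pragma omp simd on the loop would make the same promise. */
void mul(float *restrict c, const float *restrict a,
         const float *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}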
gcc -O2 does not enable -ftree-vectorize (until GCC 12, which turned it on at -O2 with a cheaper cost model), so you only get vectorization if you use OpenMP pragmas to ask for it on specific loops.
Note that clang enables auto-vectorization at -O2.
GCC's auto-vectorization strategies may differ between OpenMP and vanilla loops. IIRC, for OpenMP loops, gcc may just use unaligned loads/stores instead of going scalar until reaching an alignment boundary. This has no performance downside with AVX if the data is aligned at runtime, even if that fact wasn't known at compile time, and it saves a lot of code bloat vs. gcc's massive fully-unrolled startup/cleanup code.
It makes sense that if you're asking for SIMD vectorization with OpenMP, you've probably aligned your data to avoid cache-line splits. But C doesn't make it very convenient to pass along the fact that a pointer to float has more alignment than the width of a float (especially that it usually has that property, even if the function still has to work in the rare cases when it doesn't).