I have a rather simple for loop summing a very large array of double values x
(100 mio data points) in C. I want to do this in parallel with SIMD reductions, using a specified number of threads. The OpenMP instruction in my reading should be:
int nthreads = 4, l = 1e8;
double sum = 0.0;
#pragma omp parallel for simd num_threads(nthreads) reduction(+:sum)
for (int i = 0; i < l; ++i) sum += x[i];
This however gives a compiler warning
loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
and running it with multiple threads is slower than single threaded. I'm using the Apple M1 Mac with clang
(Xclang
) v13.0.0 compiler. What I would like to know is: is this an issue with my system or is there actually something wrong / infeasible with this OpenMP instruction?
This compiles without warning on clang >= 15, but performance depends on the system. With the Apple M1 it seems that multithreading does not add much to the SIMD vectorization and single threaded execution with a #pragma omp simd reduction(+:sum)
instruction is about as good as it gets.