Regarding OpenMP Parallel SIMD Reductions

I have a rather simple for loop in C that sums a very large array x of double values (100 million data points). I want to do this in parallel with SIMD reductions, using a specified number of threads. As I read the OpenMP specification, the directive should be:

int nthreads = 4, l = 1e8;
double sum = 0.0;

#pragma omp parallel for simd num_threads(nthreads) reduction(+:sum)
for (int i = 0; i < l; ++i) sum += x[i];

This, however, gives a compiler warning:

loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]

and running it with multiple threads is slower than running it single-threaded. I'm using an Apple M1 Mac with the clang (Xclang) v13.0.0 compiler. What I would like to know is: is this an issue with my system, or is there actually something wrong or infeasible with this OpenMP directive?

Solution

The directive itself is valid OpenMP: it compiles without the warning on clang >= 15, but performance depends on the system. On the Apple M1, multithreading seems to add little on top of SIMD vectorization, and single-threaded execution with a #pragma omp simd reduction(+:sum) directive is about as good as it gets.
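
For comparison, here is a minimal sketch of the two variants side by side. The function names sum_simd and sum_parallel_simd are hypothetical, chosen only for illustration, and the loop index is a long rather than the int used in the question. If the file is compiled without OpenMP enabled, the pragmas are ignored and both functions fall back to a plain sequential sum.

```c
#include <stddef.h>

/* Single-threaded variant: the simd reduction lets the compiler reorder
   the additions so the loop can be vectorized. */
double sum_simd(const double *x, long n)
{
    double sum = 0.0;
    #pragma omp simd reduction(+:sum)
    for (long i = 0; i < n; ++i) sum += x[i];
    return sum;
}

/* Multi-threaded variant from the question: iterations are split across
   nthreads threads, and each thread's chunk is additionally vectorized. */
double sum_parallel_simd(const double *x, long n, int nthreads)
{
    double sum = 0.0;
    #pragma omp parallel for simd num_threads(nthreads) reduction(+:sum)
    for (long i = 0; i < n; ++i) sum += x[i];
    return sum;
}
```

Timing both functions on the same array (for example with omp_get_wtime() from omp.h) should show whether the extra threads pay off on a given machine. Because both reductions reorder the floating-point additions, the two results can differ in the last few bits from each other and from a strictly sequential sum.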