I'm using OpenMP to parallelize the loop, that is internally using AVX-512 with Agner Fog's VCL Vector Class Library.
Here is the code:
double HarmonicSeries(const unsigned long long int N) {
unsigned long long int i;
Vec8d divV(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0);
Vec8d sumV(0.0);
const Vec8d addV(8.0);
const Vec8d oneV(1.0);
#pragma omp parallel for reduction(+:sumV,divV)
for(i=0; i<N; ++i) {
sumV += oneV / divV;
divV += addV;
}
return horizontal_add(sumV);
}
When trying to compile the code above, I'm getting
g++ -Wall -Wextra -O3 -g -I include -fopenmp -m64 -mavx2 -mfma -std=c++17 -o harmonic_series harmonic_series.cpp
harmonic_series.cpp:87:40: error: user defined reduction not found for ‘sumV’
87 | #pragma omp parallel for reduction(+:sumV,divV)
| ^~~~
harmonic_series.cpp:87:45: error: user defined reduction not found for ‘divV’
87 | #pragma omp parallel for reduction(+:sumV,divV)
Any hints on how to solve this and provide the user-defined reduction for the Vec8d
class? It's simply the plus operator which is defined by the VCL class, but I cannot find any example how to code this.
Thanks a lot for any help!
Here is the final solution. It's using automatic reduction and avoids u64->fp conversion by computing divV = startdivV + i * addV
only in the first iteration of each loop and then using divV += addV
for all other iterations. Runtime to compute the sum of first 9.6e10 elements is {real 9s, user 1m46s} with 12 threads on Intel Core i7-10850H CPU - it's the same as for the manual reduction.
double HarmonicSeries(const unsigned long long int N) {
unsigned long long int i;
Vec8d divV(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0);
Vec8d sumV(0.0);
const Vec8d addV(8.0);
const Vec8d oneV(1.0);
const Vec8d startdivV = divV;
bool first_loop = true;
#pragma omp declare reduction( + : Vec8d : omp_out = omp_out + omp_in ) initializer (omp_priv=omp_orig)
//It's important to mark "first_loop" variable as firstprivate so that each private copy gets initialized.
#pragma omp parallel for firstprivate(first_loop) lastprivate(divV) reduction(+:sumV)
for(i=0; i<N; ++i) {
if (first_loop) {
divV = startdivV + i * addV;
first_loop = false;
} else {
divV += addV;
}
sumV += oneV / divV;
}
return horizontal_add(sumV);
}