Search code examples
c++parallel-processingopenmpavxvector-class-library

How to use Vector Class Library for AVX vectorization together with the openmp #pragma omp parallel for reduction?


I'm using OpenMP to parallelize the loop, that is internally using AVX-512 with Agner Fog's VCL Vector Class Library.

Here is the code:

double HarmonicSeries(const unsigned long long int N) {
  unsigned long long int i;
  Vec8d divV(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0);
  Vec8d sumV(0.0);
  const Vec8d addV(8.0);
  const Vec8d oneV(1.0);

  #pragma omp parallel for reduction(+:sumV,divV)
  for(i=0; i<N; ++i) {
    sumV += oneV / divV;
    divV += addV;
  }
  return horizontal_add(sumV);
}

When trying to compile the code above, I'm getting

g++ -Wall -Wextra -O3 -g -I include -fopenmp -m64 -mavx2 -mfma -std=c++17 -o harmonic_series harmonic_series.cpp
harmonic_series.cpp:87:40: error: user defined reduction not found for ‘sumV’
   87 |   #pragma omp parallel for reduction(+:sumV,divV)
      |                                        ^~~~
harmonic_series.cpp:87:45: error: user defined reduction not found for ‘divV’
   87 |   #pragma omp parallel for reduction(+:sumV,divV)

Any hints on how to solve this and provide the user-defined reduction for the Vec8d class? It's simply the plus operator which is defined by the VCL class, but I cannot find any example how to code this.

Thanks a lot for any help!


Solution

  • Here is the final solution. It's using automatic reduction and avoids u64->fp conversion by computing divV = startdivV + i * addV only in the first iteration of each loop and then using divV += addV for all other iterations. Runtime to compute the sum of first 9.6e10 elements is {real 9s, user 1m46s} with 12 threads on Intel Core i7-10850H CPU - it's the same as for the manual reduction.

    double HarmonicSeries(const unsigned long long int N) {
      unsigned long long int i;
      Vec8d divV(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0);
      Vec8d sumV(0.0);
      const Vec8d addV(8.0);
      const Vec8d oneV(1.0);
      const Vec8d startdivV = divV;
      bool first_loop = true;
      #pragma omp declare reduction( + : Vec8d : omp_out = omp_out + omp_in ) initializer (omp_priv=omp_orig)
    //It's important to mark "first_loop" variable as firstprivate so that each private copy gets initialized.
      #pragma omp parallel for firstprivate(first_loop) lastprivate(divV) reduction(+:sumV)
      for(i=0; i<N; ++i) {
        if (first_loop) {
          divV = startdivV + i * addV;
          first_loop = false;
        } else {
          divV += addV;
        }
        sumV += oneV / divV;
      }
      return horizontal_add(sumV);
    }