
OpenMP reduction on container elements


I have a nested loop with a few outer and many inner iterations. In the inner loop, I need to calculate a sum, so I want to use an OpenMP reduction. The outer loop iterates over a container, so the reduction is supposed to happen on an element of that container. Here's a minimal contrived example:

#include <omp.h>
#include <vector>
#include <iostream>

int main(){
    constexpr int n { 128 };

    std::vector<int> vec (4, 0);
    for (unsigned int i {0}; i<vec.size(); ++i){

        /* this does not work */
        //#pragma omp parallel for reduction (+:vec[i])
        //for (int j=0; j<n; ++j)
        //  vec[i] +=j;

        /* this works */
        int* val { &vec[0] };
        #pragma omp parallel for reduction (+:val[i])
        for (int j=0; j<n; ++j)
            val[i] +=j;

        /* this is allowed, but looks very wrong. Produces wrong results
         * for std::vector, but on an Eigen type, it worked. */
        #pragma omp parallel for reduction (+:val[i])
        for (int j=0; j<n; ++j)
            vec[i] +=j;
    }
    for (unsigned int i=0; i<vec.size(); ++i) std::cout << vec[i] << " ";
    std::cout << "\n";

    return 0;
}

The problem is that if I write the reduction clause as (+:vec[i]), I get the error ‘vec’ does not have pointer or array type, which is descriptive enough to find a workaround. However, that means I have to introduce a new variable and change the code logic somewhat, and I find it less obvious to see what the code is supposed to do.

My main question is whether there is a better/cleaner/more standard way to write a reduction for container elements.

I'd also like to know why and how the third variant shown in the code above somewhat works. I'm actually working with the Eigen library, on whose containers that variant seems to work just fine (I haven't tested it extensively, though), but with std::vector it produces results somewhere between zero and the actual result (8128). I thought it should work, because vec[i] and val[i] should both evaluate to dereferencing the same address. But alas, apparently not.

I'm using OpenMP 4.5 and gcc 9.3.0.


Solution

  • I'll answer your question in three parts:

    1. What is the best way to perform OpenMP reductions in your example above with a std::vector?

    i) Use your approach, i.e. create a pointer int* val { &vec[0] };

    ii) Declare a new shared variable, as @1201ProgramAlarm answered (see the sketch after this list).

    iii) Declare a user-defined reduction (not really applicable in your simple case, but see 3. below for a more efficient pattern).
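
    For reference, here is a minimal sketch of option ii) applied to your example (the variable name sum is my own, not something prescribed by OpenMP): reduce into a plain scalar inside the parallel region and write it back to the container once afterwards.

    // option ii): reduce into an ordinary scalar, then store it in the container
    for (unsigned int i {0}; i < vec.size(); ++i){
        int sum { 0 };                        // plain variable, valid in a reduction clause
        #pragma omp parallel for reduction (+:sum)
        for (int j = 0; j < n; ++j)
            sum += j;
        vec[i] += sum;                        // single write-back, outside the parallel region
    }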

    2. Why doesn't the third loop work, and why does it work with Eigen?

    As the previous answer states, you are telling OpenMP to perform a reduction sum on memory address X, but you are performing additions on memory address Y, which means the reduction declaration is ignored and your additions are subject to the usual race conditions between threads.

    You don't really provide much detail into your Eigen venture, but here are some possible explanations:

    i) You're not actually using multiple threads (check n = Eigen::nbThreads()).

    ii) You didn't disable Eigen's own parallelism, which can interfere with your own use of OpenMP, e.g. via the EIGEN_DONT_PARALLELIZE preprocessor definition (see the sketch after this list).

    iii) The race condition is there, but you're not seeing it because Eigen operations take longer, you're using a small number of threads and only writing a few values => a lower chance of threads interfering with each other and producing a wrong result.
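
    To rule out i) and ii), a quick check could look like the following sketch (EIGEN_DONT_PARALLELIZE has to be defined before any Eigen header is included, or passed as -DEIGEN_DONT_PARALLELIZE on the compile line; compile with -fopenmp):

    #define EIGEN_DONT_PARALLELIZE   // keep Eigen from using its own threads
    #include <Eigen/Core>
    #include <omp.h>
    #include <iostream>

    int main(){
        // threads Eigen would use for its internal parallelism
        std::cout << "Eigen threads:  " << Eigen::nbThreads() << "\n";
        // threads available to your own OpenMP regions
        std::cout << "OpenMP threads: " << omp_get_max_threads() << "\n";
        return 0;
    }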

    3. How should I parallelize this scenario using OpenMP? (technically not a question you asked explicitly)

    Instead of parallelizing only the inner loop, you should parallelize both at the same time. The less serial code you have, the better. In this scenario each thread has its own private copy of the vec vector, which gets reduced after all the elements have been summed by their respective threads. This solution is optimal for your presented example, but might run into RAM problems if you're using a very large vector and very many threads (or have very limited RAM).

    #pragma omp parallel for collapse(2) reduction(vsum : vec)
    for (unsigned int i {0}; i<vec.size(); ++i){
        for (int j = 0; j < n; ++j) {
            vec[i] += j;
        }
    }
    

    where vsum is a user-defined reduction, i.e.

    #pragma omp declare reduction(vsum : std::vector<int> : std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(), omp_out.begin(), std::plus<int>())) initializer(omp_priv = decltype(omp_orig)(omp_orig.size()))
    

    Declare the reduction before the function where you use it, and you'll be good to go.
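
    Putting the two snippets together, a complete minimal program could look like this sketch (it simply assembles the pieces above around the vector from your question; nothing beyond that is implied):

    #include <algorithm>
    #include <functional>
    #include <iostream>
    #include <vector>

    // element-wise sum of two vectors; each thread's private copy starts as a
    // zero-filled vector of the same size as the original
    #pragma omp declare reduction(vsum : std::vector<int> : std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(), omp_out.begin(), std::plus<int>())) initializer(omp_priv = decltype(omp_orig)(omp_orig.size()))

    int main(){
        constexpr int n { 128 };
        std::vector<int> vec (4, 0);

        #pragma omp parallel for collapse(2) reduction(vsum : vec)
        for (unsigned int i = 0; i < vec.size(); ++i)
            for (int j = 0; j < n; ++j)
                vec[i] += j;

        for (int v : vec) std::cout << v << " ";   // expect: 8128 8128 8128 8128
        std::cout << "\n";
        return 0;
    }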