c++ · multithreading · clang · openmp

Custom OpenMP reduction on std::map


So the problem I came across was more or less the following:

I have a relatively large set of data that always consists of pairs of an identifier and a value related to it. There are relatively few distinct, but arbitrary, identifiers.

In C++ this might look like an std::vector<std::pair<size_t, double>>.

I now want to generate an std::map that gives, for each identifier, the sum over all of its values, so in this case std::map<size_t, double>.

So for an input of

std::vector<std::pair<size_t, double>> typeDoubleVec{{1, 2.}, {3, 4.}, {1, 3.}, {3, 5.}, {2, 1.}};

I would like a map equal to:

std::map<size_t, double> result{{1, 5.}, {2, 1.}, {3, 9.}};

A function that does this job would look like the following; the second input vector specifies which identifiers exist:

std::map<size_t, double> foo(const std::vector<std::pair<size_t, double>> &typeDoubleVec, const std::vector<size_t> &typeVec) {
  // create the map that contains our return values.
  std::map<size_t, double> mymap;

  // ensure that mymap contains all identifiers.
  for (auto &elem : typeVec) {
    mymap[elem];
  }

  // iterate over the elements
  for (size_t i = 0; i < typeDoubleVec.size(); ++i) {
    mymap.at(typeDoubleVec[i].first) += typeDoubleVec[i].second;
  }
  return mymap;
}

Does anyone know how to speed this up with OpenMP? The way I see this working is that you need a custom OpenMP reduction?


Solution

  • So I came across the answer myself: there are two ways to do this, one with a custom reduction and one with a critical section. I would currently recommend the latter, mainly because the former is broken in the current clang compiler (v9.0.0; a fix is already in trunk/master).

    OpenMP parallel for with custom reduction

    The first way to solve the problem is to use an OpenMP reduction, which normally looks like this:

    // this does not work for maps!
    #pragma omp parallel for reduction(+: mymap)
    

    This code snippet will not compile, as the built-in reduction + is not defined for std::map.

    Instead, we have to define our own reduction. A quick look into the OpenMP specification (https://www.openmp.org/spec-html/5.1/openmpsu117.html#x152-1790002.21.5.7) reveals the following syntax for defining custom reductions:

    #pragma omp declare reduction(reduction-identifier : typename-list : 
    combiner) [initializer-clause] newline
    
    • reduction-identifier: this is a name we can specify for our custom reduction.
    • typename-list: this is a list of type-names for which this reduction is defined. For us this is std::map<size_t, double>
    • combiner: this is the expression that does the actual reduction. It takes omp_in and omp_out as input and should store the combined result in omp_out. For the simple + reduction, this is omp_out += omp_in.
    • initializer-clause: an optional expression of the form initializer(expression). If it is omitted, the thread-local copies of the reduction variable are default initialized. If present, the expression must have the form omp_priv = initializer or omp_priv = function-name(argument-list). It may also use omp_orig, which corresponds to the initial value of the reduction variable.
    • a newline is required at the end of the pragma.

    In this case, we want to add the values of two maps with the same keys. This can be done in a function like this:

    void mapAdd(std::map<size_t, double> &inout, std::map<size_t, double> &in) {
      // Both maps must contain exactly the same keys: std::map iterates in
      // sorted key order, so the two iterators always point to matching keys.
      for (auto initer = in.begin(), outiter = inout.begin(); initer != in.end(); ++initer, ++outiter) {
        outiter->second += initer->second;
      }
    }
    

    As previously described, the thread-local variables are normally default initialized. For an std::map, however, default initialization is not what we want: each thread-local copy should instead be initialized with the already existing map, so that all keys are present. This is specified inside the initializer clause, and our pragma thus looks as follows:

    #pragma omp declare reduction(                             \
                                  mapAdd :                     \
                                  std::map<size_t, double> :   \
                                  mapAdd(omp_out, omp_in)      \
                                 )                             \
                        initializer (omp_priv=omp_orig)
    

    and it can be used for the omp parallel for from above:

    #pragma omp parallel for reduction(mapAdd : mymap)
      for (size_t i = 0; i < typeDoubleVec.size(); ++i) {
        mymap.at(typeDoubleVec[i].first) += typeDoubleVec[i].second;
      }
    

    The downside

    This does not work with the current clang compiler. It compiled just fine, but it produced a segmentation fault, and after digging around a bit I discovered that this was not my fault but a compiler bug, as the same program worked fine with the current gcc and intel compilers.

    Additionally, the clang compiler has problems (undefined reference) when an OpenMP reduction is declared inside of a template function, as it does not instantiate all functions needed inside of the OpenMP reduction.


    Using lambdas inside of the custom reduction

    The use of lambdas inside of custom OpenMP reductions is not specified in the standard (according to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60228). I therefore cannot recommend doing this. It does, however, work with the current intel compiler, and it will work with the next release of the clang compiler (either 9.0.1 or 10). GCC does not yet support it (see: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60228).

    Using critical sections

    One way to circumvent the need for reductions is to reproduce them on a very basic level: we create a local copy for each thread and then accumulate the results manually inside a critical section. This has the advantage that it is a bit easier to read, but it can be slower than the solution with custom reductions, as no fan-in is implemented.

    A solution with this approach would look like this:

    template <class vectype, class typevectype>
    std::map<size_t, double> foo(const vectype &typeDoubleVec,
                                 const typevectype &typevec) {
      std::map<size_t, double> mymap;
    
      // ensure that mymap has all elements of myvec.
      for (auto &elem : typevec) {
        mymap[elem];
      }
    
    #pragma omp parallel default(none) shared(typeDoubleVec, mymap)
      {
        // we create a local copy of mymap for each thread!
        auto localmap = mymap;
    #pragma omp for
        // iterate over the vector and add them to the local map
        for (size_t i = 0; i < typeDoubleVec.size(); ++i) {
          localmap.at(typeDoubleVec[i].first) += typeDoubleVec[i].second;
        }
    #pragma omp critical
        // sum up the local map to the global map using a critical section
        mapAdd(mymap, localmap);
      }
      return mymap;
    }
    

    The code

    The code using the custom reductions can be found at https://gist.github.com/SteffenSeckler/404c214bcccf506d261264672e2b9341

    The code using the critical section at https://gist.github.com/SteffenSeckler/91943b881677f3cbe7b2d7d475471ee8

    P.S.

    Thanks for the feedback on splitting this up to Q+A