So the problem I came across was more or less the following:
I have a relatively large set of data consisting of pairs of an identifier and an associated value. There are relatively few distinct identifiers, but they are arbitrary.
In C++ this might look like an std::vector<std::pair<size_t, double>>. I now want to generate an std::map that contains the sum over all values for each identifier, so in this case an std::map<size_t, double>.
So for an input of
std::vector<std::pair<size_t, double>> typeDoubleVec{{1, 2.}, {3, 4.}, {1, 3.}, {3, 5.}, {2, 1.}};
I would like a map equal to:
std::map<size_t, double> result{{1, 5.}, {2, 1.}, {3, 9.}};
A serial function that does this job could look like the following; the second input vector specifies which identifiers exist:
std::map<size_t, double> foo(const std::vector<std::pair<size_t, double>> &typeDoubleVec,
                             const std::vector<size_t> &typeVec) {
  // create the map that contains our return values.
  std::map<size_t, double> mymap;
  // ensure that mymap contains all identifiers.
  for (auto &elem : typeVec) {
    mymap[elem];
  }
  // iterate over the elements and sum the values per identifier.
  for (size_t i = 0; i < typeDoubleVec.size(); ++i) {
    mymap.at(typeDoubleVec[i].first) += typeDoubleVec[i].second;
  }
  return mymap;
}
Does anyone know how to speed this up with OpenMP? The way I see it, this requires a custom OpenMP reduction, is that correct?
So I came across the answer myself: There are two ways to do this, one with a custom reduction and one with a critical section. I would currently recommend the latter, mainly because the former is broken on the current clang compiler (v9.0.0; a fix is already in trunk/master).
The first way to solve the problem is to use an OpenMP reduction, which normally looks like this:
// this does not work for maps!
#pragma omp parallel for reduction(+: mymap)
This code snippet will not compile, as the built-in reduction + is not defined for an std::map.
Instead, we will have to define our own reduction. A quick look into the OpenMP specification (https://www.openmp.org/spec-html/5.1/openmpsu117.html#x152-1790002.21.5.7) reveals the following syntax for defining custom reductions:
#pragma omp declare reduction(reduction-identifier : typename-list :
combiner) [initializer-clause] newline
Here, reduction-identifier is the name of the reduction and typename-list is the list of types for which it is defined, in our case std::map<size_t, double>. The combiner is an expression that takes omp_in and omp_out as input and should store the combined result in omp_out. For the simple + reduction, this is omp_out += omp_in.
The initializer-clause is optional and has the form initializer(expression). If it is missing, the thread-local copies of the reduction variable are default initialized. If it is present, the expression must have the form omp_priv = initializer or omp_priv = function-name(argument-list). It may also use omp_orig, which corresponds to the initial value of the reduction variable.
In this case, we want to add the values of two maps with the same keys. This can be done in a function like this:
void mapAdd(std::map<size_t, double> &inout, std::map<size_t, double> &in) {
  // note: the lockstep iteration assumes that both maps contain exactly the same keys!
  for (auto initer = in.begin(), outiter = inout.begin(); initer != in.end();
       ++initer, ++outiter) {
    outiter->second += initer->second;
  }
}
As previously described, the thread-local variables are normally default initialized. For an std::map, however, default initialization is not what we want: each thread-local copy should instead be initialized with the already existing map, so that it contains all keys. This can be specified inside of the initializer, and our pragma thus looks as follows:
#pragma omp declare reduction( \
mapAdd : \
std::map<size_t, double> : \
mapAdd(omp_out, omp_in) \
) \
initializer (omp_priv=omp_orig)
and it can be used for the omp parallel for from above:
#pragma omp parallel for reduction(mapAdd : mymap)
for (size_t i = 0; i < typeDoubleVec.size(); ++i) {
  mymap.at(typeDoubleVec[i].first) += typeDoubleVec[i].second;
}
This does not work with the current clang compiler. It compiles just fine, but produces a segmentation fault. After digging around a bit I discovered that this was not my fault, but a compiler bug, as the same program works fine with the current gcc and intel compilers.
Additionally, the clang compiler has problems (undefined references) when an OpenMP reduction is declared inside of a template function, as it does not instantiate all functions needed by the OpenMP reduction.
See also the following issues:
The use of lambdas inside of custom OpenMP reductions is not specified in the standard (according to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60228). I therefore cannot recommend doing this. It does, however, work with the current intel compiler, and will work with the next release of the clang compiler (either 9.0.1 or 10). GCC does not support it yet (see: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60228).
One way to circumvent the need for a custom reduction is to reproduce it on a very basic level: we create a local copy of the map for each thread and then accumulate the results manually inside of a critical section. This has the advantage that it is a bit easier to read, but it can be slower than the solution with a custom reduction, as no fan-in is implemented.
A solution with this approach would look like this:
template <class vectype, class typevectype>
std::map<size_t, double> foo(const vectype &typeDoubleVec,
                             const typevectype &typevec) {
  std::map<size_t, double> mymap;
  // ensure that mymap has all elements of typevec.
  for (auto &elem : typevec) {
    mymap[elem];
  }
#pragma omp parallel default(none) shared(typeDoubleVec, mymap)
  {
    // we create a local copy of mymap for each thread!
    auto localmap = mymap;
    // iterate over the vector and add the values to the local map
#pragma omp for
    for (size_t i = 0; i < typeDoubleVec.size(); ++i) {
      localmap.at(typeDoubleVec[i].first) += typeDoubleVec[i].second;
    }
    // sum the local maps into the global map using a critical section
#pragma omp critical
    mapAdd(mymap, localmap);
  }
  return mymap;
}
The code using the custom reductions can be found at https://gist.github.com/SteffenSeckler/404c214bcccf506d261264672e2b9341
The code using the critical section at https://gist.github.com/SteffenSeckler/91943b881677f3cbe7b2d7d475471ee8
Thanks for the feedback on splitting this up to Q+A