I'd like to run tbb::parallel_for over a large data set to generate a set of unique elements. Additional logic inside the parallel_for body determines whether each sub-element of the original data set should be included. The resulting set is typically much smaller than the original data set, so I would rather not build a vector containing duplicates and deduplicate it afterwards, as that would increase memory usage.
My first implementation uses tbb::concurrent_unordered_set, and profiling shows a significant performance bottleneck in the set.insert() method.
My idea for improving this is to use thread-local storage to build one set per thread, then combine the per-thread sets at the end, eliminating the atomic operations.
Despite reading a lot of the documentation, I remain unsure whether tbb::combinable or tbb::enumerable_thread_specific is better suited.
This must be a fairly common use case. Could someone provide an example implementation or point me at an example online I can look at?
I think you are going in the right direction. Concurrent hash tables are efficient only for large numbers of elements (thousands). Still, you can try to reserve enough capacity before running your algorithm and play with the load factor of the concurrent_unordered_set (set it to 1). Also try tbb::concurrent_hash_map: it is faster when using insert(value) without an accessor, and it likewise benefits from reserving some capacity up front.
Both tbb::combinable and tbb::enumerable_thread_specific use the same backend implementation; the difference is in the interface only. The documentation has an example for the latter, which I've restyled a bit:
#include <tbb/enumerable_thread_specific.h>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <utility>
#include <cstdio>

typedef tbb::enumerable_thread_specific< std::pair<int,int> > CounterType;
CounterType MyCounters(std::make_pair(0, 0));

int main() {
    tbb::parallel_for(tbb::blocked_range<int>(0, 100000000),
        [](const tbb::blocked_range<int> &r) {
            CounterType::reference my_counter = MyCounters.local();
            ++my_counter.first;
            my_counter.second += r.size();
        });

    std::pair<int,int> sum = MyCounters.combine(
        [](std::pair<int,int> x, std::pair<int,int> y) {
            return std::make_pair(x.first + y.first, x.second + y.second);
        });

    printf("Total calls to operator() = %d, "
           "total iterations = %d\n", sum.first, sum.second);
}
And finally, also try an alternative approach: tbb::parallel_reduce, where you don't need additional machinery like combinable. Moreover, the reduction is done mostly in parallel (only log P sequential steps, whereas combining thread-specific values requires sequentially visiting all P elements).