The Threading Building Blocks (TBB) library provides two functions for performing a reduction over a range: parallel_reduce and parallel_deterministic_reduce.

Which of the two should I select if I want to perform the reduction as fast as possible, but still get exactly the same answer independently of the hardware concurrency and of the load from other processes or threads? I am basically interested in two scenarios:
And a side question. On the page about parallel_deterministic_reduce there is one warning: "Since simple_partitioner does not automatically coarsen ranges, make sure to specify an appropriate grain size." Does it mean that a call to parallel_deterministic_reduce with a range having no explicitly specified grain size will lead to poor performance? How should the grain size be set then?
parallel_reduce does not make any guarantees regarding the summation order. If used with floating-point numbers, the result is not deterministic, since summation of floating-point numbers is not associative.

In contrast, parallel_deterministic_reduce guarantees that the summation order is always the same, regardless of the number of threads used. But note that there is still no guarantee of a specific summation order, just that the order is deterministic (for example, the result can differ from that of std::accumulate).
Thus:

- Use parallel_reduce for best performance.
- Use parallel_deterministic_reduce if you need deterministic behavior.

Regarding the note that simple_partitioner
does not automatically coarsen ranges: I am not entirely sure why they mention this specifically in the documentation of parallel_deterministic_reduce. In all cases where you use simple_partitioner, you should think about appropriate grain sizes, not just in the case of parallel_deterministic_reduce. Too small a grain size can lead to large overhead. See for example here. In practice this especially means measuring the performance for typical workloads, and playing with the grain size or partitioner until performance is maximized.
I guess they just wanted to highlight this general issue again for parallel_deterministic_reduce, because parallel_deterministic_reduce supports only simple_partitioner and static_partitioner, but neither affinity_partitioner nor auto_partitioner (the latter usually being the default).