multithreading parallel-processing intel parallelism-amdahl tbb-flow-graph

Intel TBB flowgraph overhead

Here is my attempt to benchmark the performance of Intel TBB flow graph. Here is the setup:

One broadcast node sending continue_msg to N successor nodes
( a broadcast_node<continue_msg> )
Each successor node performs a computation that takes t seconds.
The total computation time when performed serially is Tserial = N * t
The ideal computation time, if all cores are used, is Tpar(ideal) = N * t / C, where C is the number of cores.
The speedup is defined as Tpar(actual) / Tserial
I tested the code with gcc5 on a 16 core PC.

Here are the results showing the speedup as a function of the processing time of individually task ( i.e. body ):

t = 100 microsecond  ,   speed-up =  14
t  = 10 microsecond  ,   speed-up =   7
t  =  1 microsecond  ,   speed-up =   1

As can been for lightweight tasks ( whose computation takes less than 1 microseconds ), the parallel code is actually slower that the serial code.

Here are my questions:

1 ) Are these results inline with Intel TBB benchmarks?
2 ) It there a better paradigm, than a flow graph for the case, when there are thousands of tasks each taking less than 1 microsecond?

Solution

The Overhead of Parallelism

Your cost model is wrong.

The ideal parallel computation time is:

Tpar(ideal) = N * t / C + Pstart + Pend

where Pstart is how long it takes to start your parallelism and Pend is the time taken to finish it. It's not unusual for Pstart to be on the order of tens of milliseconds.

I'm not sure if you're familiar with OpenMP (though it's a good thing to know), but, for our purposes it's a threading model that divides work between a team of threads. The following chart shows the overhead of some of the commands in relation to the size of the thread team:

The takeaway is that getting your parallelism going (the parallel for line) is potentially expensive and grows with the number of threads. Ending parallelism (the barrier line) has similar costs.

In fact, if you look at TBB's tutorial, Section 3.2.2 ("Automatic Chunking") says:

CAUTION: Typically a loop needs to take at least a million clock cycles for parallel_for to improve its performance. For example, a loop that takes at least 500 microseconds on a 2 GHz processor might benefit from parallel_for.

Achieving Better Speed

The best way to speed up code like this is to only perform the operations in parallel if there are a large number of them and/or to adjust the number of threads doing the work so that each thread has plenty to do. In TBB you could achieve similar behaviour like so:

#include <tbb/parallel_for.h>

// . . .
if(work_size>1000)
  tbb::serial::parallel_for( . . . );
else
  tbb::parallel_for( . . . );
// . . .

where you'd want to adjust 1000 to a number high enough that you start to see gains from parallelism.

You could also reduce the number of threads, since this reduces the overhead somewhat:

tbb::task_scheduler_init init(nthread);

TBB also performs dynamic load balancing (see here). This is great if you expect your loop iterations/tasks to have a broad distribution of run-times, but represents unnecessary overhead if the expected run-times are the same. I'm not sure if TBB has static scheduling, but it may be worth looking into.

In case people end up here without a strong commitment to TBB, in OpenMP you'd do something like:

#pragma omp parallel for if(work_size>1000) num_threads(4) schedule(static)