c++ multithreading concurrency ppl

Concurrency::parallel_for (PPL) is creating too many threads


I'm using Concurrency::parallel_for() from Visual Studio 2010's Parallel Patterns Library (PPL) to process an indexed set of tasks (typically, the index set is much larger than the number of threads that can run simultaneously). Before doing a lengthy calculation, each task starts by requesting a private working storage resource from a shared resource manager (in my case, a view of a task-specific memory-mapped file, but I think the story would be the same if each task requested a private memory allocation from a shared heap).

Access to the shared resource manager is synchronized with a Concurrency::critical_section, and this is where the problem starts: if a first thread/task is in the critical section and a second task makes a request, the second has to wait until the first task's request is handled. The PPL apparently then thinks: hey, this thread is waiting and there are more tasks to do, so it creates another thread. The result is up to 870 threads, most of them waiting at the same resource manager.

Since handling the resource request is only a small part of the whole task, I would like to tell the PPL to hold its horses during that part: none of the waits or cooperative blocks within an indicated section of a worker thread should cause new threads to be started. My question is whether, and how, I can prevent a specific section of a thread from triggering the creation of new threads, even if it blocks cooperatively. I wouldn't mind new threads being created at other blocking points further down the thread's processing path, but no more than, say, twice the number of (hyper-threaded) cores.

Alternatives that I have considered so far:

  1. Queue up the tasks and process the queue from a limited number of threads. Issue: I had hoped PPL's parallel_for would do that by itself.

  2. Define a Concurrency::combinable<Resource> resourceSet; outside the Concurrency::parallel_for and initialize resourceSet.local() once per thread, which reduces the number of resource requests (by reusing the resources) to the number of threads (which should be less than the number of tasks). Issue: this optimization doesn't prevent the superfluous thread creation.

  3. Pre-allocate the required resources for each task outside the parallel_for loop. Issue: this would request too many system resources, whereas limiting the amount of resources to the number of threads/cores would be OK (if the thread count didn't explode).
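Alternative 1 can be sketched with the standard library alone. This is only a minimal illustration under stated assumptions, not PPL code: bounded_parallel_for is a hypothetical helper name, and it assumes a C++11 compiler with <thread> and <atomic>, which as far as I know Visual Studio 2010 does not ship. The point it shows is that a fixed set of workers pulling indices from a shared counter can never grow beyond the requested thread count, no matter how often a task blocks.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper (not part of PPL): runs body(i) for every i in
// [first, last) on exactly `workers` threads. Each worker claims the next
// unprocessed index from a shared atomic counter, so a worker that blocks
// inside body() never causes an additional thread to be created.
// Note: body must not throw; an escaping exception would terminate the program.
inline void bounded_parallel_for(std::size_t first, std::size_t last,
                                 unsigned workers,
                                 const std::function<void(std::size_t)>& body)
{
    assert(workers > 0 && "need at least one worker thread");

    std::atomic<std::size_t> next(first);
    std::vector<std::thread> pool;
    pool.reserve(workers);

    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&next, last, &body] {
            for (;;) {
                // Atomically claim one index; stop once the range is exhausted.
                const std::size_t i = next.fetch_add(1);
                if (i >= last) break;
                body(i);
            }
        });
    }

    // Wait for all workers before returning, as parallel_for would.
    for (auto& t : pool) t.join();
}
```

A call such as bounded_parallel_for(0, taskCount, 2 * std::thread::hardware_concurrency(), processTask) would cap the thread count at twice the number of hardware threads; taskCount and processTask are placeholders for the real task set, and hardware_concurrency() may return 0 on some platforms, so that value would need a fallback.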

I read http://msdn.microsoft.com/en-us/library/ff601930.aspx, section "Do Not Block Repeatedly in a Parallel Loop", but following the advice there would result in no parallel threads at all.


Solution

  • I do not know whether it is possible to configure PPL/ConcRT not to use cooperative synchronization, or at least to put a limit on the number of threads it creates. I thought this might be controlled via scheduler policies, but none of the policy parameters seems to suit the purpose.

    However I have some suggestions you might find useful to mitigate the problem, even if not in the ideal way:

    • Instead of critical_section, use a non-cooperative synchronization primitive to protect the resource manager. I think (though I have not checked) that the classical WinAPI CRITICAL_SECTION should work. As a more radical step in this direction, you may consider other parallel libraries for your code; e.g. Intel's TBB provides most of the PPL API and more (disclaimer: I'm affiliated with it).

    • Pre-allocate a number of resources outside the parallel loop. One resource per task is not necessary; one per thread should be sufficient. Put these resources into a concurrent_queue, and inside a task pop a resource from the queue, use it, and then push it back. Alternatively, instead of returning the resource to the queue, a thread might hoard it inside a combinable object for reuse in other tasks. If the queue happens to be empty (e.g. if PPL oversubscribes the machine), there are different ways to proceed, e.g. spinning in a loop until some other thread returns a resource, or requesting another resource from the manager. You may also choose to pre-allocate more resources than the number of threads to minimize the chance of resource exhaustion.
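The pre-allocation idea from the last bullet might look roughly like the sketch below. It is written in portable C++11 so that it stands alone: a std::vector guarded by a std::mutex is a stand-in for the concurrent_queue (whose push and try_pop would play the same role under ConcRT), and ResourcePool and Lease are hypothetical names, not PPL types. When the pool is empty, acquire() takes the "spin until some other thread returns a resource" route.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size pool of pre-allocated resources. Resource stands in
// for whatever the resource manager hands out (e.g. a mapped file view).
template <typename Resource>
class ResourcePool {
public:
    explicit ResourcePool(std::vector<Resource> initial)
        : free_(std::move(initial))
    {
        // An empty pool would make acquire() spin forever.
        assert(!free_.empty());
    }

    // Take a resource; if none is free, yield and retry until a task returns one.
    Resource acquire() {
        for (;;) {
            {
                std::lock_guard<std::mutex> lock(mutex_);
                if (!free_.empty()) {
                    Resource r = std::move(free_.back());
                    free_.pop_back();
                    return r;
                }
            }
            std::this_thread::yield();  // lock is released while waiting
        }
    }

    // Hand a resource back so that another task can reuse it.
    void release(Resource r) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(std::move(r));
    }

    std::size_t available() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return free_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<Resource> free_;
};

// RAII helper: acquires on construction and returns the resource on scope exit.
template <typename Resource>
class Lease {
public:
    explicit Lease(ResourcePool<Resource>& pool)
        : pool_(pool), res_(pool.acquire()) {}
    ~Lease() { pool_.release(std::move(res_)); }
    Lease(const Lease&) = delete;
    Lease& operator=(const Lease&) = delete;
    Resource& get() { return res_; }

private:
    ResourcePool<Resource>& pool_;
    Resource res_;
};
```

Inside the loop body, a task would then create a Lease at the top of its scope and work through get(). The only lock held is for the few instructions that pop or push a resource, so the slow request to the resource manager happens once per pooled resource, outside the parallel loop.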