python multiprocessing python-multiprocessing

why is more than one worker used in `multiprocessing.Pool().apply_async()`?

Problem

From the multiprocessing.Pool docs:

apply_async(func ...): A variant of the apply() method which returns a result object. ...

Reading further ...

apply(func[, args[, kwds]]): Call func with arguments args and keyword arguments kwds. It blocks until the result is ready. Given this blocks, apply_async() is better suited for performing work in parallel. Additionally, func is only executed in one of the workers of the pool.

The last bold line suggests only one worker from a pool is used. I find this is only true under certain conditions.

Given

Here is code that executes Pool.apply_async() in three similar cases. In all cases, the process id is printed.

import os
import time
import multiprocessing as mp


def blocking_func(x, delay=0.1):
    """Return a squared argument after some delay."""
    time.sleep(delay)                                  # toggle comment here
    return x*x, os.getpid()


def apply_async():
    """Return a list applying func to one process with a callback."""
    pool = mp.Pool()
    # Case 1: From the docs
    results = [pool.apply_async(os.getpid, ()) for _ in range(10)]
    results = [res.get(timeout=1) for res in results]
    print("Docs    :", results)

    # Case 2: With delay
    results = [pool.apply_async(blocking_func, args=(i,)) for i in range(10)]
    results = [res.get(timeout=1)[1] for res in results]
    print("Delay   :", results)

    # Case 3: Without delay
    results = [pool.apply_async(blocking_func, args=(i, 0)) for i in range(10)]
    results = [res.get(timeout=1)[1] for res in results]
    print("No delay:", results)

    pool.close()
    pool.join()


if __name__ == '__main__':
    apply_async()

Results

The example from the docs (Case 1) confirms only one worker is run. We extend this example in the next cases by applying blocking_func, which blocks with some delay.

Commenting the time.sleep() line in blocking_func() brings all cases in agreement.

# Time commented
# 1. Docs    : [8208, 8208, 8208, 8208, 8208, 8208, 8208, 8208, 8208, 8208]
# 2. Delay   : [8208, 8208, 8208, 8208, 8208, 8208, 8208, 8208, 8208, 8208]
# 3. No delay: [8208, 8208, 8208, 8208, 8208, 8208, 8208, 8208, 8208, 8208]

Each call to apply_async() creates a new process pool, which is why new process ids differ from the latter.

# Time uncommented
# 1. Docs    : [6780, 6780, 6780, 6780, 6780, 6780, 6780, 6780, 6780, 6780]
# 2. Delay   : [6780, 2112, 6780, 2112, 6780, 2112, 6780, 2112, 6780, 2112]
# 3. No delay: [6780, 2112, 6780, 2112, 6780, 2112, 6780, 2112, 6780, 2112]

However when time.sleep() is uncommented, even with zero delay, more than one worker is used.

In short, uncommented we expect one worker as in Case 1, but we get multiple workers as in Cases 2 and 3.

Question

Although I expect only one worker to be used by Pool().apply_async(), why are more than one used when time.sleep() is uncommented? Should blocking even effect the number of workers used by apply or apply_async?

Note: previous, related questions ask "why is only one worker used?" This question asks the opposite - "why isn't only one worker used?" I'm using 2 cores on a Windows machine.

Solution

Your confusion seems to come from thinking [pool.apply_async(...) for i in range(10)] is one call, when there are really ten independent calls. A call to any pool-method is a "job". A job generally can lead to one or multiple tasks being distributed. apply-methods always produce only a single task under the hood. A task is an indivisible unit of work which will be received as a whole by a random pool-worker.

There's only one shared inqueue, all workers are fed over. Which idling worker will be woken up from waiting to get() a task from that queue is up to the OS. Your result-entropy for case 1 is still somewhat surprising and probably very lucky, at least unless you confirm you have only two cores.

And yes, your observation for this run is also influenced by the computation time needed for a task, since threads (the scheduled execution unit within a process) usually are scheduled with time slicing policies (e.g. ~20ms for Windows).