python, multiprocessing, python-multiprocessing

Why is Pool slower than the same number of Processes?


I recently tried refactoring some parallel processes into a pool and was surprised that the pool took almost twice as long as pure processes. Please assume they are running on the same machine with the same number of cores. I hope that someone can explain why my implementation using Pool is taking longer and perhaps offer some advice:

Shared dependency:

https://github.com/taynaud/python-louvain

from community import best_partition

Here is the faster implementation using Process. [UPDATE] Refactored to cap the number of active processes at the same count as the Pool implementation; it is still faster:

from multiprocessing import Pipe, Process

processes = []
pipes = []

def _get_partition(send_end):
    # Each worker sends its partition back through the write end of a Pipe.
    send_end.send(best_partition(a_graph, resolution=res, randomize=rand))

for idx in range(iterations):
    recv_end, send_end = Pipe(False)
    p = Process(target=_get_partition, args=(send_end,))
    processes.append(p)
    pipes.append(recv_end)

running_procs = []
finished_procs = []
while len(finished_procs) < iterations:
    # Start new processes until n_cores are running at once.
    while len(running_procs) < n_cores and processes:
        proc = processes.pop()
        proc.start()
        running_procs.append(proc)

    # Move processes that have exited into the finished list; iterating over
    # a copy avoids skipping entries while removing from the list.
    for proc in running_procs[:]:
        if not proc.is_alive():
            running_procs.remove(proc)
            finished_procs.append(proc)

for p in finished_procs:
    p.join()

partitions = [pipe.recv() for pipe in pipes]

And here is the slower Pool implementation. It is still slower no matter how many processes the pool is given:

from multiprocessing import Pool

pool = Pool(processes=n_cores)
# Submit one best_partition task per iteration; each call passes a_graph
# as an argument to the worker.
results = [
    pool.apply_async(
        best_partition,
        (a_graph,),
        dict(resolution=res, randomize=rand)
    ) for _ in range(iterations)
]
partitions = [r.get() for r in results]
pool.close()
pool.join()

Solution

  • Usually, when there is a difference between a pool and a bunch of processes (it can work out in favour of either), it is your data set and the task performed that determine the outcome.

    Without knowing what your a_graph is, I will make a wild guess that it is something big. In your Process model you rely on the in-memory copy of it that your subprocesses receive. In your Pool model you transmit a copy of a_graph as an argument to each worker every time one is called, which is in practice implemented as a queue. In the Process model, each subprocess gets its copy at the C level when the Python interpreter calls fork(); that is much faster than transmitting a large Python object, dictionary, array or whatever it is, via a queue. (A sketch of one way to avoid that per-task transfer follows after this answer.)

    The reverse would be true if the tasks took only a minuscule time to complete. In that case Pool is the better-performing solution, because Pool hands tasks to already running processes, which do not have to be recreated for each task. There, the overhead of creating lots of new processes that each run for only a fraction of a second is what slows the Process implementation down. (A toy illustration of this case also follows below.)

    As I said, this is pure speculation, but in your examples there is a significant difference in what you actually transmit as a parameter to your workers, and that might be the explanation.
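
    To illustrate the first point: below is a minimal sketch (not the original code) of giving the Pool the large graph once per worker through an initializer, so that each task only carries the small per-call parameters. The names a_graph, res, rand, iterations and n_cores come from the question; _init_worker, _partition_task and the placeholder inputs are made up for this example.

    from multiprocessing import Pool

    import networkx as nx
    from community import best_partition

    # Placeholder inputs so the sketch runs on its own; substitute your own
    # graph and parameters.
    a_graph = nx.karate_club_graph()
    res = 1.0
    rand = True
    iterations = 20
    n_cores = 4

    _worker_graph = None  # set once in each worker by the initializer

    def _init_worker(graph):
        # Runs once per worker process; with a fork start method the graph is
        # inherited, with spawn it is pickled once per worker rather than once
        # per task.
        global _worker_graph
        _worker_graph = graph

    def _partition_task(resolution, randomize):
        # Only the small per-call parameters travel with each task.
        return best_partition(_worker_graph, resolution=resolution, randomize=randomize)

    if __name__ == "__main__":
        pool = Pool(processes=n_cores, initializer=_init_worker, initargs=(a_graph,))
        results = [
            pool.apply_async(_partition_task, (res, rand))
            for _ in range(iterations)
        ]
        partitions = [r.get() for r in results]
        pool.close()
        pool.join()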
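
    And to illustrate the reverse case, here is a toy sketch (hypothetical workload, nothing from the question) of why a reusable Pool tends to win when each task is tiny: the per-task Process version pays the process-creation cost on every call, while the Pool pays it once per worker.

    import time
    from multiprocessing import Pool, Process, Queue

    def tiny_task(x):
        # A deliberately trivial task so that process start-up dominates.
        return x * x

    def _proc_worker(x, q):
        q.put(tiny_task(x))

    if __name__ == "__main__":
        n_tasks = 200

        # One short-lived Process per task: pays the fork/spawn cost every time.
        start = time.perf_counter()
        q = Queue()
        per_process_results = []
        for x in range(n_tasks):
            p = Process(target=_proc_worker, args=(x, q))
            p.start()
            p.join()
            per_process_results.append(q.get())
        print("per-task processes:", time.perf_counter() - start)

        # A pool of long-lived workers: creation cost is paid once per worker.
        start = time.perf_counter()
        with Pool() as pool:
            pool_results = pool.map(tiny_task, range(n_tasks))
        print("reused pool workers:", time.perf_counter() - start)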