Tags: python, recursion, parallel-processing, joblib

Why does recursively running `joblib.Parallel` increase the computation time?


What exactly happens internally when we run joblib.Parallel inside the function that we pass to delayed? Is it good coding practice? Why does it increase the computation time? Is it because of the overhead of managing (switching between) more parallel processes?

import time
from joblib import Parallel, delayed

def sqr(i):
    return i*i

def sqr_chunk(chunk): # processing chunks in parallel
    return Parallel(n_jobs=2)(delayed(sqr)(i) for i in chunk)

def sqr_sub_chunk(sub_chunk): # processing sub_chunks in parallel
    return Parallel(n_jobs=2)(delayed(sqr_chunk)(chunk) for chunk in sub_chunk)

def avg(l): # mean of a list of timings
    s = 0
    for i in l:
        s += i
    return s / len(l)


l0, l1, l2 = [], [], []
for i in range(20):
    l = list(range(1000))
    t1 = time.time()
    result1 = Parallel(n_jobs=2)(delayed(sqr)(i) for i in l) # flat: a single level of Parallel
    t2 = time.time()
    l0 += [t2 - t1]

    chunks = [list(range(i, i + 100)) for i in range(0, 1000, 100)]
    t1 = time.time()
    result2 = Parallel(n_jobs=2)(delayed(sqr_chunk)(chunk) for chunk in chunks) # nested: Parallel inside Parallel
    t2 = time.time()
    l1 += [t2 - t1]

    sub_chunks = [[i[:50], i[50:]] for i in chunks]
    t1 = time.time()
    result3 = Parallel(n_jobs=2)(delayed(sqr_sub_chunk)(sub_chunk) for sub_chunk in sub_chunks) # doubly nested
    t2 = time.time()
    l2 += [t2 - t1]

print(avg(l0))
print(avg(l1))
print(avg(l2))

"""
output : 
0.058841276168823245
0.14938125610351563
0.10537683963775635
"""

Solution

  • The Parallel function of joblib creates multiple jobs, which can be either threads or processes depending on the backend used. The overheads are explained in the documentation:

    By default joblib.Parallel uses the 'loky' backend module to start separate Python worker processes to execute tasks concurrently on separate CPUs. This is a reasonable default for generic Python programs but can induce a significant overhead as the input and output data need to be serialized in a queue for communication with the worker processes (see Serialization & Processes).

    In your case, the chunks (including each individual integer) need to be serialized/deserialized (typically using pickle), which is particularly expensive, not to mention the inter-process communication overhead.
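
    As a rough illustration (a minimal sketch, not part of your code), you can compare a plain loop with the process-based Parallel call on the same trivial task; the parallel version tends to be slower here because pickling every argument and result, plus the inter-process communication, costs far more than computing i*i:

    import time
    from joblib import Parallel, delayed

    def sqr(i):
        return i * i

    data = list(range(1000))

    t0 = time.time()
    serial = [sqr(i) for i in data]  # no serialization, no inter-process communication
    t1 = time.time()
    parallel = Parallel(n_jobs=2)(delayed(sqr)(i) for i in data)  # every call and result is pickled to/from a worker
    t2 = time.time()

    print("serial:  ", t1 - t0)
    print("parallel:", t2 - t1)  # usually slower for such cheap per-item work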

    There are other backends, like the ones based on threads, but they are limited by the GIL:

    “threading” is a very low-overhead backend but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects. “threading” is mostly useful when the execution bottleneck is a compiled extension that explicitly releases the GIL (for instance a Cython loop wrapped in a “with nogil” block or an expensive call to a library such as NumPy).

    In your case, the GIL is not released. Indeed, the GIL is needed for all pure-Python code (a strong limitation of the CPython interpreter).
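
    If you still want to try the thread-based backend, here is a minimal sketch (reusing a sqr function like yours): it removes the pickling cost, but pure-Python work still runs under the GIL, so no speed-up should be expected for this kind of function:

    from joblib import Parallel, delayed

    def sqr(i):
        return i * i

    # explicit backend selection; prefer="threads" is an equivalent soft hint
    result = Parallel(n_jobs=2, backend="threading")(delayed(sqr)(i) for i in range(1000))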

    There is no other practical way to parallelize pure-Python code. If you want faster code, then I advise you to try NumPy. This should actually be much faster than parallel Python code run with CPython, because CPython is a (slow) interpreter, while NumPy functions are mostly native ones. It can also be combined with Numba and Cython (e.g. to use multiple threads without the GIL). Be aware that the integer types of these modules are native, so they have a limited/fixed size (no more than 64 bits).
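
    A vectorized sketch of the NumPy alternative (illustrative, not a drop-in replacement for your chunked version): the squaring runs in native code over the whole array at once, with no per-element Python interpretation and no serialization:

    import numpy as np

    l = np.arange(1000, dtype=np.int64)  # fixed-size native integers (64-bit here)
    result = l * l                       # element-wise square, computed in native code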