So I am filtering big images using scipy's fftconvolve, and I wanted to parallelize the different filterings I am doing for a single image. For the parallelization I wanted to use joblib. However, I am bugged by two results I have:

- parallelizing with multiprocessing is slower than the plain sequential run,
- parallelizing with threading is faster than the sequential run.

I am surprised by these two results, as I was confident that the convolution was CPU-bound.
Here is the code I used in a Jupyter notebook to compute the runtimes:
from joblib import Parallel, delayed
import numpy as np
from scipy.signal import fftconvolve
im_size = (512, 512)
filter_size = tuple(s-1 for s in im_size)
n_filters = 3
image = np.random.rand(*im_size)
filters = [np.random.rand(*filter_size) for i in range(n_filters)]
%%timeit
s = np.sum(
    Parallel(n_jobs=n_filters, backend='multiprocessing')(
        delayed(fftconvolve)(image, f) for f in filters
    )
)
283 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
s = np.sum(
    Parallel(n_jobs=n_filters, backend='threading')(
        delayed(fftconvolve)(image, f) for f in filters
    )
)
142 ms ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
s = np.sum([fftconvolve(image, f) for f in filters])
198 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I also tried different things, like putting the image in a memmap or reducing the number of pre-dispatched jobs, but nothing changed the results fundamentally.
Why isn't multiprocessing speeding up the computation when multithreading is?
The issue with benchmarking parallel processing is that you have to properly account for the overhead caused in your code to be able to draw the correct conclusion. There are three sources of overhead when using parallel processing:

- Spawning threads or processes: this is something that is done each time you call Parallel, except if you rely on a managed Parallel object (with the with context) or when you use the loky backend; see the sketch after this list, and the joblib documentation on reusing pools of workers for more info.
- Importing modules in fresh interpreters: for backends that rely on fresh processes (when the start method is not fork), all the modules need to be re-imported, which can cause an overhead.
- Communication between the processes: when using processes (so not with backend='threading'), you need to communicate the arrays to each worker. This communication can slow down the computation, especially for short tasks with big inputs such as fftconvolve. (The threading backend avoids this entirely, and it can still give a speedup here because SciPy's compiled FFT routines release the GIL during the heavy computations.)
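As a quick sketch of the first point (with toy sizes, separate from the benchmark below), the managed Parallel pattern looks like this: the workers are spawned once when entering the with block and reused by every call inside it, so only the first call pays the spawning overhead.

from joblib import Parallel, delayed
import numpy as np
from scipy.signal import fftconvolve

if __name__ == "__main__":  # needed when the start method is not `fork`
    image = np.random.rand(512, 512)
    filters = [np.random.rand(5, 5) for _ in range(3)]

    # The pool of workers is created once, on entering the context
    with Parallel(n_jobs=3, backend='multiprocessing') as parallel:
        for _ in range(10):
            # Each call reuses the already-started workers
            s = np.sum(parallel(
                delayed(fftconvolve)(image, f, mode='valid') for f in filters
            ))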
If your goal is to call this function a large number of times, you should modify your benchmark to actually remove the cost of spawning the workers for the Parallel object, either by using a managed Parallel object or by relying on this functionality of backend='loky', and to avoid the overhead due to the loading of the modules:
from joblib import Parallel, delayed
import numpy as np
from scipy.signal import fftconvolve
from time import time, sleep
# Dummy task used to start the pool of workers without doing real work
def start_processes(im, filter, mode=None, delay=0):
    sleep(delay)
    return im if im is not None else 0
def time_parallel(name, parallel, image, filters, n_rep=50):
    print(80*"=" + "\n" + name + "\n" + 80*"=")

    # Time to start the pool of workers and initialize the processes.
    # With this first call, the processes/threads are actually started
    # and further calls will not incur this overhead anymore.
    t0 = time()
    np.sum(parallel(
        delayed(start_processes)(image, f, mode='valid') for f in filters)
    )
    print(f"Pool init overhead: {(time() - t0) / 1e-3:.3f}ms")

    # Time the overhead due to the loading of the scipy module.
    # With this call, the scipy.signal module is loaded in the child
    # processes. This import can take up to 200ms for a fresh interpreter.
    # This overhead is only present for the `loky` backend. For the
    # `multiprocessing` backend, as the processes are started with `fork`,
    # they already have a loaded scipy module. For the `threading` backend
    # and the iterative run, there is no need to re-import the module, so
    # this overhead is non-existent.
    t0 = time()
    np.sum(parallel(
        delayed(fftconvolve)(image, f, mode='valid') for f in filters)
    )
    print(f"Library load overhead: {(time() - t0) / 1e-3:.3f}ms")

    # Average the runtime over multiple runs, once the external overheads
    # have been taken into account.
    times = []
    for _ in range(n_rep):
        t0 = time()
        np.sum(parallel(
            delayed(fftconvolve)(image, f, mode='valid') for f in filters
        ))
        times.append(time() - t0)
    print(f"Runtime without init overhead: {np.mean(times) / 1e-3:.3f}ms"
          f" (+-{np.std(times) / 1e-3:.3f}ms)\n")
# Setup the problem size
im_size = (512, 512)
filter_size = tuple(5 for _ in im_size)
n_filters = 3
n_jobs = 3
n_rep = 50
# Generate random data
image = np.random.rand(*im_size)
filters = np.random.rand(n_filters, *filter_size)
# Time the `backend='multiprocessing'`
with Parallel(n_jobs=n_jobs, backend='multiprocessing') as parallel:
    time_parallel("Multiprocessing", parallel, image, filters, n_rep=n_rep)
sleep(.5)

# Time the `backend='threading'`
with Parallel(n_jobs=n_jobs, backend='threading') as parallel:
    time_parallel("Threading", parallel, image, filters, n_rep=n_rep)
sleep(.5)
# Time the `backend='loky'`.
# For this backend, there is no need to rely on a managed `Parallel` object
# as loky reuses the previously created pool by default. We thus mimic the
# creation of a new `Parallel` object for each repetition.
def parallel_loky(it):
    return Parallel(n_jobs=n_jobs)(it)

time_parallel("Loky", parallel_loky, image, filters, n_rep=n_rep)
sleep(.5)
# Time the iterative run.
# We rely on the SequentialBackend of joblib, which is used whenever
# `n_jobs=1`, to allow using the same timing function. This should not
# change the computation much.
def parallel_iterative(it):
    return Parallel(n_jobs=1)(it)

time_parallel("Iterative", parallel_iterative, image, filters, n_rep=n_rep)
$ python main.py
================================================================================
Multiprocessing
================================================================================
Pool init overhead: 12.112ms
Library load overhead: 96.520ms
Runtime without init overhead: 77.548ms (+-16.119ms)
================================================================================
Threading
================================================================================
Pool init overhead: 11.887ms
Library load overhead: 76.858ms
Runtime without init overhead: 31.931ms (+-3.569ms)
================================================================================
Loky
================================================================================
Pool init overhead: 502.369ms
Library load overhead: 245.368ms
Runtime without init overhead: 44.808ms (+-4.074ms)
================================================================================
Iterative
================================================================================
Pool init overhead: 1.048ms
Library load overhead: 92.595ms
Runtime without init overhead: 47.749ms (+-4.081ms)
With this benchmark you can see that, once its workers have been started, the loky backend is actually faster than the iterative run. But if you don't call it multiple times, the start-up overhead is too large.
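To see this reuse in action, here is a minimal sketch assuming the default loky backend (the exact timings will depend on your machine): the first call pays the executor start-up and module import cost, while the subsequent calls with the same parameters reuse the already-running workers.

from time import time
from joblib import Parallel, delayed
import numpy as np
from scipy.signal import fftconvolve

image = np.random.rand(512, 512)
filters = [np.random.rand(5, 5) for _ in range(3)]

for i in range(3):
    t0 = time()
    # loky is the default backend; the executor created by the first
    # call is reused by the following ones
    Parallel(n_jobs=3)(
        delayed(fftconvolve)(image, f, mode='valid') for f in filters
    )
    print(f"call {i}: {(time() - t0) / 1e-3:.1f}ms")  # call 0 is the slow one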