The scipy.optimize.minimize function with method="BFGS" (based on this) doesn't seem to use parallelization when computing the cost function or numerical gradient. However, when I run an optimization on a MacBook Air with an 8-core Apple M1 (see below for a minimal reproducible example), the top command shows 750% to 790% CPU usage, suggesting all 8 cores are used. This isn't always the case: on a supercomputer where each node has 40 cores I got 100% to 200% CPU usage, suggesting only 2 cores are used.
Does scipy.optimize.minimize use parallelization when computing the cost function/numerical gradient? If not, how can I get scipy.optimize.minimize to use all cores?
# basic optimization of the variational functional for a random symmetric matrix
import numpy as np
from numpy.random import uniform
from scipy.optimize import minimize
# generate random symmetric matrix to compute minimal eigenvalue of
N = 1000
H = uniform(-1, 1, [N,N])
H = H + H.T
# variational cost function
def cost(x):
    return (x @ H @ x) / (x @ x)
x0 = uniform(-0.1, 0.1, N)
# minimize variational function with BFGS
minimize(cost, x0, method='BFGS')
No, but the function being evaluated can use parallelization.
You might think that you're not using parallelization in this program. And you're not - at least not explicitly. However, many NumPy operations call out to your platform's BLAS library. Matrix multiplication is one of the operations that can be parallelized by BLAS.
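One way to confirm which BLAS library your NumPy calls out to, and how many threads it uses by default, is to inspect the active thread pools, for example with threadpoolctl (a rough sketch; the exact fields reported depend on your installation):
import numpy as np
from threadpoolctl import threadpool_info

# Trigger a BLAS call so the thread pool is initialized, then list the pools.
_ = np.ones((4, 4)) @ np.ones((4, 4))
for pool in threadpool_info():
    print(pool.get("internal_api"), pool.get("num_threads"), pool.get("filepath"))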
Some profiling shows that this program spends roughly 80% of its time doing matrix multiplies inside the cost() function.
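If you want to reproduce that kind of measurement, Python's built-in profiler is enough; a rough sketch (the exact percentages will depend on your machine and BLAS library):
import cProfile

# Profile one optimization run, sorted by cumulative time; the matrix
# multiplications inside cost() should dominate the report.
cProfile.run("minimize(cost, x0, method='BFGS')", sort='cumtime')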
You can check this possibility using the library threadpoolctl.
Example:
import time

from threadpoolctl import ThreadpoolController

controller = ThreadpoolController()

# Re-run the optimization with BLAS limited to 1-4 threads and time each run.
for i in range(1, 5):
    t0 = time.time()
    with controller.limit(limits=i, user_api='blas'):
        print(minimize(cost, x0, method='BFGS'))
    t = time.time()
    print(f"Threads {i}, Duration: {t - t0:.3f}")
By using htop, I confirmed that restricting BLAS parallelism also restricts the number of cores this program uses.
In the tests I ran, this program does not parallelize particularly well, which suggests that most of the extra CPU usage is being wasted.
Threads 1, Duration: 4.561
Threads 2, Duration: 3.631
Threads 3, Duration: 3.462
Threads 4, Duration: 4.019
(Results are only a rough guide, and will depend on your particular CPU and BLAS library. Benchmark this on your own hardware.)
Note: Although 80% of the time is spent inside your cost function, SciPy also seems to be using some BLAS parallelism of its own, as moving the with controller.limit(limits=i, user_api='blas'): line inside the cost function still resulted in some amount of parallelism. Most likely, this is from inverting the Hessian, which is the most expensive step of BFGS apart from computing the cost function itself.
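For reference, that variant would look roughly like this (a sketch, reusing the controller and the matrix H defined above; the function name is just illustrative):
# Limit BLAS threads only inside the cost function, so any BLAS calls SciPy
# makes outside of it (e.g. BFGS's own linear algebra) can still run in parallel.
def cost_single_threaded(x):
    with controller.limit(limits=1, user_api='blas'):
        return (x @ H @ x) / (x @ x)

minimize(cost_single_threaded, x0, method='BFGS')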
Note: One of the reasons this is so slow is that no Jacobian is provided for this function. Without one, the gradient must be estimated numerically by calling cost() once for every dimension of the problem, which here means roughly an extra thousand calls per step.
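If you want to avoid that cost, the Rayleigh quotient has a simple analytic gradient you can supply yourself. A rough sketch, assuming the symmetric H and x0 from the question:
# For symmetric H, the gradient of f(x) = (x @ H @ x) / (x @ x) is
# 2 * (H @ x - f(x) * x) / (x @ x). Returning (value, gradient) and passing
# jac=True lets BFGS skip the numerical differentiation entirely.
def cost_and_grad(x):
    xx = x @ x
    Hx = H @ x
    f = (x @ Hx) / xx
    grad = 2.0 * (Hx - f * x) / xx
    return f, grad

res = minimize(cost_and_grad, x0, method='BFGS', jac=True)
print(res.fun)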