python optimization integration scientific-computing quad

Parallelisation/Optimisation of integrals in Python

I need to compute a large set of INDEPENDENT integrals and I have currently written the following code for it in Python:

# We define the integrand
def integrand(tau,k,t,z,config, epsilon=1e-7):
    u = np.sqrt(np.maximum(epsilon,tau**2 - z**2))
    return np.sin(config.omega * (t - tau))*k/2, np.sin(config.omega * (t - tau)) * j1(k * u) / u


NON-VECTORISED
partial_integral = np.zeros((len(n_values),len(t_values),len(z_values)))

for n in range(1, len(n_values)): # We skip the n=0 case as it is trivially = 0
    for j in range(1, len(z_values)): # We skip the z=0 case as it is trivially = 0
        for i in range(1, len(t_values)): # We skip the t=0 case as it is trivially = 0
            partial_integral[n,i,j],_ = quad(integrand, x_min[n,i,j], x_max[n,i,j], args=(k_n_values[n],t_values[i],z_values[j],config), limit=np.inf) # We use quad

Now, all of len(n_values),len(t_values),len(z_values) are big numbers, so I would like to accelerate the code as much as possible. Any suggestions?

Apart from playing with different integration libraries (among which Scipy.quad seems the best), I have thought about vectorising the code:

# VECTORISED
    def compute_integral(n,i,j):
        quad(integrand, x_min[n,i,j], x_max[n,i,j], args=(k_n_values[n],t_values[i],z_values[j],config, epsilon), limit=10000) # We use quad
    
    # Use np.meshgrid to create the index grids for n, i, j (starting from 1 to avoid 0-index)
    n_grid, i_grid, j_grid = np.meshgrid(np.arange(0, len(n_values)), np.arange(0, len(t_values)), np.arange(0, len(z_values)), indexing='ij')

    # Flatten the grids to vectorize the loop over n, i, j
    indices = np.vstack([n_grid.ravel(), i_grid.ravel(), j_grid.ravel()]).T

    # Vectorize the integral computation using np.vectorize
    vectorized_integral = np.vectorize(lambda n, i, j: compute_integral(n, i, j))

    # Apply the vectorized function to all combinations of (n, i, j)
    partial_integral = np.empty((len(n_values),len(t_values),len(z_values)))
    partial_integral[tuple(indices.T)] = vectorized_integral(*indices.T)

But it is not clear that this wins me much...

I have also tried to use Numba (with numba-scipy for the j1) to JIT the integrand in the hope of gaining a bit of performance, and indeed this has served me to get a x5 performance! To do this I just changed my function to the following:

from numba import njit # Remember to also have numba-scipy installed!!!!!

# We define the integrand and JIT it with Numba (and numba-scipy) for a faster performance
    @njit
    def integrand(tau,k,t,z,omega, epsilon):
        u = np.sqrt(np.maximum(0,tau**2 - z**2))
        if u < epsilon:
            return np.sin(omega * (t - tau))*k/2
        else:
            return np.sin(omega * (t - tau)) * j1(k * u) / u

PS: As an extra, I am currently running this script in my PC, but I would probably be able to run it on a cluster.

Full code:

import os
import numpy as np
from scipy.special import j1
from scipy.integrate import quad
from numba import njit # Remember to also have numba-scipy installed!!!!!

from tqdm import tqdm


def perform_integrals(config):
    '''
    We perform the integrals using quad
    '''

    # We store the range of n, kn and gn in arrays of length N_max
    n_values = np.linspace(0, config.N_max-1, config.N_max, dtype=int)
    k_n_values = 2 * np.pi * n_values

    # We store the range of t, z in the arrays of dimension (N_t) and (N_z)
    t_values = np.linspace(0, config.N_t*config.delta_t, config.N_t)
    z_values = np.linspace(0, config.N_z*config.delta_z, config.N_z)

    # Preallocate the result arrays (shape: len(n_values) x len(z_values))
    x_min = np.zeros((len(t_values), len(z_values)))
    x_max = np.empty((len(t_values), len(z_values)))

    # Compute the values
    t1_values = np.roll(t_values, 1)
    t1_values[0] = 0.
    x_min = np.maximum(z_values[None, :], t1_values[:, None])  # Max between z_values[j] and t_values[i-1]
    x_max = np.maximum(z_values[None, :], t_values[:, None])  # Max between z_values[j] and t_values[i]


    # We define the integrand and JIT it with Numba (and numba-scipy) for a faster performance
    @njit
    def integrand(tau,k,t,z,omega, epsilon=1e-7):
        u = np.sqrt(np.maximum(0,tau**2 - z**2))
        if u < epsilon:
            return np.sin(omega * (t - tau))*k/2
        else:
            return np.sin(omega * (t - tau)) * j1(k * u) / u
        


    # NON-VECTORISED
    partial_integral = np.zeros((len(n_values),len(t_values),len(z_values)))

    for n in tqdm(range(1, len(n_values))): # We skip the n=0 case as it is trivially = 0
        for i in range(1, len(t_values)): # We skip the t=0 case as it is trivially = 0
            for j in range(1, len(z_values)): # We skip the z=0 case as it is trivially = 0
                partial_integral[n,i,j],_ = quad(integrand, x_min[i,j], x_max[i,j], args=(k_n_values[n],t_values[i],z_values[j],config.omega, 1e-7), limit=10000, epsabs=1e-7, epsrel=1e-4) # We use quad


    return partial_integral

class TalbotConfig:
    def __init__(self):
        self.A = 1. # Amplitude of signal
        self.c = 1. # Speed of light
        self.d = 1. # Distance between gratings we fix it = 1
        self._lambda = self.d / 10. # Wavelength
        self.w = 2 * self._lambda # Width of the gratings

        # Other relevant magnitudes
        self.omega = 2 * np.pi * self.c / self._lambda # Frequency of the signal
        self.z_T = self._lambda/(1. - np.sqrt(1.-(self._lambda/self.d) ** 2)) # Talbot distance = 2 d^2/λ

        # Simulation parameters
        self.N_x = 27*2 -1 # Number of samples in x direction
        self.N_z = 192*2 -1 # Number of samples in z direction
        self.N_t = 100 -1 # Number of samples in time
        self.N_max = int(self.d / self._lambda * 4) # Number of terms in the series

        # Other relevant magnitudes
        self.last_t_zT = 1. # Final time / Z_t
        self.delta_t = self.z_T/self.c/self.N_t * self.last_t_zT # Time between photos
        self.delta_x = self.d/2/self.N_x # X-Distance between points
        self.delta_z = self.z_T/self.N_z # Z-Distance between points


    def __str__(self):
        params = {
            "Amplitude of signal (A)": self.A,
            "Speed of light (c)": self.c,
            "Distance between gratings (d)": self.d,
            "Wavelength (lambda)": self._lambda,
            "Width of the gratings (w)": self.w,
            "Frequency of the signal (omega)": self.omega,
            "Talbot distance (z_T)": self.z_T,
            "Number of samples in x direction (N_x)": self.N_x,
            "Number of samples in z direction (N_z)": self.N_z,
            "Number of samples in time (N_t)": self.N_t,
            "Number of terms in the series (N_max)": self.N_max,
            "Time between photos (delta_t)": self.delta_t,
            "X-Distance between points (delta_x)": self.delta_x,
            "Z-Distance between points (delta_z)": self.delta_z,
            "Final time / z_T": self.last_t_zT
        }
        
        print("{:<45} {:<40}".format('\nParameter', 'Value'))
        print("-" * 65)
        for key, value in params.items():
            print("{:<45} {:<40}".format(key, value))
        return ""
    
    def debugging(self):
        if self.Debugging:
            # Simulation parameters
            self.N_x = 5 # Number of samples in x direction
            self.N_z = 5 # Number of samples in z direction
            self.N_t = 5 # Number of samples in time
            self.N_max = int(self.d / self._lambda)*2 # Number of terms in the series

        return

if __name__ == "__main__":
    config = TalbotConfig()

    # Are we debugging?
    config.Debugging = False
    config.debugging()

    print(config)

 

    integral = perform_integrals(config)

Solution

A significant fraction of the time is spent in the Bessel function J1 and this function is relatively well optimized when numba-scipy is installed and the integrand function is compiled with numba (i.e. no need to call it from the slow CPython interpreter). That being said, in practice, this function is implemented in the Cephes library which is apparently not the fastest one. The GSL seems to be a bit faster. Still, using the GSL from Python should be slow too. Using it from Numba is possible but a bit cumbersome (the library file needs to be manually loaded with modules like ctypes/cffi and then you need to define the function prototype before calling it).

Moreover, the quad function of Scipy seems to spend about 20~30% of the time in CPython overheads (e.g. calling CPython functions, checking CPython types, converting them, etc.).

These overheads are usually reduced with scipy.LowLevelCallable. However, this is pretty hard to use it here since you need many arguments. Indeed, this requires the arguments to be wrapped in the LowLevelCallable objects as a ctypes pointer to a Numpy/ctypes array and also the items to be modified every time in the innermost loop. Setting the items takes some time due to the slow interpreter and inefficient scalar Numpy access... On top of that, the pointer of type void* passed to the integrand function needs to be converted to a double* and this turns out to be surprisingly complicated in Numba... I tried to do all of that and in the and I got a frustrating "segmentation fault" and found meanwhile weird Numba bugs... I do not recommend this solution at all.

An alternative solution is simply to rewrite this in a native language like C or C++, typically using the GSL (which turns out to be pretty fast for this kind of operation). Only the nested loops needs to be rewritten. The main downside of this solution is that all Python variables required for the computation need to be provided to the native function and wrapped as native types (which is a bit tedious since there are quite a lot of arguments needed). That being said, the native C++ code can be trivially and efficiently parallelized with OpenMP and it is much faster!

Here is the C++ code computing the nested loops:

#include <cstdlib>
#include <cstdint>
#include <cstdio>
#include <cmath>
#include <gsl/gsl_integration.h>
#include <gsl/gsl_sf_bessel.h>

double integrand(double tau, void* generic_params)
{
    const double* params = (double*)generic_params;
    const double k = params[0];
    const double t = params[1];
    const double z = params[2];
    const double omega = params[3];
    const double epsilon = params[4];

    const double u = sqrt(std::max(0.0, tau*tau - z*z));

    if(u < epsilon)
        return sin(omega * (t - tau)) * k * 0.5;
    else
        return sin(omega * (t - tau)) * gsl_sf_bessel_J1(k * u) / u;
}

extern "C" void compute(double* partial_integral, double* x_min, double* x_max, 
                        double* k_n_values, double* t_values, double* z_values, 
                        int n_size, int t_size, int z_size, 
                        int limit, double epsabs, double epsrel, 
                        double omega, double epsilon)
{
    #pragma omp parallel for collapse(2) schedule(dynamic)
    for (int n = 1; n < n_size; ++n)
    {
        for (int t = 1; t < t_size; ++t)
        {
            gsl_integration_workspace* workspace = gsl_integration_workspace_alloc(1024*1024);

            double params[5];
            double err [[maybe_unused]];

            gsl_function func;
            func.function = &integrand;
            func.params = &params;

            for (int z = 1; z < z_size; ++z)
            {
                params[0] = k_n_values[n];
                params[1] = t_values[t];
                params[2] = z_values[z];
                params[3] = omega;
                params[4] = epsilon;

                gsl_integration_qags(&func, x_min[t*z_size+z], x_max[t*z_size+z], 
                                     epsabs, epsrel, limit, workspace, 
                                     &partial_integral[(n*t_size+t)*z_size+z], &err);
            }

            gsl_integration_workspace_free(workspace);
        }
    }
}

Here is the Python code to load the C++ library and its main function:

import ctypes

# Load the compiled library with ctypes
kernel = ctypes.CDLL('./libkernel.so')
int_t = ctypes.c_int
double_t = ctypes.c_double
ptr_t = ctypes.POINTER(double_t)
kernel.compute.argtypes = [
    ptr_t, ptr_t, ptr_t, ptr_t, ptr_t, ptr_t,
    int_t, int_t, int_t, int_t,
    double_t, double_t, double_t, double_t
]
kernel.compute.restype = None

# Utility function
ptr = lambda array: array.ctypes.data_as(ptr_t)

Here is the Python code to call the C++ function (instead of the nested Python loop):

kernel.compute(
    ptr(partial_integral), ptr(x_min), ptr(x_max), 
    ptr(k_n_values), ptr(t_values), ptr(z_values), 
    len(n_values), len(t_values), len(z_values), 
    10000, 1e-7, 1e-4, 
    config.omega, 1e-7
)

You can compile this code on Linux with g++ -O3 kernel.cpp -lgsl --shared -fPIC -o libkernel.so -fopenmp. You may need to install On Windows, the standard compiler is MSVC, but it is certainly simpler to use Clang-CL for this (so you need to change the executable name of the previous command). Moreover, the name of the output library needs to be changed to kernel.dll (and so the name of the loaded file in the ctypes section of the Python code). For Mac, using Clang is also good idea too (AFAIK the output file should be kernel.dylib).

On my i5-9600KF CPU (6 cores) on Linux, the optimized code is 17 times faster! It only takes 3.3 seconds!

If for some reasons you cannot use a native language or do not want to, then I think the only simple solution is to parallelize the code with libraries like joblib (or possibly multiprocessing). This will clearly not be as fast as the native version but it should scale well as long as the Numba function is cached (using the cache=True compilation flag). Otherwise, the function will certainly be recompiled in each worker process (pretty expensive). The native version should be about 3 times faster than this Python version.

Distributed computing with MPI

PS: As an extra, I am currently running this script in my PC, but I would probably be able to run it on a cluster.

The standard way to do that in the scientific community is to use MPI. For Python codes, there is the mpi4py module. You can also use it in C/C++ quite easily. MPI needs to be installed and setup on the cluster (which is the case on nearly all clusters used for scientific computations on nearly all supercomputers nowadays).

To support MPI, you need to first initialize/finalize it in the Python script (at the beginning/end) and then you can adapt the C++ function so to use MPI. Each MPI process can compute a part of the n loop (you can compute the range to compute thanks to MPI_Comm_rank and MPI_Comm_size functions). Each process can compute a slice of the partial_integral array and can then send it to the master process (the one with the rank 0). Note you need to send all the arrays to the other processes using MPI_Bcast (or to recompute them in each process). Regarding OpenMP, you can move the #pragma to the next t-based loop (since the former now distributed with MPI), while taking care to remove the collapse(2). You should also take care of spawning 1 MPI process per node (so to avoid creating far too many threads). This MPI code can scale up to n_size nodes (i.e. ~40 here) and using all the available cores on each nodes. Alternatively, you can use 1 MPI process per core and not use OpenMP at all, but this requires the t-based loop to be also distributed for the code to scale (which is more complex than just using OpenMP).