Tags: numpy, parallel-processing, cython, intel-mkl

Is there any way the speed of the following NumPy code can be increased, maybe by parallelizing?


I am writing an application which requires very low latency. The application will run on an Intel Xeon processor with MKL-DNN and the AVX instruction set. The following code takes 22 milliseconds when executed on an Intel i7-9750H processor.

def func(A, B):
    result = 0
    for ind in range(len(B)):
        # Rows of A whose first three columns are all <= those of B[ind]
        index = (A[:, 0] <= B[ind, 0]) & (A[:, 1] <= B[ind, 1]) & (A[:, 2] <= B[ind, 2])
        result += A[index, 3].sum() * B[ind, 3]
        # Drop the matched rows so each row of A is counted at most once
        A = A[~index]
    return result
%timeit func(A,B)
21.5 ms ± 509 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Is there a way to improve the code so that execution time decreases? Anything less than 5 milliseconds would be great. By the way, matrix A has a shape of (80000, 4) and matrix B has a shape of (32, 4); both are sorted on the first three columns. Can any component be parallelized? The application can use 16 cores.


Solution

  • Instead of your function use:

    def func2(A, B):
        # x[i] holds the B[:, 3] weight of the first row of B that matched A[i],
        # or 0 if no row of B has matched it yet
        x = np.zeros(A.shape[0], dtype=int)
        for bInd in range(len(B)):
            # Only rows not yet matched (x == 0) are candidates;
            # np.where turns already-matched positions into False
            x[np.where(x, False, np.all(A[:, 0:3] <= B[bInd, 0:3], axis=1))] = B[bInd, 3]
        return (A[:, 3] * x).sum()
    

    The speed gain is smaller than you might expect. Using A of shape (10, 4) and B of shape (4, 4), I got an execution time about 15 % shorter than for your function.

    But the gain may be more apparent on bigger source arrays. Try it on your own data.
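Since the question also asks about cutting latency further, another option (a sketch, not from the answer above) is to remove the Python loop over B entirely with broadcasting. The hypothetical `func3` below reproduces the original semantics: each row of A is credited to the *first* matching row of B, mirroring how the original loop removes matched rows of A.

```python
import numpy as np

def func3(A, B):
    # M[i, j] is True when the first three columns of A[i] are all <= those of B[j].
    # Broadcasting (nA, 1, 3) against (1, nB, 3) yields an (nA, nB) boolean matrix.
    M = np.all(A[:, None, :3] <= B[None, :, :3], axis=2)
    # The original loop consumes each row of A at the FIRST matching row of B,
    # so take the index of the first True along the B axis.
    first = M.argmax(axis=1)
    # Rows of A with no matching row of B contribute nothing.
    valid = M.any(axis=1)
    return (A[valid, 3] * B[first[valid], 3]).sum()
```

For the stated shapes this builds an intermediate of roughly 80000 x 32 x 3 booleans (a few megabytes), which is a single pass of vectorized work that MKL/AVX-backed NumPy can chew through without any per-iteration Python overhead; whether it beats 5 ms is something to measure on the target machine.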