Tags: python, performance, numpy, numba, dot-product

How to make two arrays contiguous so that Numba can speed up np.dot()


I have the following code:

import numpy as np
from numba import jit

Nx = 15
Ny = 1000

v = np.ones((Nx,Ny))
v = np.reshape(v,(Nx*Ny))
A = np.random.rand(Nx*Ny,Nx*Ny,5)
B = np.random.rand(Nx*Ny,Nx*Ny,5)
C = np.random.rand(Nx*Ny,5)
   
@jit(nopython=True)
def dotplus(B, v, C):
    return np.dot(B, v) + C

k = 2
D = dotplus(B[:,:,k], v, C[:,k])

I get the following warning, which I guess refers to arrays B[:,:,k] and v:

NumbaPerformanceWarning: np.dot() is faster on contiguous arrays, called on (array(float64, 2d, A), array(float64, 1d, C))
  return np.dot(B, v0) + C

Is there a way to make the two arrays contiguous, so that Numba can speed up the code?

PS: in case you're wondering about the meaning of k, note that this is just an MRE. In the actual code, dotplus is called multiple times inside a for loop for different values of k (i.e. different slices of B and C). The for loop updates the values of v, but B and C don't change.


Solution

  • Flawr is correct. B[..., k] returns a view into B and does not actually copy any data. In memory, two neighbouring elements of the view are B.strides[1] bytes apart, which evaluates to B.shape[-1]*B.itemsize and is greater than B.itemsize. Consequently, your array is not contiguous.

    The best optimization is to vectorize the dotplus loop and write

    D = np.tensordot(B, v, axes=(1, 0)) + C
    

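    As a sanity check, the per-k dot products and the single tensordot call agree; a minimal sketch with smaller, made-up sizes:

```python
import numpy as np

n, nk = 100, 5                         # made-up small sizes for illustration
B = np.random.rand(n, n, nk)
C = np.random.rand(n, nk)
v = np.ones(n)

# One strided np.dot per slice k, as in the original loop
D_loop = np.stack([B[:, :, k] @ v + C[:, k] for k in range(nk)], axis=1)

# A single contraction over B's middle axis replaces the whole loop
D_vec = np.tensordot(B, v, axes=(1, 0)) + C

print(np.allclose(D_loop, D_vec))      # True
```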
    The second best optimization is to refactor and let the batch dimension be the first dimension of the array. This can be done on top of the above vectorization and is generally advisable. It would look something like

    A = np.random.rand(5, Nx*Ny,Nx*Ny)
    # rather than
    A = np.random.rand(Nx*Ny,Nx*Ny,5)
    

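    The payoff of the batch-first layout is that each 2D slice comes out contiguous, which is exactly what np.dot and Numba want; a quick check with smaller, made-up sizes:

```python
import numpy as np

n = 200                                   # made-up small size for illustration
A_last = np.random.rand(n, n, 5)          # batch axis last, as in the question
A_first = np.random.rand(5, n, n)         # batch axis first, as suggested

# A slice along the last axis is strided; a slice along the first axis
# is one contiguous block of memory.
print(A_last[:, :, 2].flags['C_CONTIGUOUS'])   # False
print(A_first[2].flags['C_CONTIGUOUS'])        # True
```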
    If you can't refactor the code, you need to start profiling. You can temporarily swap the axes; note that np.moveaxis only returns a view, so the copy has to be forced explicitly, e.g.

    B = np.ascontiguousarray(np.moveaxis(B, -1, 0))
    some_op(B[k, ...], ...)
    B = np.moveaxis(B, 0, -1) 
    

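    To see why the explicit copy is needed: moving the axis is only a view, so the per-k slice stays strided until you materialize it. A minimal sketch with smaller, made-up sizes:

```python
import numpy as np

n = 200                              # made-up small size for illustration
B = np.random.rand(n, n, 5)

# Moving the batch axis to the front copies no data, so the
# per-k slice is still strided in memory.
Bm = np.moveaxis(B, -1, 0)
print(Bm[2].flags['C_CONTIGUOUS'])   # False

# Forcing the copy gives np.dot a contiguous buffer to work on,
# with identical contents.
Bc = np.ascontiguousarray(Bm[2])
print(Bc.flags['C_CONTIGUOUS'])      # True
```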
    Contrary to max9111's comment, this will not net you anything compared to np.ascontiguousarray(), because the data has to be copied in both cases. That said, the copy is O((Nx*Ny)²) elements plus a buffer allocation, and the matrix-vector product is likewise O((Nx*Ny)²) operations; with a strided input, though, the multiply has to gather scattered elements first, which is really expensive. Which variant wins comes down to your specific architecture and concrete problem, so profiling is the way to go.
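    A minimal timing sketch you can adapt for that profiling (sizes made up here; substitute your real ones, and expect the outcome to vary by machine):

```python
import timeit

import numpy as np

n, k = 1000, 2                         # made-up sizes; use your real ones
B = np.random.rand(n, n, 5)
v = np.random.rand(n)

# Dot on the strided view vs. explicit copy followed by a contiguous dot
strided = timeit.timeit(lambda: B[:, :, k] @ v, number=20)
copied = timeit.timeit(lambda: np.ascontiguousarray(B[:, :, k]) @ v, number=20)
print(f"strided dot: {strided:.4f}s  copy + contiguous dot: {copied:.4f}s")
```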