Tags: python, numpy, numba

Contiguous array warning in Python (numba)


I have the following snippet of code where I've used numba in order to speed up my code:

import numpy as np
from numpy import linalg
from numba import jit

Sigma = np.array([
                  [1, 1, 0.5, 0.25],
                  [1, 2.5, 1, 0.5],
                  [0.5, 1, 0.5, 0.25],
                  [0.25, 0.5, 0.25, 0.25]
])
Z = np.array([0.111, 0.00658])

@jit(nopython=True)
def mean(Sigma, Z):
  return np.dot(np.dot(Sigma[0:2, 2:4], linalg.inv(Sigma[2:4, 2:4])), Z)

print(mean(Sigma, Z))

However, numba is complaining:

NumbaPerformanceWarning: np.dot() is faster on contiguous arrays, called on (array(float64, 2d, A), array(float64, 2d, F))
  return np.dot(np.dot(Sigma[0:2, 2:4], linalg.inv(Sigma[2:4, 2:4])), Z)

If I'm not mistaken (after reading this), the contiguous structure of the numpy arrays is broken by slicing sub-matrices out of Sigma (i.e., "Sigma[0:2, 2:4]"). Is this correct? If so, is there any way to resolve this warning? I believe resolving it would help speed up my code, which is my main goal. Thanks.


Solution

  • You get this warning because dot and inv are optimized for contiguous arrays. However, given the small input size, this is not a huge problem. Still, you can at least tell Numba that the input arrays are contiguous by using the signature 'float64[:](float64[:,::1], float64[::1])' in the @jit(...) decorator. This also causes the function to be compiled eagerly; a sketch follows just below.
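
    A minimal sketch of that approach applied to the original function (np.ascontiguousarray copies the non-contiguous slices, which is one straightforward way to silence the warning itself):

    import numpy as np
    from numba import jit

    @jit('float64[:](float64[:,::1], float64[::1])', nopython=True)
    def mean(Sigma, Z):
        # Copy the non-contiguous slices into contiguous arrays so np.dot
        # and np.linalg.inv operate on contiguous data.
        sub = np.ascontiguousarray(Sigma[0:2, 2:4])
        inv = np.linalg.inv(np.ascontiguousarray(Sigma[2:4, 2:4]))
        return np.dot(np.dot(sub, inv), Z)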

    The biggest performance issue in this function is the creation of a few temporary arrays and the call to linalg.inv, which is not designed to be fast for very small matrices. The inverse of a 2x2 matrix can instead be obtained from a simple closed-form expression based on its determinant.
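
    For reference, the inverse of a 2x2 matrix [[a, b], [c, d]] is [[d, -b], [-c, a]] divided by the determinant a*d - b*c, which is what the code below applies element by element.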

    Here is the resulting code:

    import numba as nb
    import numpy as np
    
    @nb.njit('float64[:](float64[:,::1], float64[::1])')
    def fast_mean(Sigma, Z):
        # Compute the inverse matrix
        mat_a = Sigma[2, 2]
        mat_b = Sigma[2, 3]
        mat_c = Sigma[3, 2]
        mat_d = Sigma[3, 3]
        invDet = 1.0 / (mat_a*mat_d - mat_b*mat_c)
        inv_a = invDet * mat_d
        inv_b = -invDet * mat_b
        inv_c = -invDet * mat_c
        inv_d = invDet * mat_a
    
        # Compute the matrix multiplication
        mat_a = Sigma[0, 2]
        mat_b = Sigma[0, 3]
        mat_c = Sigma[1, 2]
        mat_d = Sigma[1, 3]
        tmp_a = mat_a*inv_a + mat_b*inv_c
        tmp_b = mat_a*inv_b + mat_b*inv_d
        tmp_c = mat_c*inv_a + mat_d*inv_c
        tmp_d = mat_c*inv_b + mat_d*inv_d
    
        # Final dot product
        z0, z1 = Z
        result = np.empty(2, dtype=np.float64)
        result[0] = tmp_a*z0 + tmp_b*z1
        result[1] = tmp_c*z0 + tmp_d*z1
        return result
    

    This is about 3 times faster on my machine. Note that more than 60% of the time is spent in the overhead of calling the Numba function and creating the output temporary array. Thus, it is probably wise to use Numba in the caller functions as well, so as to remove this overhead.
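
    For reference, a quick check with the Sigma and Z arrays from the question:

    print(fast_mean(Sigma, Z))  # matches print(mean(Sigma, Z))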

    You can also pass the result array as a parameter so as to avoid creating it inside the function, which is quite expensive, as pointed out by @max9111. This is useful only if you can preallocate the output buffer in the caller functions (once if possible). The function is nearly 6 times faster with this change.
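
    Here is a minimal sketch of that variant (the name fast_mean_out and the explicit out parameter are illustrative, not from the original answer):

    import numpy as np
    import numba as nb

    # Same computation as fast_mean, but writing into a caller-provided
    # buffer instead of allocating the result inside the function.
    @nb.njit('void(float64[:,::1], float64[::1], float64[::1])')
    def fast_mean_out(Sigma, Z, out):
        # Inverse of the lower-right 2x2 block via its determinant
        a, b = Sigma[2, 2], Sigma[2, 3]
        c, d = Sigma[3, 2], Sigma[3, 3]
        invDet = 1.0 / (a*d - b*c)
        inv_a, inv_b = invDet*d, -invDet*b
        inv_c, inv_d = -invDet*c, invDet*a

        # Multiply the upper-right 2x2 block by the inverse, then by Z
        e, f = Sigma[0, 2], Sigma[0, 3]
        g, h = Sigma[1, 2], Sigma[1, 3]
        z0, z1 = Z
        out[0] = (e*inv_a + f*inv_c)*z0 + (e*inv_b + f*inv_d)*z1
        out[1] = (g*inv_a + h*inv_c)*z0 + (g*inv_b + h*inv_d)*z1

    # Preallocate the output buffer once in the caller and reuse it.
    buf = np.empty(2, dtype=np.float64)
    fast_mean_out(Sigma, Z, buf)  # Sigma and Z as defined in the question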