Tags: python, numpy, parallel-processing, cython, hpc

Parallelize in Cython without GIL


I'm trying to compute some columns of a numpy array by operating on Python objects (numpy arrays) in a for loop, using a cdef function.

I would like to do this in parallel, but I'm not sure how.

Here is a toy example: a def function calls a cdef function in a for loop using prange, which is not allowed because np.ndarray is a Python object. In my real problem, one matrix and one vector are the arguments of the cdef function, which performs some numpy matrix operations, like np.linalg.pinv() (which I suspect is the actual bottleneck).

%%cython
import numpy as np
cimport numpy as np
from cython.parallel import prange
from c_functions import estimate_coef_linear_regression

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

def transpose_example(np.ndarray[DTYPE_t, ndim=2] data):
    """
    Transposes a matrix, handling each row independently and in parallel
    """

    cdef Py_ssize_t n = data.shape[0]
    cdef Py_ssize_t t = data.shape[1]

    cdef np.ndarray[DTYPE_t, ndim=2] results = np.zeros((t, n))

    cdef Py_ssize_t i

    for i in prange(t, nogil=True):
        results[i, :] = transpose_vector(data[:, i])

    return results

cdef transpose_vector(np.ndarray[DTYPE_t, ndim=1] vector):
    """
    transposes a np vector
    """
    return vector.transpose()

a = np.random.rand(100, 20)
transpose_example(a)

This outputs:

Converting to Python object not allowed without gil

What would be the best way to do this in parallel?


Solution

  • You can pass typed memoryview slices (cdef transpose_vector(DTYPE_t[:] vector)) around without the GIL - it's one of the key advantages of the newer typed memoryview syntax over np.ndarray.

    However,

    • You can't call Numpy member functions (like transpose) on memoryviews, unless you cast back to a Numpy array (np.asarray(vector)). This requires the GIL.
    • Calling any kind of Python function (e.g. transpose) is going to require the GIL. This can be done inside a with gil: block, but when that block is almost your entire loop that becomes pretty pointless.
    • You don't specify a return type for transpose_vector, and so it'll default to object, which requires the GIL. You could specify a Cython return type, but I suspect even returning a memoryview slice may require some reference counting somewhere.
    • Be careful not to have multiple threads overwriting the same data in your passed memoryview slice.

    In summary: memoryview slices, but bear in mind you're quite limited in what you can do without the GIL. Your current example just isn't parallelizable (but this may be mostly because it's a toy example).
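
    As a rough sketch of what a GIL-free version could look like (untested, and assuming the loop body only needs element-wise copies rather than Numpy calls), the toy example can be rewritten with typed memoryviews so the prange body never touches a Python object. The helper name copy_column is made up for illustration:

    %%cython
    import numpy as np
    cimport cython
    from cython.parallel import prange

    ctypedef double DTYPE_t

    # Copy column i of data into row i of out. Only C-level indexing,
    # so this can be declared nogil and called from inside prange.
    cdef void copy_column(DTYPE_t[:, :] data, DTYPE_t[:, :] out,
                          Py_ssize_t i) nogil:
        cdef Py_ssize_t j
        for j in range(data.shape[0]):
            out[i, j] = data[j, i]

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def transpose_example(DTYPE_t[:, :] data):
        cdef Py_ssize_t n = data.shape[0]
        cdef Py_ssize_t t = data.shape[1]
        # Allocate the result as a Numpy array (with the GIL),
        # then take a memoryview on it for the parallel loop.
        results_arr = np.zeros((t, n))
        cdef DTYPE_t[:, :] results = results_arr
        cdef Py_ssize_t i
        for i in prange(t, nogil=True):
            copy_column(data, results, i)
        return results_arr

    Each iteration writes a distinct row of results, so the threads never overlap. Note this only works because the inner function is pure C-level code; the moment you need something like np.linalg.pinv() per iteration, you're back to needing the GIL (or a C/LAPACK-level routine instead).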