I'm trying to use Cython to parallelize an expensive operation which involves generating intermediate multidimensional arrays.
The following very simplified code illustrates the sort of thing I'm trying to do:
    import numpy as np

    cimport cython
    cimport numpy as np
    from cython.parallel cimport prange
    from libc.stdlib cimport malloc, free

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def embarrassingly_parallel_example(char[:, :] A):
        cdef unsigned int m = A.shape[0]
        cdef unsigned int n = A.shape[1]
        cdef np.ndarray[np.float64_t, ndim=2] out = np.empty((m, m), np.float64)
        cdef unsigned int ii, jj
        cdef double[:, :] tmp
        for ii in prange(m, nogil=True):
            for jj in range(m):
                # allocate a temporary array to hold the result of
                # expensive_function_1
                tmp_carray = <double *> malloc((n ** 2) * sizeof(double))
                # a 2D typed memoryview onto tmp_carray
                tmp = <double[:n, :n]> tmp_carray
                # shove the intermediate result in tmp
                expensive_function_1(A[ii, :], A[jj, :], tmp)
                # get the final (scalar) output for this ii, jj
                out[ii, jj] = expensive_function_2(tmp)
                # free the intermediate array
                free(tmp_carray)
        return out
    # some silly examples - the actual operation I'm performing is a lot more
    # involved
    # ------------------------------------------------------------------------
    @cython.boundscheck(False)
    @cython.wraparound(False)
    cdef void expensive_function_1(char[:] x, char[:] y, double[:, :] tmp):
        cdef unsigned int m = tmp.shape[0]
        cdef unsigned int n = x.shape[0]
        cdef unsigned int ii, jj, kk
        for ii in range(m):
            for jj in range(m):
                tmp[ii, jj] = 0
                for kk in range(n):
                    tmp[ii, jj] += (x[kk] + y[kk]) * (ii - jj)

    @cython.boundscheck(False)
    @cython.wraparound(False)
    cdef double expensive_function_2(double[:, :] tmp):
        cdef unsigned int m = tmp.shape[0]
        cdef unsigned int ii, jj
        cdef double result = 0
        for ii in range(m):
            for jj in range(m):
                result += tmp[ii, jj]
        return result
There seem to be at least two reasons why this fails to compile:

1. Based on the output of cython -a, the creation of the typed memoryview here:

       cdef double[:, :] tmp = <double[:n, :n]> tmp_carray

   seems to involve Python API calls, so I can't release the GIL to allow the outer loop to run in parallel. I was under the impression that typed memoryviews were not Python objects, and that a worker thread therefore ought to be able to create one without first acquiring the GIL. Is this the case?
2. Even if I replace prange(m, nogil=True) with a normal range(m), Cython still doesn't seem to like the presence of a cdef statement within the inner loop:
       Error compiling Cython file:
       ------------------------------------------------------------
       ...
                   # allocate a temporary array to hold the result of
                   # expensive_function_1
                   tmp_carray = <double*> malloc((n ** 2) * sizeof(double))
                   # a 2D typed memoryview onto tmp_carray
                   cdef double[:, :] tmp = <double[:n, :n]> tmp_carray
                      ^
       ------------------------------------------------------------
       parallel_allocate.pyx:26:17: cdef statement not allowed here
It turns out that the second problem was easily solved by moving the cdef double[:, :] tmp declaration outside of the for loop, and just assigning

    tmp = <double[:n, :n]> tmp_carray

within the loop. I still don't fully understand why this is necessary, though.
Now if I try to use prange I hit the following compilation error:

    Error compiling Cython file:
    ------------------------------------------------------------
    ...
                # allocate a temporary array to hold the result of
                # expensive_function_1
                tmp_carray = <double*> malloc((n ** 2) * sizeof(double))
                # a 2D typed memoryview onto tmp_carray
                tmp = <double[:n, :n]> tmp_carray
                   ^
    ------------------------------------------------------------
    parallel_allocate.pyx:28:16: Memoryview slices can only be shared in parallel sections
Disclaimer: Everything here is to be taken with a grain of salt; I'm more guessing than knowing. You should certainly ask the question on Cython-User. They are always friendly and fast to answer.
I agree that Cython's documentation is not very clear:
[...] memoryviews often do not need the GIL:
cpdef int sum3d(int[:, :, :] arr) nogil: ...
In particular, you do not need the GIL for memoryview indexing, slicing or transposing. Memoryviews require the GIL for the copy methods (C and Fortran contiguous copies), or when the dtype is object and an object element is read or written.
I think this means that passing a memoryview parameter, or using it for indexing, slicing, or transposing, doesn't need the Python GIL. However, creating a memoryview, or copying one, does need the GIL.
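To illustrate the first half of that reading, here is a minimal sketch (the function name is my own) of a nogil function that only receives and indexes an already-created memoryview; this is exactly the pattern the documentation's sum3d example uses, and it compiles without the GIL:

```cython
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef double view_sum(double[:, :] v) nogil:
    # Only indexes an existing memoryview - no Python API calls,
    # so no GIL is needed. Creating a view (e.g. the
    # <double[:n, :n]> cast) is what requires the GIL.
    cdef Py_ssize_t i, j
    cdef double s = 0
    for i in range(v.shape[0]):
        for j in range(v.shape[1]):
            s += v[i, j]
    return s
```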
Another argument supporting this is that it is possible for a Cython function to return a memoryview to Python:

    from cython.view cimport array as cvarray
    import numpy as np

    def bla():
        narr = np.arange(27, dtype=np.dtype("i")).reshape((3, 3, 3))
        cdef int [:, :, :] narr_view = narr
        return narr_view
This gives:

    >>> import hello
    >>> hello.bla()
    <MemoryView of 'ndarray' at 0x1b03380>

which means that the memoryview is allocated in Python's GC-managed memory and thus needs the GIL to be created. So you can't create a memoryview in a nogil section.
Now, as for the error message

    Memoryview slices can only be shared in parallel sections

I think you should read it as: you can't have a thread-private memoryview slice; a memoryview slice used in a parallel section must be shared between the threads.
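If the goal is simply to get the original loop running in parallel, one way to sidestep both errors is to never create a memoryview inside the nogil block at all: keep the per-thread scratch space as a raw double * and do the 2D indexing manually. This is only a sketch, under the assumption that expensive_function_1 and expensive_function_2 can be rewritten as nogil functions taking a raw pointer plus the dimension n and indexing tmp[i * n + j] instead of tmp[i, j]:

```cython
import numpy as np

cimport cython
cimport numpy as np
from cython.parallel cimport prange
from libc.stdlib cimport malloc, free

@cython.boundscheck(False)
@cython.wraparound(False)
def embarrassingly_parallel_example(char[:, :] A):
    cdef unsigned int m = A.shape[0]
    cdef unsigned int n = A.shape[1]
    cdef np.ndarray[np.float64_t, ndim=2] out = np.empty((m, m), np.float64)
    cdef unsigned int ii, jj
    cdef double *tmp
    for ii in prange(m, nogil=True):
        for jj in range(m):
            # a variable assigned inside a prange body is
            # thread-private, so each thread gets its own buffer
            tmp = <double *> malloc(n * n * sizeof(double))
            # slicing A[ii, :] is fine without the GIL; the helpers
            # would need to be declared nogil and take (double *, n),
            # indexing tmp[i * n + j] internally (my assumption)
            expensive_function_1(A[ii, :], A[jj, :], tmp, n)
            out[ii, jj] = expensive_function_2(tmp, n)
            free(tmp)
    return out
```

Since malloc and free are plain C calls and the pointer never becomes a Python-visible object, nothing inside the prange body should require the GIL.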