I'm trying to use Cython to parallelize an expensive operation which involves generating intermediate multidimensional arrays.
The following very simplified code illustrates the sort of thing I'm trying to do:
    import numpy as np

    cimport cython
    cimport numpy as np
    from cython.parallel cimport prange
    from libc.stdlib cimport malloc, free

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def embarrassingly_parallel_example(char[:, :] A):
        cdef unsigned int m = A.shape[0]
        cdef unsigned int n = A.shape[1]
        cdef np.ndarray[np.float64_t, ndim=2] out = np.empty((m, m), np.float64)
        cdef unsigned int ii, jj
        cdef double[:, :] tmp
        for ii in prange(m, nogil=True):
            for jj in range(m):
                # allocate a temporary array to hold the result of
                # expensive_function_1
                tmp_carray = <double *> malloc((n ** 2) * sizeof(double))
                # a 2D typed memoryview onto tmp_carray
                tmp = <double[:n, :n]> tmp_carray
                # shove the intermediate result in tmp
                expensive_function_1(A[ii, :], A[jj, :], tmp)
                # get the final (scalar) output for this ii, jj
                out[ii, jj] = expensive_function_2(tmp)
                # free the intermediate array
                free(tmp_carray)
        return out
    # some silly examples - the actual operation I'm performing is a lot more
    # involved
    # ------------------------------------------------------------------------
    @cython.boundscheck(False)
    @cython.wraparound(False)
    cdef void expensive_function_1(char[:] x, char[:] y, double[:, :] tmp):
        cdef unsigned int m = tmp.shape[0]
        cdef unsigned int n = x.shape[0]
        cdef unsigned int ii, jj, kk
        for ii in range(m):
            for jj in range(m):
                tmp[ii, jj] = 0
                for kk in range(n):
                    tmp[ii, jj] += (x[kk] + y[kk]) * (ii - jj)

    @cython.boundscheck(False)
    @cython.wraparound(False)
    cdef double expensive_function_2(double[:, :] tmp):
        cdef unsigned int m = tmp.shape[0]
        cdef unsigned int ii, jj
        cdef double result = 0
        for ii in range(m):
            for jj in range(m):
                result += tmp[ii, jj]
        return result
There seem to be at least two reasons why this fails to compile:

1. Based on the output of cython -a, the creation of the typed memoryview here:

       cdef double[:, :] tmp = <double[:n, :n]> tmp_carray

   seems to involve Python API calls, so I can't release the GIL to allow the outer loop to run in parallel. I was under the impression that typed memoryviews were not Python objects, and that a worker thread therefore ought to be able to create one without first acquiring the GIL. Is this the case?
2. Even if I replace prange(m, nogil=True) with a normal range(m), Cython still doesn't seem to like the presence of a cdef statement within the inner loop:
       Error compiling Cython file:
       ------------------------------------------------------------
       ...
                   # allocate a temporary array to hold the result of
                   # expensive_function_1
                   tmp_carray = <double*> malloc((n ** 2) * sizeof(double))
                   # a 2D typed memoryview onto tmp_carray
                   cdef double[:, :] tmp = <double[:n, :n]> tmp_carray
                      ^
       ------------------------------------------------------------
       parallel_allocate.pyx:26:17: cdef statement not allowed here
It turns out that the second problem was easily solved by moving the cdef double[:, :] tmp declaration outside of the for loop, and just assigning

    tmp = <double[:n, :n]> tmp_carray

within the loop. I still don't fully understand why this is necessary, though.
Now if I try to use prange I hit the following compilation error:

    Error compiling Cython file:
    ------------------------------------------------------------
    ...
                # allocate a temporary array to hold the result of
                # expensive_function_1
                tmp_carray = <double*> malloc((n ** 2) * sizeof(double))
                # a 2D typed memoryview onto tmp_carray
                tmp = <double[:n, :n]> tmp_carray
                   ^
    ------------------------------------------------------------
    parallel_allocate.pyx:28:16: Memoryview slices can only be shared in parallel sections
Disclaimer: Everything here is to be taken with a grain of salt; I'm more guessing than knowing. You should certainly ask the question on Cython-User. They are always friendly and fast to answer.
I agree that Cython's documentation is not very clear:
[...] memoryviews often do not need the GIL:
cpdef int sum3d(int[:, :, :] arr) nogil: ...
In particular, you do not need the GIL for memoryview indexing, slicing or transposing. Memoryviews require the GIL for the copy methods (C and Fortran contiguous copies), or when the dtype is object and an object element is read or written.
I think this means that passing a memoryview parameter, or using it for indexing, slicing, or transposing, doesn't need the Python GIL. However, creating a memoryview, or copying one, does need the GIL.
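To illustrate the first half of that reading, here is a minimal sketch (the function name is my own) of a nogil function that only receives and indexes an already-created memoryview; this is exactly the pattern the documentation's sum3d example uses, and it compiles without the GIL:

```cython
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef double view_sum(double[:, :] v) nogil:
    # Only indexes an existing memoryview - no Python API calls,
    # so no GIL is needed. Creating a view (e.g. the
    # <double[:n, :n]> cast) is what requires the GIL.
    cdef Py_ssize_t i, j
    cdef double s = 0
    for i in range(v.shape[0]):
        for j in range(v.shape[1]):
            s += v[i, j]
    return s
```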
Another argument supporting this is that it is possible for a Cython function to return a memoryview to Python:

    from cython.view cimport array as cvarray
    import numpy as np

    def bla():
        narr = np.arange(27, dtype=np.dtype("i")).reshape((3, 3, 3))
        cdef int [:, :, :] narr_view = narr
        return narr_view
This gives:

    >>> import hello
    >>> hello.bla()
    <MemoryView of 'ndarray' at 0x1b03380>

which means that the memoryview is allocated in Python's GC-managed memory and thus needs the GIL to be created. So you can't create a memoryview in a nogil section.
Now, as for the error message

    Memoryview slices can only be shared in parallel sections

I think you should read it as: you can't have a thread-private memoryview slice; a memoryview slice used in a parallel section must be shared between the threads.
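If the goal is simply to get the original loop running in parallel, one way to sidestep both errors is to never create a memoryview inside the nogil block at all: keep the per-thread scratch space as a raw double * and do the 2D indexing manually. This is only a sketch, under the assumption that expensive_function_1 and expensive_function_2 can be rewritten as nogil functions taking a raw pointer plus the dimension n and indexing tmp[i * n + j] instead of tmp[i, j]:

```cython
import numpy as np

cimport cython
cimport numpy as np
from cython.parallel cimport prange
from libc.stdlib cimport malloc, free

@cython.boundscheck(False)
@cython.wraparound(False)
def embarrassingly_parallel_example(char[:, :] A):
    cdef unsigned int m = A.shape[0]
    cdef unsigned int n = A.shape[1]
    cdef np.ndarray[np.float64_t, ndim=2] out = np.empty((m, m), np.float64)
    cdef unsigned int ii, jj
    cdef double *tmp
    for ii in prange(m, nogil=True):
        for jj in range(m):
            # a variable assigned inside a prange body is
            # thread-private, so each thread gets its own buffer
            tmp = <double *> malloc(n * n * sizeof(double))
            # slicing A[ii, :] is fine without the GIL; the helpers
            # would need to be declared nogil and take (double *, n),
            # indexing tmp[i * n + j] internally (my assumption)
            expensive_function_1(A[ii, :], A[jj, :], tmp, n)
            out[ii, jj] = expensive_function_2(tmp, n)
            free(tmp)
    return out
```

Since malloc and free are plain C calls and the pointer never becomes a Python-visible object, nothing inside the prange body should require the GIL.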