Tags: multithreading, numpy, memory-management, cython, gil

Thread-local arrays in Cython's prange without huge memory allocation


I have some independent computations I would like to do in parallel using Cython.

Right now I'm using this approach:

import numpy as np
cimport numpy as cnp
from cython.parallel import prange

[...]

cdef cnp.ndarray[cnp.float64_t, ndim=2] temporary_variable = \
    np.zeros((INPUT_SIZE, RESULT_SIZE), np.float64)
cdef cnp.ndarray[cnp.float64_t, ndim=2] result = \
    np.zeros((INPUT_SIZE, RESULT_SIZE), np.float64)

for i in prange(INPUT_SIZE, nogil=True):
    for j in range(RESULT_SIZE):
        [...]
        temporary_variable[i, j] = some_very_heavy_mathematics(my_input_array)
        result[i, j] = some_more_maths(temporary_variable[i, j])

This approach works, but my problem is that I actually need several such temporary_variables, which leads to huge memory usage as INPUT_SIZE grows. What I believe is really needed is one temporary variable per thread instead.

Am I facing a limitation of Cython's prange and do I need to learn proper C, or am I doing/understanding something terribly wrong?

EDIT: The functions I was looking for were openmp.omp_get_max_threads() and openmp.omp_get_thread_num() to create a reasonably sized temporary array. I had to cimport openmp first.
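In case it helps future readers, the fix described in the edit might be sketched like this (untested; `compute` and the inline arithmetic are stand-ins for the real `some_very_heavy_mathematics` / `some_more_maths`):

```cython
# Sketch: one scratch row per OpenMP thread, indexed by
# omp_get_thread_num(), instead of one row per input element.
import numpy as np
cimport openmp
from cython.parallel import prange

def compute(double[:, :] my_input_array, int RESULT_SIZE):
    cdef int INPUT_SIZE = my_input_array.shape[0]
    cdef double[:, :] scratch = \
        np.zeros((openmp.omp_get_max_threads(), RESULT_SIZE))
    cdef double[:, :] result = np.zeros((INPUT_SIZE, RESULT_SIZE))
    cdef int i, j, tid
    for i in prange(INPUT_SIZE, nogil=True):
        tid = openmp.omp_get_thread_num()  # assigned in the loop, so private
        for j in range(RESULT_SIZE):
            scratch[tid, j] = 2 * my_input_array[i, 0]   # placeholder maths
            result[i, j] = scratch[tid, j] + 10          # placeholder maths
    return np.asarray(result)
```

The scratch array is now sized by the thread count rather than by INPUT_SIZE, so memory usage no longer grows with the input.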


Solution

  • This is something that Cython tries to detect, and actually gets right most of the time. If we take a more complete example code:

    import numpy as np
    from cython.parallel import prange
    
    cdef double f1(double[:,:] x, int i, int j) nogil:
        return 2*x[i,j]
    
    cdef double f2(double y) nogil:
        return y+10
    
    def example_function(double[:,:] arr_in):
        cdef double[:,:] result = np.zeros((arr_in.shape[0], arr_in.shape[1]))
        cdef double temporary_variable
        cdef int i,j
        for i in prange(arr_in.shape[0], nogil=True):
            for j in range(arr_in.shape[1]):
                temporary_variable = f1(arr_in,i,j)
                result[i,j] = f2(temporary_variable)
        return result
    

    (this is basically the same as yours, but compilable). This compiles to the C code:

    #ifdef _OPENMP
    #pragma omp for firstprivate(__pyx_v_i) lastprivate(__pyx_v_i) lastprivate(__pyx_v_j) lastprivate(__pyx_v_temporary_variable)
    #endif /* _OPENMP */
    for (__pyx_t_8 = 0; __pyx_t_8 < __pyx_t_9; __pyx_t_8++){
    

    You can see that temporary_variable is set to be thread-local. If Cython does not detect this correctly (I find it's often too keen to make variables a reduction) then my suggestion is to encapsulate (some of) the contents of the loop in a function:

    cdef double loop_contents(double[:,:] arr_in, int i, int j) nogil:
        cdef double temporary_variable
        temporary_variable = f1(arr_in,i,j)
        return f2(temporary_variable)
    

    Doing so forces temporary_variable to be local to the function (and hence to the thread).


    With respect to creating a thread-local array: I'm not 100% clear exactly what you want to do but I'll try to take a guess...

    1. I don't believe it's possible to create a thread-local memoryview.
    2. You could create a thread-local C array with malloc and free but unless you have a good understanding of C then I would not recommend it.
    3. The easiest way is to allocate a 2D array where you have one column for each thread. The array is shared, but since each thread only touches its own column that doesn't matter. A simple example:

      import numpy as np
      cimport openmp
      from cython.parallel import prange
      
      cdef double[:] f1(double[:,:] x, int i) nogil:
          return x[i,:]
      
      def example_function(double[:,:] arr_in):
          cdef double[:,:] temporary_variable = \
              np.zeros((arr_in.shape[1], openmp.omp_get_max_threads()))
          cdef int i
          for i in prange(arr_in.shape[0], nogil=True):
              temporary_variable[:, openmp.omp_get_thread_num()] = f1(arr_in, i)
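    For completeness, option 2 (a raw C buffer) follows the pattern from Cython's parallelism documentation: a variable assigned inside a `parallel()` block is private to each thread, so each thread can allocate and free its own scratch space. A rough sketch (the arithmetic is a placeholder for the real maths; NULL checks omitted):

```cython
import numpy as np
from cython.parallel import prange, parallel
from libc.stdlib cimport malloc, free

def example_function(double[:,:] arr_in):
    cdef double[:,:] result = np.zeros((arr_in.shape[0], arr_in.shape[1]))
    cdef double *scratch
    cdef Py_ssize_t i, j
    with nogil, parallel():
        # Assigned inside parallel(), so each thread gets its own pointer
        scratch = <double *> malloc(arr_in.shape[1] * sizeof(double))
        for i in prange(arr_in.shape[0]):
            for j in range(arr_in.shape[1]):
                scratch[j] = 2 * arr_in[i, j]      # placeholder maths
                result[i, j] = scratch[j] + 10     # placeholder maths
        free(scratch)  # each thread frees its own buffer
    return np.asarray(result)
```

    This avoids the shared 2D array entirely, at the cost of manual memory management.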