Tags: multithreading, numpy, memory-management, cython, gil

Thread-local arrays in Cython's prange without huge memory allocation


I have some independent computations I would like to do in parallel using Cython.

Right now I'm using this approach:

import numpy as np
cimport numpy as cnp
from cython.parallel import prange

[...]

cdef cnp.ndarray[cnp.float64_t, ndim=2] temporary_variable = \
    np.zeros((INPUT_SIZE, RESULT_SIZE), np.float64)
cdef cnp.ndarray[cnp.float64_t, ndim=2] result = \
    np.zeros((INPUT_SIZE, RESULT_SIZE), np.float64)

for i in prange(INPUT_SIZE, nogil=True):
    for j in range(RESULT_SIZE):
        [...]
        temporary_variable[i, j] = some_very_heavy_mathematics(my_input_array)
        result[i, j] = some_more_maths(temporary_variable[i, j])

This approach works, but my problem is that I actually need several such temporary_variables, which leads to huge memory usage as INPUT_SIZE grows. What I believe is really needed is one temporary variable per thread instead.

Am I facing a limitation of Cython's prange and do I need to learn proper C, or am I doing/understanding something terribly wrong?

EDIT: The functions I was looking for were openmp.omp_get_max_threads() and openmp.omp_get_thread_num() to create a reasonably sized temporary array. I had to cimport openmp first.
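In case it helps future readers, the fix described in the edit might be sketched like this (untested; `compute` and the inline arithmetic are stand-ins for the real `some_very_heavy_mathematics` / `some_more_maths`):

```cython
# Sketch: one scratch row per OpenMP thread, indexed by
# omp_get_thread_num(), instead of one row per input element.
import numpy as np
cimport openmp
from cython.parallel import prange

def compute(double[:, :] my_input_array, int RESULT_SIZE):
    cdef int INPUT_SIZE = my_input_array.shape[0]
    cdef double[:, :] scratch = \
        np.zeros((openmp.omp_get_max_threads(), RESULT_SIZE))
    cdef double[:, :] result = np.zeros((INPUT_SIZE, RESULT_SIZE))
    cdef int i, j, tid
    for i in prange(INPUT_SIZE, nogil=True):
        tid = openmp.omp_get_thread_num()  # assigned in the loop, so private
        for j in range(RESULT_SIZE):
            scratch[tid, j] = 2 * my_input_array[i, 0]   # placeholder maths
            result[i, j] = scratch[tid, j] + 10          # placeholder maths
    return np.asarray(result)
```

The scratch array is now sized by the thread count rather than by INPUT_SIZE, so memory usage no longer grows with the input.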


Solution

  • This is something that Cython tries to detect, and actually gets right most of the time. If we take a more complete example code:

    import numpy as np
    from cython.parallel import prange
    
    cdef double f1(double[:,:] x, int i, int j) nogil:
        return 2*x[i,j]
    
    cdef double f2(double y) nogil:
        return y+10
    
    def example_function(double[:,:] arr_in):
        cdef double[:,:] result = np.zeros((arr_in.shape[0], arr_in.shape[1]))
        cdef double temporary_variable
        cdef int i,j
        for i in prange(arr_in.shape[0], nogil=True):
            for j in range(arr_in.shape[1]):
                temporary_variable = f1(arr_in,i,j)
                result[i,j] = f2(temporary_variable)
        return result
    

    (this is basically the same as yours, but compilable). This compiles to the C code:

    #ifdef _OPENMP
    #pragma omp for firstprivate(__pyx_v_i) lastprivate(__pyx_v_i) lastprivate(__pyx_v_j) lastprivate(__pyx_v_temporary_variable)
    #endif /* _OPENMP */
    for (__pyx_t_8 = 0; __pyx_t_8 < __pyx_t_9; __pyx_t_8++){
    

    You can see that temporary_variable is set to be thread-local. If Cython does not detect this correctly (I find it's often too keen to make variables a reduction) then my suggestion is to encapsulate (some of) the contents of the loop in a function:

    cdef double loop_contents(double[:,:] arr_in, int i, int j) nogil:
        cdef double temporary_variable
        temporary_variable = f1(arr_in,i,j)
        return f2(temporary_variable)
    

    Doing so forces temporary_variable to be local to the function (and hence to the thread).


    With respect to creating a thread-local array: I'm not 100% clear exactly what you want to do but I'll try to take a guess...

    1. I don't believe it's possible to create a thread-local memoryview.
    2. You could create a thread-local C array with malloc and free but unless you have a good understanding of C then I would not recommend it.
    3. The easiest way is to allocate a 2D array where you have one column for each thread. The array is shared, but since each thread only touches its own column that doesn't matter. A simple example:

      import numpy as np
      cimport openmp
      from cython.parallel import prange
      
      cdef double[:] f1(double[:,:] x, int i) nogil:
          return x[i,:]
      
      def example_function(double[:,:] arr_in):
          cdef double[:,:] temporary_variable = \
              np.zeros((arr_in.shape[1], openmp.omp_get_max_threads()))
          cdef int i
          for i in prange(arr_in.shape[0], nogil=True):
              temporary_variable[:, openmp.omp_get_thread_num()] = f1(arr_in, i)
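    For completeness, option 2 (a raw C buffer) follows the pattern from Cython's parallelism documentation: a variable assigned inside a `parallel()` block is private to each thread, so each thread can allocate and free its own scratch space. A rough sketch (the arithmetic is a placeholder for the real maths; NULL checks omitted):

```cython
import numpy as np
from cython.parallel import prange, parallel
from libc.stdlib cimport malloc, free

def example_function(double[:,:] arr_in):
    cdef double[:,:] result = np.zeros((arr_in.shape[0], arr_in.shape[1]))
    cdef double *scratch
    cdef Py_ssize_t i, j
    with nogil, parallel():
        # Assigned inside parallel(), so each thread gets its own pointer
        scratch = <double *> malloc(arr_in.shape[1] * sizeof(double))
        for i in prange(arr_in.shape[0]):
            for j in range(arr_in.shape[1]):
                scratch[j] = 2 * arr_in[i, j]      # placeholder maths
                result[i, j] = scratch[j] + 10     # placeholder maths
        free(scratch)  # each thread frees its own buffer
    return np.asarray(result)
```

    This avoids the shared 2D array entirely, at the cost of manual memory management.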