Having thread-local arrays in cython so that I can resize them?

I have an interval-treeish algorithm I would like to run in parallel for many queries using threads. Problem is that then each thread would need its own array, since I cannot know in advance how many hits there will be.

There are other questions like this, and the solution suggested is always to have an array of size (K, t) where K is output length and t is number of threads. This does not work for me as K might be different for each thread and each thread might need to resize the array to fit all the results it gets.

Pseudocode:

for i in prange(len(starts)):

    qs, qe, qx = starts[i], ends[i], index[i]

    results = t.search(qs, qe)

    if len(results) + nfound < len(output):
        # add result to output
    else:
        # resize array
        # then add results

Solution

An usual pattern is that every thread gets its own container, which is a trade-off between speed/complexity and memory-overhead:

there is no need to lock for access to this container, because only one thread accesses it.
there is much less overhead compared to "own container for every task (i.e. every i-value)".

After the parallel section, the data must be either collected in a final container in a post processing step (which also could happen in parallel) or the subsequent algorithms should be able to handle a collection of containers.

Here is an example using c++-vector (which already has memory management and increasing size built-in):

%%cython -+ -c=/openmp --link-args=/openmp

from cython.parallel import prange, threadid
from libcpp.vector cimport vector
cimport openmp

def calc_in_parallel(N):    
    cdef int i,k,tid
    cdef int n = N
    cdef vector[vector[int]] vecs
    # every thread gets its own container
    vecs.resize(openmp.omp_get_max_threads())
    for i in prange(n, nogil=True):  
        tid = threadid()
        for k in range(i):
            # use container of the thread
            vecs[tid].push_back(k) # dummy for calculation

    return vecs

Using omp_get_max_threads() for the number of threads will overestimate the real number of threads in many cases. It is probably more robust to set the number of threads explicitly in prange, i.e.

...
NUM_THREADS = 2
vecs.resize(NUM_THREADS)
for i in prange(n, nogil=True, num_threads = NUM_THREADS): 
...

A similar approach can be applied using pure C, but more boiler plate code (memory management) will be needed in this case.