
CUDA - dynamically reallocate more global memory in Kernel


I have a question about the following task:

"Given a two-dimensional array "a[N][M]" so N lines of length M. Each element of the array contains an random integer value between 0 and 16. Write a kernel "compact(int *a, int *listM, int *listN)" that consists of only one block of N threads, and each thread counts for one line of the array how many elements have a value of 16.

The threads write these counts into an array "num" of length N in shared memory, and then (after a barrier) one of the threads executes the prefix-sum code "PrefixSum(int *num, int N)" listed below (in the code below I explain what this code does). Finally (after another barrier), each thread "Idx" writes the row and column indices (the "x- and y-coordinates") of the elements of its row that have the value 16 into two arrays "listM" and "listN" in global memory, starting at position "num[Idx]" in those arrays. The prefix sum mentioned above exists to make this last step easier to implement."

I've written a kernel and a suitable main to test it. However, I still have a problem that I cannot solve.

The two arrays "listM" and "listN" are supposed to store the position of every 16 that occurs in the array "a[N][M]". Their size must therefore equal the total number of occurrences of 16, which can vary.

Since the exact number of elements with the value 16 is not known in advance, the amount of memory needed for "listM" and "listN" is only known while the kernel is running. Of course one could simply allocate enough memory for the maximum possible count at program start, namely N times M elements, but that would be very inefficient. Is it possible to write the kernel so that each thread dynamically enlarges the two arrays "listM" and "listN" after counting the number of elements with the value 16 in its row (by exactly that number)?

Here is my kernel:

__global__ void compact(int* a, int* listM, int* listN)
{
    int Idx = threadIdx.x;
    int elements = 0;

    __shared__ int num[N];

    // Count the 16s in this thread's row. "a" is a flat N*M array,
    // so element (Idx, i) lives at a[Idx * M + i].
    for (int i = 0; i < M; i++)
    {
        if (a[Idx * M + i] == 16)
        {
            elements++;
        }
    }
    num[Idx] = elements;

    // Here at this point, the thread knows the number of elements with the value 16
    // in its row and would need to allocate just as much extra memory in "listM"
    // and "listN". Is that possible?

    __syncthreads();

    if (Idx == 0)
    {
        // This function replaces each element of "num" with the total number of 16s
        // counted in all preceding rows (an exclusive prefix sum).
        // Example: Input: num{2,4,3,1} Output: num{0,2,6,9}
        PrefixSum(num, N);
    }

    __syncthreads();

    // The output of PrefixSum(num, N) gives each thread its starting offset in
    // "listM" and "listN"; a running counter advances from there for every 16
    // found in the row.
    int out = num[Idx];
    for (int i = 0; i < M; i++)
    {
        if (a[Idx * M + i] == 16)
        {
            listM[out] = Idx;
            listN[out] = i;
            out++;
        }
    }
}
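
For reference, PrefixSum itself is not listed here; a minimal sequential version that matches the example in the comments (num{2,4,3,1} becomes num{0,2,6,9}) could look like the sketch below. The assignment's actual code may differ (e.g. a parallel scan).

__device__ void PrefixSum(int *num, int n)
{
    // Sequential exclusive prefix sum, run by a single thread:
    // each element is replaced by the sum of all elements before it.
    int sum = 0;
    for (int i = 0; i < n; i++)
    {
        int count = num[i];
        num[i] = sum;
        sum += count;
    }
}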

Solution

  • Is it possible to write the kernel so that each thread dynamically enlarges the two arrays "listM" and "listN" after counting the number of elements with the value 16 in its row (by exactly that number)?

    It's not possible for CUDA device code to enlarge an existing allocation that was created with host-side cudaMalloc, cudaMallocManaged, cudaHostAlloc, or similar.

    It is possible for CUDA device code to create new allocations using in-kernel new or malloc; however, data from such allocations cannot be transferred directly back to the host (cudaMemcpy cannot read from the device heap). Transferring it back would require a host-created allocation into which the data could first be copied, which brings you back to the original issue.
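
    As an illustration, here is a hypothetical demo kernel (not from the original question) showing an in-kernel allocation; the returned pointer lives on the device heap and is usable by device code, but not by host-side copies:

    // Hypothetical sketch: device code can call malloc, but the resulting
    // device-heap pointer cannot be used with host-side cudaMemcpy.
    __global__ void allocDemo(int **result, int n)
    {
        int *buf = (int *)malloc(n * sizeof(int)); // device-heap allocation
        if (buf != NULL)
        {
            for (int i = 0; i < n; i++)
                buf[i] = i;
        }
        *result = buf; // other kernels can dereference this; the host cannot
    }

    Note that the device heap defaults to 8 MB; larger in-kernel allocations require raising cudaLimitMallocHeapSize via cudaDeviceSetLimit before launching the kernel.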

    Therefore there really is no convenient way to do this. Your choices are:

    1. (Over-)allocate the needed sizes based on the maximum possible size (here, N times M elements per array).
    2. Run the algorithm in two passes: a first kernel invocation only determines the needed size and returns it to the host; the host then allocates that size and passes the allocation to a second invocation, which does the actual desired work (a sketch follows below).
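
    A minimal sketch of option 2, assuming N and M are compile-time constants and d_a is the device copy of the array; the helper kernel countOnly and the launch configuration are illustrative, not part of the original question:

    // Hypothetical first-pass kernel: each thread counts the 16s in its row
    // and the per-row counts are summed atomically into *total.
    __global__ void countOnly(int *a, int *total)
    {
        int Idx = threadIdx.x;
        int elements = 0;
        for (int i = 0; i < M; i++)
        {
            if (a[Idx * M + i] == 16)
                elements++;
        }
        atomicAdd(total, elements);
    }

    // Host side: pass 1 determines the size, pass 2 does the real work.
    int *d_total, total;
    cudaMalloc(&d_total, sizeof(int));
    cudaMemset(d_total, 0, sizeof(int));
    countOnly<<<1, N>>>(d_a, d_total);
    cudaMemcpy(&total, d_total, sizeof(int), cudaMemcpyDeviceToHost);

    int *d_listM, *d_listN;
    cudaMalloc(&d_listM, total * sizeof(int));
    cudaMalloc(&d_listN, total * sizeof(int));
    compact<<<1, N>>>(d_a, d_listM, d_listN);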

    A "possible" third approach would be:

    1. Run the algorithm just once, and have the kernel create in-kernel allocations to provide the needed additional space. This space is not accessible to host-side operations. The kernel would also return the size and/or arrangement of these allocations.

    2. Based on the returned size/arrangement of the device-side allocations, the host would allocate new device memory of the needed size (via cudaMalloc).

    3. The host would then launch a "copy kernel" that copies the data from the in-kernel allocations of step 1 into the host-allocated device buffers from step 2.

    4. The host would then copy the data from those device buffers to host memory, e.g. with cudaMemcpy (a sketch of this flow follows below).
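
    A heavily simplified sketch of that flow, with hypothetical kernel names (produce, copyOut); error checking and the real size-discovery logic are omitted:

    // Step 1 (hypothetical): the kernel allocates on the device heap and
    // publishes the pointer and element count through global memory.
    __global__ void produce(int **buf, int *count)
    {
        int n = 10; // stand-in for a size discovered at runtime
        *buf = (int *)malloc(n * sizeof(int));
        for (int i = 0; i < n; i++)
            (*buf)[i] = i;
        *count = n;
    }

    // Step 3 (hypothetical): a copy kernel moves the device-heap data into a
    // buffer the host allocated with cudaMalloc, which cudaMemcpy can access.
    __global__ void copyOut(int **buf, int *dst, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = (*buf)[i];
    }

    // Host side: steps 2 and 4.
    int **d_buf; int *d_count, n;
    cudaMalloc(&d_buf, sizeof(int *));
    cudaMalloc(&d_count, sizeof(int));
    produce<<<1, 1>>>(d_buf, d_count);
    cudaMemcpy(&n, d_count, sizeof(int), cudaMemcpyDeviceToHost);

    int *d_out;
    cudaMalloc(&d_out, n * sizeof(int));
    copyOut<<<1, 1>>>(d_buf, d_out, n);

    int *h_out = (int *)malloc(n * sizeof(int));
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);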

    That's an extreme level of complexity for such a trivial problem as you have outlined, where the obvious solution is just to overallocate the needed space and be done with it.